Monocular depth estimation aims to infer a dense depth map from a single image, which is a fundamental and prevalent task in computer vision. Many previous works have achieved impressive depth estimation results through carefully designed network structures, but they usually ignore planar information and therefore perform poorly in the low-texture areas of indoor scenes. In this paper, we propose Plane2Depth, which adaptively exploits plane information to improve depth prediction within a hierarchical framework. Specifically, in the proposed plane guided depth generator (PGDG), we design a set of plane queries as prototypes to softly model the planes in the scene and to predict per-pixel plane coefficients. The predicted plane coefficients can then be converted into metric depth values with the pinhole camera model. In the proposed adaptive plane query aggregation (APQA) module, we introduce a novel feature interaction approach that improves the aggregation of multi-scale plane features in a top-down manner. Extensive experiments show that our method achieves outstanding performance, especially in low-texture or repetitive areas. Furthermore, under the same backbone network, our method outperforms the state-of-the-art methods on the NYU-Depth-v2 dataset, achieves competitive results with state-of-the-art methods on the KITTI dataset, and generalizes effectively to unseen scenes.
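The conversion from per-pixel plane coefficients to metric depth follows directly from the pinhole camera model. A minimal NumPy sketch is given below; it assumes a standard plane parametrization (unit normal n and offset d with n·X = d for 3D points X on the plane), which is one common choice and may differ from the paper's exact coefficient format:

```python
import numpy as np

def plane_to_depth(normals, offsets, K):
    """Convert per-pixel plane coefficients to a metric depth map.

    normals: (H, W, 3) per-pixel plane normals (assumed unit length).
    offsets: (H, W) per-pixel plane offsets d in n.X = d.
    K: (3, 3) pinhole camera intrinsics.
    """
    H, W = offsets.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T  # back-projected rays K^{-1} [u, v, 1]^T
    # A 3D point at depth z along a ray is X = z * ray; substituting into
    # the plane equation n.X = d gives z = d / (n . ray).
    denom = np.sum(normals * rays, axis=-1)
    # Clamp the denominator to avoid division by zero for near-degenerate
    # planes (rays almost parallel to the plane).
    return offsets / np.clip(denom, 1e-6, None)

# Example: a fronto-parallel plane (normal along +z, offset 2.0) yields a
# constant depth of 2.0 over the whole image.
K = np.array([[500.0, 0.0, 32.0],
              [0.0, 500.0, 24.0],
              [0.0, 0.0, 1.0]])
normals = np.zeros((48, 64, 3))
normals[..., 2] = 1.0
offsets = np.full((48, 64), 2.0)
depth = plane_to_depth(normals, offsets, K)  # (48, 64), all 2.0
```

For slanted planes, the denominator n·ray varies across pixels, so the recovered depth varies smoothly over the plane, which is what makes this parametrization well suited to low-texture planar regions.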
The overall architecture of Plane2Depth. We use a set of plane queries to predict plane coefficients through the E-MLP, N-MLP, and T-MLP heads. The predicted plane coefficients are then converted to metric depth maps through the pinhole camera model. For consistent query prediction, we adopt the APQA module to aggregate multi-scale image features and adaptively modulate them via the AF modulators.
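The top-down aggregation described above can be sketched as plane queries attending to image features scale by scale, coarsest first, so that coarse context conditions the queries before finer scales are processed. The sketch below is illustrative only: the attention form, shapes, and the residual update are assumptions, not the paper's APQA implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, feats):
    """One round of query/feature interaction (assumed form).

    queries: (N, C) plane queries; feats: (HW, C) flattened image features.
    Scaled dot-product attention followed by a residual update of the queries.
    """
    attn = softmax(queries @ feats.T / np.sqrt(queries.shape[-1]))
    return queries + attn @ feats

def aggregate_top_down(queries, multi_scale_feats):
    # multi_scale_feats is ordered coarse -> fine; each scale refines the
    # same set of plane queries, keeping the queries consistent across scales.
    for feats in multi_scale_feats:
        queries = cross_attend(queries, feats)
    return queries

rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 64))                      # 16 plane queries, dim 64
feats = [rng.normal(size=(s * s, 64)) for s in (8, 16, 32)]  # coarse -> fine
out = aggregate_top_down(queries, feats)                 # (16, 64)
```

Sharing one query set across all scales is what lets the coarse-scale plane hypotheses constrain the fine-scale predictions, rather than each scale predicting planes independently.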
The first column presents the original image along with its corresponding ground-truth depth. The subsequent columns show the depth maps and error maps predicted by each model. The red boxes mark areas of high curvature (such as car surfaces, cylindrical tree trunks, and leaves). Our approach yields satisfactory results in outdoor scenes thanks to its smooth modeling of planes, particularly in prominent planar areas.
@article{liu2024plane2depth,
  title={Plane2Depth: Hierarchical Adaptive Plane Guidance for Monocular Depth Estimation},
  author={Liu, Li and Zhu, Ruijie and Deng, Jiacheng and Song, Ziyang and Yang, Wenfei and Zhang, Tianzhu},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2024},
  publisher={IEEE}
}