Plane2Depth: Hierarchical Adaptive Plane Guidance for Monocular Depth Estimation

School of Information Science and Technology,
University of Science and Technology of China
TCSVT 2024
[Teaser figure]

We present an example of “visual deception” in (a): the large color discrepancies mislead the network into predicting an incorrect depth map, and our method mitigates this issue by exploiting plane information. In (b), pixels within the yellow bounding box correspond to different depths but share the same surface normal (identical colors indicate identical values). Because the surface-normal ground truth is generated by conventional algorithms, a slight offset remains.

Abstract

Monocular depth estimation aims to infer a dense depth map from a single image, which is a fundamental and prevalent task in computer vision. Many previous works have shown impressive depth estimation results through carefully designed network structures, but they usually ignore planar information and therefore perform poorly in low-texture areas of indoor scenes. In this paper, we propose Plane2Depth, which adaptively utilizes plane information to improve depth prediction within a hierarchical framework. Specifically, in the proposed plane guided depth generator (PGDG), we design a set of plane queries as prototypes to softly model planes in the scene and predict per-pixel plane coefficients. The predicted plane coefficients can then be converted into metric depth values with the pinhole camera model. In the proposed adaptive plane query aggregation (APQA) module, we introduce a novel feature interaction approach to improve the aggregation of multi-scale plane features in a top-down manner. Extensive experiments show that our method achieves outstanding performance, especially in low-texture or repetitive areas. Furthermore, under the same backbone network, our method outperforms the state-of-the-art methods on the NYU-Depth-v2 dataset, achieves results competitive with state-of-the-art methods on the KITTI dataset, and generalizes effectively to unseen scenes.
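For reference, the conversion from plane coefficients to metric depth under the pinhole camera model can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each pixel's plane is parameterized by a unit normal n and a perpendicular offset d (so that n·P = d for 3D points P on the plane), and the helper name plane_to_depth is ours.

import numpy as np

def plane_to_depth(normals, offsets, K):
    # normals: (H, W, 3) per-pixel unit plane normals
    # offsets: (H, W) per-pixel plane offsets d (assumed parameterization)
    # K: (3, 3) camera intrinsics
    # A pixel (u, v) back-projects along the ray r = K^{-1} [u, v, 1]^T,
    # so a point on the plane n . P = d at that pixel has depth z = d / (n . r).
    H, W = offsets.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                           # (H, W, 3) back-projected rays
    denom = np.sum(normals * rays, axis=-1)                   # n . r per pixel
    depth = offsets / np.maximum(denom, 1e-6)                 # guard against near-zero denominators
    return depth

# Toy usage: a fronto-parallel plane 3 m away yields a constant depth map.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
normals = np.zeros((480, 640, 3)); normals[..., 2] = 1.0      # normals along +z
offsets = np.full((480, 640), 3.0)                            # plane at z = 3 m
print(plane_to_depth(normals, offsets, K)[240, 320])          # ~3.0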

Framework

The overall architecture of Plane2Depth. We use a set of plane queries to predict plane coefficients through the E-MLP, N-MLP, and T-MLP heads. The predicted plane coefficients are then converted to metric depth maps through the pinhole camera model. For consistent query prediction, we adopt the APQA module to aggregate multi-scale image features and adaptively modulate them via AF modulators.

[Framework figure]
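To illustrate the prototype idea, the sketch below decodes plane queries into per-query normals and offsets and softly assigns every pixel to the queries; the per-pixel plane coefficients are a weighted mixture of the query coefficients. This is a minimal sketch, not the paper's E-MLP/N-MLP/T-MLP design, and the names PlaneQueryHead, to_normal, and to_offset are our own assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PlaneQueryHead(nn.Module):
    # Minimal sketch of prototype-style plane queries: learnable queries are
    # decoded into plane normals and offsets, and pixels are softly assigned.
    def __init__(self, num_queries=64, dim=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # plane prototypes
        self.to_normal = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))
        self.to_offset = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feat):                                  # feat: (B, C, H, W) image features
        B, C, H, W = feat.shape
        q = self.queries                                      # (N, C)
        normals = F.normalize(self.to_normal(q), dim=-1)      # (N, 3) unit plane normals
        offsets = self.to_offset(q)                           # (N, 1) plane offsets
        # Soft per-pixel assignment of pixels to the plane prototypes.
        logits = torch.einsum('bchw,nc->bnhw', feat, q) / C ** 0.5
        w = logits.softmax(dim=1)                             # (B, N, H, W) assignment weights
        # Per-pixel plane coefficients as a weighted mix of the query coefficients.
        pix_n = torch.einsum('bnhw,nk->bkhw', w, normals)     # (B, 3, H, W)
        pix_d = torch.einsum('bnhw,nk->bkhw', w, offsets)     # (B, 1, H, W)
        return pix_n, pix_d

# Usage: feat = torch.randn(2, 256, 60, 80); n, d = PlaneQueryHead()(feat)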

Comparison on the NYU-Depth-v2 dataset

[Qualitative comparison figure (NYU-Depth-v2)]

Please zoom in for better visualization.

Comparison on the NYU-Depth-v2 dataset (point cloud)

[Point cloud comparison figure (NYU-Depth-v2)]

Please zoom in for better visualization.

Comparison on the KITTI dataset

The first column presents the original image and its corresponding ground-truth depth. The subsequent columns show the depth maps and error maps predicted by each model. The red boxes mark areas of high curvature (such as car surfaces, cylindrical tree trunks, and leaves). Thanks to the smooth modeling of planes, our approach yields satisfactory results in outdoor scenes, particularly in prominent planar areas.

[Qualitative comparison figure (KITTI)]

Please zoom in for better visualization.

Comparison on unseen scenes

[Qualitative comparison figure (unseen scenes)]

Please zoom in for better visualization.

BibTeX


@article{liu2024plane2depth,
  title={Plane2Depth: Hierarchical Adaptive Plane Guidance for Monocular Depth Estimation},
  author={Liu, Li and Zhu, Ruijie and Deng, Jiacheng and Song, Ziyang and Yang, Wenfei and Zhang, Tianzhu},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2024},
  publisher={IEEE}
}