Existing monocular depth estimation methods have achieved satisfactory performance on individual datasets. However, these methods are usually trained and tested on a single dataset, which makes it difficult for them to generalize to other scenarios. To learn diverse scene priors from multiple datasets, we propose a hierarchical framework with adaptive bins for robust monocular depth estimation, which consists of two critical components: a group-wise query generator to assign hierarchical bins and a correlation-aware transformer decoder to generate adaptive bin features. The proposed HA-Bins enjoys several merits. First, the group-wise query generator progressively increases the number of bin queries for multi-scale image features, resulting in a hierarchical bin distribution robust to diverse scenarios. Second, the correlation-aware transformer decoder refines the correlation between bin queries and image features, effectively improving adaptive image feature aggregation. We visualize the query activation maps on the NYU-Depth V2 dataset, showing that the proposed network effectively suppresses depth-irrelevant regions. Experiments on the KITTI, Sintel, and RabbitAI benchmarks show that, without any fine-tuning, our model jointly trained on multiple datasets achieves performance competitive with the state of the art and solid robustness across diverse scenarios. In addition, our method won second place in the Robust Vision Challenge 2022, which targets challenging scenarios with different characteristics.
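For background, adaptive-bins depth estimation discretizes the depth range into per-image bins and regresses depth as a probability-weighted combination of the bin centers. The sketch below is a minimal illustration of that general idea in PyTorch, not the HA-Bins implementation: the tensor names, shapes, and depth range are assumptions for illustration only.

# Minimal sketch of adaptive-bin depth regression (illustrative, not the authors' code).
# Assumed inputs: bin_logits (B, N) predicted from bin queries, and
# pixel_logits (B, N, H, W) from the correlation between bin features and image
# features; min_depth/max_depth are dataset-dependent assumptions.
import torch
import torch.nn.functional as F

def depth_from_adaptive_bins(bin_logits, pixel_logits, min_depth=1e-3, max_depth=10.0):
    # Normalize bin widths so they partition the [min_depth, max_depth] range.
    widths = F.softmax(bin_logits, dim=1)                 # (B, N)
    widths = widths * (max_depth - min_depth)             # scale to the depth range
    edges = torch.cumsum(widths, dim=1) + min_depth       # right bin edges, (B, N)
    centers = edges - 0.5 * widths                        # bin centers, (B, N)

    # Per-pixel probability distribution over the N adaptive bins.
    probs = F.softmax(pixel_logits, dim=1)                # (B, N, H, W)

    # Final depth is the probability-weighted sum of bin centers.
    depth = (probs * centers[..., None, None]).sum(dim=1, keepdim=True)  # (B, 1, H, W)
    return depth

# Toy usage with random tensors.
if __name__ == "__main__":
    B, N, H, W = 2, 64, 60, 80
    depth = depth_from_adaptive_bins(torch.randn(B, N), torch.randn(B, N, H, W))
    print(depth.shape)  # torch.Size([2, 1, 60, 80])

In a hierarchical variant, this step would be repeated with a progressively larger number of bin queries N at each feature scale, as described in the abstract.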
@ARTICLE{zhu2023habins,
  author={Zhu, Ruijie and Song, Ziyang and Liu, Li and He, Jianfeng and Zhang, Tianzhu and Zhang, Yongdong},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={HA-Bins: Hierarchical Adaptive Bins for Robust Monocular Depth Estimation Across Multiple Datasets},
  year={2024},
  volume={34},
  number={6},
  pages={4354--4366},
  doi={10.1109/TCSVT.2023.3335316}
}