ER-Depth: Enhancing the Robustness of Self-Supervised Monocular Depth Estimation in Challenging Scenes

University of Science and Technology of China
*Indicates Equal Contribution

The first-stage training framework of ER-Depth. In the first stage, we construct an image triplet consisting of one standard sample and two augmented challenging samples. For the standard sample, the photometric loss provides the supervision signal. A perturbation-invariant depth consistency loss then constrains the depth predictions under different perturbations to agree, propagating reliable supervision to the challenging samples.
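The consistency constraint above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact loss: the function name, the relative-difference form, and the epsilon are assumptions chosen for clarity.

```python
import numpy as np

def depth_consistency_loss(d_std, d_aug, eps=1e-7):
    # Mean absolute relative difference between the depth predicted for a
    # perturbed (challenging) view and the depth predicted for the standard
    # view of the same image; zero when the two predictions agree everywhere.
    return float(np.mean(np.abs(d_aug - d_std) / (d_std + eps)))

d_std = np.full((4, 4), 10.0)                 # depth map for the standard sample
d_aug = d_std * 1.02                          # slightly perturbed prediction
loss = depth_consistency_loss(d_std, d_aug)   # small positive value
```

Minimizing such a term pushes the network to predict the same depth regardless of the perturbation, so the photometric supervision available on the clean view effectively transfers to the augmented views.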


The second-stage training framework of ER-Depth. In the second stage, we leverage the Mean Teacher paradigm to generate pseudo-labels for self-distillation. To improve the overall quality of the pseudo-labels, we propose a depth-consistency-based filter (DC-Filter), which selects pseudo-labels that are robust to various perturbations, and a geometric-consistency-based filter (GC-Filter), which selects pseudo-labels that are sufficiently accurate and reliable.
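The two filters can be illustrated with per-pixel masks. This NumPy sketch is an assumption-laden simplification: the function names, the relative-agreement test in `dc_filter`, the thresholds, and the use of a photometric reprojection error as the geometric criterion are all illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def dc_filter(d_a, d_b, tau=0.05, eps=1e-7):
    # Keep pixels whose pseudo-depths under two different perturbations
    # agree within a relative threshold tau (depth-consistency filter).
    rel_diff = np.abs(d_a - d_b) / (np.maximum(d_a, d_b) + eps)
    return rel_diff < tau

def gc_filter(reproj_error, tau=0.2):
    # Keep pixels whose photometric reprojection error is below tau,
    # i.e. pixels that are geometrically consistent across views.
    return reproj_error < tau

d_a = np.array([[10.0, 10.0], [10.0, 10.0]])   # teacher depth, perturbation A
d_b = np.array([[10.1, 15.0], [10.2, 10.0]])   # teacher depth, perturbation B
err = np.array([[0.05, 0.05], [0.50, 0.05]])   # reprojection error per pixel

# Distill only where both filters accept the pseudo-label.
mask = dc_filter(d_a, d_b) & gc_filter(err)
```

Combining the two masks means a pseudo-label supervises the student only where it is both perturbation-robust and geometrically plausible, which suppresses noisy labels in challenging regions.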

Abstract

Self-supervised monocular depth estimation is of significant importance to autonomous driving and robotics. However, existing methods are typically trained and evaluated on clear, sunny datasets, overlooking the various adverse conditions commonly encountered in real-world applications, such as rainy weather, low visibility, and motion blur. As a result, they often struggle in challenging scenarios and produce artifacts. To address this issue, we propose ER-Depth, a novel two-stage self-supervised framework designed for robust depth estimation. In the first stage, we propose perturbation-invariant depth consistency regularization to propagate reliable supervision from standard to challenging scenes. In the second stage, we adopt the Mean Teacher paradigm for self-distillation and present a novel consistency-based pseudo-label filtering strategy to improve the quality of pseudo-labels. Extensive experiments demonstrate that our method exhibits exceptional robustness in challenging scenarios while maintaining high performance in standard scenes, significantly outperforming existing state-of-the-art methods on the challenging KITTI-C, DrivingStereo, and NuScenes-Night benchmarks. Project page: https://ruijiezhu94.github.io/ERDepth_page.
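For readers unfamiliar with the Mean Teacher paradigm mentioned above, the teacher is typically maintained as an exponential moving average (EMA) of the student. The sketch below shows the standard EMA update; the dictionary-of-arrays representation and the momentum value are assumptions for illustration, not details from this work.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    # Standard Mean-Teacher update: each teacher weight tracks an
    # exponential moving average of the corresponding student weight.
    return {name: momentum * w_t + (1.0 - momentum) * student[name]
            for name, w_t in teacher.items()}

teacher = {"w": np.array([1.0, 1.0])}
student = {"w": np.array([0.0, 0.0])}
teacher = ema_update(teacher, student, momentum=0.9)  # w -> [0.9, 0.9]
```

Because the teacher averages the student over many steps, its pseudo-labels are smoother and more stable than any single student snapshot, which is what makes it a useful source of supervision for self-distillation.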

BibTeX

@article{zhu2023ecdepth,
  title={EC-Depth: Exploring the consistency of self-supervised monocular depth estimation in challenging scenes},
  author={Song, Ziyang and Zhu, Ruijie and Wang, Chuxin and Deng, Jiacheng and He, Jianfeng and Zhang, Tianzhu},
  journal={arXiv preprint arXiv:2310.08044},
  year={2023}
}