Self-Supervised Monocular Depth Estimation With Positional Shift Depth Variance and Adaptive Disparity Quantization
Recently, attempts to learn the underlying 3D structures of a scene from monocular videos in a fully self-supervised fashion have drawn much attention. One of the most challenging aspects of this task is to handle independently moving objects as they break the rigid-scene assumption. In this paper,...
Veröffentlicht in: | IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 33(2024) vom: 18., Seite 2074-2089 |
---|---|
1. Verfasser: | |
Weitere Verfasser: | , |
Format: | Online-Aufsatz |
Sprache: | English |
Veröffentlicht: |
2024
|
Zugriff auf das übergeordnete Werk: | IEEE transactions on image processing : a publication of the IEEE Signal Processing Society |
Schlagworte: | Journal Article |
Zusammenfassung: | Recently, attempts to learn the underlying 3D structures of a scene from monocular videos in a fully self-supervised fashion have drawn much attention. One of the most challenging aspects of this task is to handle independently moving objects as they break the rigid-scene assumption. In this paper, we show for the first time that pixel positional information can be exploited to learn SVDE (Single View Depth Estimation) from videos. The proposed moving object (MO) masks, which are induced by the depth variance to shifted positional information (SPI) and are referred to as 'SPIMO' masks, are highly robust and consistently remove independently moving objects from the scenes, allowing for robust and consistent learning of SVDE from videos. Additionally, we introduce a new adaptive quantization scheme that assigns the best per-pixel quantization curve for depth discretization, improving the fine granularity and accuracy of the final aggregated depth maps. Finally, we employ existing boosting techniques in a new way that self-supervises moving object depths further. With these features, our pipeline is robust against moving objects and generalizes well to high-resolution images, even when trained with small patches, yielding state-of-the-art (SOTA) results with four- to eight-fold fewer parameters than the previous SOTA techniques that learn from videos. We present extensive experiments on KITTI and CityScapes that show the effectiveness of our method |
---|---|
Beschreibung: | Date Revised 18.03.2024 published: Print-Electronic Citation Status PubMed-not-MEDLINE |
ISSN: | 1941-0042 |
DOI: | 10.1109/TIP.2024.3374045 |