CI3D : Context Interaction for Dynamic Objects and Static Map Elements in 3D Driving Scenes

Multi-view 3D visual perception, including 3D object detection and bird's-eye-view (BEV) map segmentation, is essential for autonomous driving. However, there has been little discussion about 3D context attention between dynamic objects and static elements with multi-view camera inputs, due to th...

Bibliographic Details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 33(2024), day 13, pages 2867-2879
Main author: Cai, Feipeng (Author)
Other authors: Chen, Hao; Deng, Liuyuan
Format: Online article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000caa a22002652 4500
001 NLM36581573X
003 DE-627
005 20240416232406.0
007 cr uuu---uuuuu
008 231226s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2023.3340607  |2 doi 
028 5 2 |a pubmed24n1377.xml 
035 |a (DE-627)NLM36581573X 
035 |a (NLM)38090848 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Cai, Feipeng  |e verfasserin  |4 aut 
245 1 0 |a CI3D  |b Context Interaction for Dynamic Objects and Static Map Elements in 3D Driving Scenes 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 15.04.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Multi-view 3D visual perception, including 3D object detection and bird's-eye-view (BEV) map segmentation, is essential for autonomous driving. However, there has been little discussion about 3D context attention between dynamic objects and static elements with multi-view camera inputs, due to the challenging nature of recovering the 3D spatial information from images and performing effective 3D context interaction. 3D context information is expected to provide more cues to enhance 3D visual perception for autonomous driving. We thus propose a new transformer-based framework named CI3D in an attempt to implicitly model 3D context interaction between dynamic objects and static map elements. To achieve this, we use dynamic object queries and static map queries to gather information from multi-view image features, which are represented sparsely in 3D space. Moreover, a dynamic 3D position encoder is utilized to precisely generate queries' positional embeddings. With accurate positional embeddings, the queries effectively aggregate 3D context information via a multi-head attention mechanism to model 3D context interaction. We further reveal that sparse supervision signals from the limited number of queries result in rough and vague image features. To overcome this challenge, we introduce a panoptic segmentation head as an auxiliary task and a 3D-to-2D deformable cross-attention module, greatly enhancing the robustness of spatial feature learning and sampling. Our approach has been extensively evaluated on two large-scale datasets, nuScenes and Waymo, and significantly outperforms the baseline method on both benchmarks. 
650 4 |a Journal Article 
700 1 |a Chen, Hao  |e verfasserin  |4 aut 
700 1 |a Deng, Liuyuan  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 33(2024) vom: 13., Seite 2867-2879  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnns 
773 1 8 |g volume:33  |g year:2024  |g day:13  |g pages:2867-2879 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2023.3340607  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 33  |j 2024  |b 13  |h 2867-2879