Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progr...

Bibliographic details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 46(2024), no. 12 (22 Dec.), pages 8517-8533
Main author: Ding, Runyu (Author)
Other authors: Yang, Jihan, Xue, Chuhui, Zhang, Wenqing, Bai, Song, Qi, Xiaojuan
Format: Online article
Language: English
Published: 2024
Access to the collection: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000caa a22002652c 4500
001 NLM373312407
003 DE-627
005 20250306065608.0
007 cr uuu---uuuuu
008 240608s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2024.3410324  |2 doi 
028 5 2 |a pubmed25n1243.xml 
035 |a (DE-627)NLM373312407 
035 |a (NLM)38843054 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Ding, Runyu  |e verfasserin  |4 aut 
245 1 0 |a Lowis3D  |b Language-Driven Open-World Instance-Level 3D Scene Understanding 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 08.11.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods to learn semantic-aware embeddings that exploit the 3D geometry between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which involves training object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization capabilities of instance grouping and, thus, the ability to accurately locate novel objects. We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g. 34.5% ∼ 65.3%), instance segmentation (e.g. 21.8% ∼ 54.0%), and panoptic segmentation (e.g. 14.7% ∼ 43.3%). Code will be available 
650 4 |a Journal Article 
700 1 |a Yang, Jihan  |e verfasserin  |4 aut 
700 1 |a Xue, Chuhui  |e verfasserin  |4 aut 
700 1 |a Zhang, Wenqing  |e verfasserin  |4 aut 
700 1 |a Bai, Song  |e verfasserin  |4 aut 
700 1 |a Qi, Xiaojuan  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 46(2024), 12 vom: 22. Dez., Seite 8517-8533  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnas 
773 1 8 |g volume:46  |g year:2024  |g number:12  |g day:22  |g month:12  |g pages:8517-8533 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2024.3410324  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 46  |j 2024  |e 12  |b 22  |c 12  |h 8517-8533