To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Recently, there has been increased research interest in embodied artificial intelligence (EAI), in which an agent learns to perform a specific task while dynamically interacting with the surrounding 3D environment. A new challenge here is that many unseen objects may appear due to the...

Detailed description

Bibliographic details
Published in:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 33 (2024), day 18, pages 5370-5381
First author:	Su, Ke (author)
Other authors:	Zhang, Xingxing, Zhang, Siyang, Zhu, Jun, Zhang, Bo
Format:	Online article
Language:	English
Published:	2024
Access to the parent work:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subject headings:	Journal Article


LEADER	01000caa a22002652c 4500
001	NLM377795917
003	DE-627
005	20250306162303.0
007	cr uuu---uuuuu
008	240919s2024 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TIP.2024.3459800 \|2 doi
028	5	2	\|a pubmed25n1258.xml
035			\|a (DE-627)NLM377795917
035			\|a (NLM)39292596
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Su, Ke \|e verfasserin \|4 aut
245	1	3	\|a To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training
264		1	\|c 2024
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Revised 03.10.2024
500			\|a published: Print-Electronic
500			\|a Citation Status PubMed-not-MEDLINE
520			\|a Recently, there has been increased research interest in embodied artificial intelligence (EAI), in which an agent learns to perform a specific task while dynamically interacting with the surrounding 3D environment. A new challenge here is that many unseen objects may appear due to the increased number of object categories in 3D scenes, which makes it necessary to develop models with strong zero-shot generalization ability to new objects. Existing work tries to achieve this goal by providing embodied agents with massive high-quality human annotations closely related to the task to be learned, which is too costly in practice. Inspired by recent advances in pre-trained models for 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training that can encode common sense as general prior knowledge. To further improve its performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where the task-specific knowledge is learned from iterative message passing. Our method improves a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% in answer accuracy on the MP3D-EQA dataset, which consists of many real-world scenes with a large number of new objects during testing) and achieves new state-of-the-art performance.
650		4	\|a Journal Article
700	1		\|a Zhang, Xingxing \|e verfasserin \|4 aut
700	1		\|a Zhang, Siyang \|e verfasserin \|4 aut
700	1		\|a Zhu, Jun \|e verfasserin \|4 aut
700	1		\|a Zhang, Bo \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society \|d 1992 \|g 33(2024) vom: 18., Seite 5370-5381 \|w (DE-627)NLM09821456X \|x 1941-0042 \|7 nnas
773	1	8	\|g volume:33 \|g year:2024 \|g day:18 \|g pages:5370-5381
856	4	0	\|u http://dx.doi.org/10.1109/TIP.2024.3459800 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d 33 \|j 2024 \|b 18 \|h 5370-5381