Progressive Language-Customized Visual Feature Learning for One-Stage Visual Grounding

Visual grounding is the task of localizing an object described by a sentence in an image. Conventional visual grounding methods extract visual and linguistic features in isolation and then perform cross-modal interaction in a post-fusion manner. We argue that this post-fusion mechanism does not fully util...

Bibliographic Details
Published in:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 31(2022), day 16, pages 4266-4277
Main author:	Liao, Yue (Author)
Other authors:	Zhang, Aixi, Chen, Zhiyuan, Hui, Tianrui, Liu, Si
Format:	Online article
Language:	English
Published:	2022
Access to parent work:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects:	Journal Article


LEADER	01000naa a22002652 4500
001	NLM342304720
003	DE-627
005	20231226013808.0
007	cr uuu---uuuuu
008	231226s2022 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TIP.2022.3181516 \|2 doi
028	5	2	\|a pubmed24n1140.xml
035			\|a (DE-627)NLM342304720
035			\|a (NLM)35709109
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Liao, Yue \|e verfasserin \|4 aut
245	1	0	\|a Progressive Language-Customized Visual Feature Learning for One-Stage Visual Grounding
264		1	\|c 2022
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Completed 01.07.2022
500			\|a Date Revised 01.07.2022
500			\|a published: Print-Electronic
500			\|a Citation Status MEDLINE
520			\|a Visual grounding is the task of localizing an object described by a sentence in an image. Conventional visual grounding methods extract visual and linguistic features in isolation and then perform cross-modal interaction in a post-fusion manner. We argue that this post-fusion mechanism does not fully utilize the information in the two modalities. Instead, it is more desirable to perform cross-modal interaction during the extraction of the visual and linguistic features. In this paper, we propose a language-customized visual feature learning mechanism in which linguistic information guides the extraction of visual features from the very beginning. We instantiate the mechanism as a one-stage framework named Progressive Language-customized Visual feature learning (PLV). Our proposed PLV consists of a Progressive Language-customized Visual Encoder (PLVE) and a grounding module. We customize the visual features with linguistic guidance at each stage of the PLVE by means of Channel-wise Language-guided Interaction Modules (CLIM). Our proposed PLV outperforms conventional state-of-the-art methods by large margins across five visual grounding datasets without pre-training on object detection datasets, while achieving real-time speed. The source code is available in the supplementary material.
650		4	\|a Journal Article
700	1		\|a Zhang, Aixi \|e verfasserin \|4 aut
700	1		\|a Chen, Zhiyuan \|e verfasserin \|4 aut
700	1		\|a Hui, Tianrui \|e verfasserin \|4 aut
700	1		\|a Liu, Si \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society \|d 1992 \|g 31(2022) vom: 16., Seite 4266-4277 \|w (DE-627)NLM09821456X \|x 1941-0042 \|7 nnns
773	1	8	\|g volume:31 \|g year:2022 \|g day:16 \|g pages:4266-4277
856	4	0	\|u http://dx.doi.org/10.1109/TIP.2022.3181516 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d 31 \|j 2022 \|b 16 \|h 4266-4277