Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various visual and language tasks. Most previous works either simply learn coarse-grained representations of the overall image and text, or elaborately estab...

Bibliographic Details
Published in:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 32(2023), dated 07, pages 3622-3633
Main author:	Liu, Chong (Author)
Other authors:	Zhang, Yuqi, Wang, Hongsong, Chen, Weihua, Wang, Fan, Huang, Yan, Shen, Yi-Dong, Wang, Liang
Format:	Online article
Language:	English
Published:	2023
Access to the parent work:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects:	Journal Article


LEADER	01000caa a22002652c 4500
001	NLM35840536X
003	DE-627
005	20250304223133.0
007	cr uuu---uuuuu
008	231226s2023 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TIP.2023.3286710 \|2 doi
028	5	2	\|a pubmed25n1194.xml
035			\|a (DE-627)NLM35840536X
035			\|a (NLM)37339023
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Liu, Chong \|e verfasserin \|4 aut
245	1	0	\|a Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training
264		1	\|c 2023
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Completed 04.07.2023
500			\|a Date Revised 04.07.2023
500			\|a published: Print-Electronic
500			\|a Citation Status PubMed-not-MEDLINE
520			\|a Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various visual and language tasks. Most previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations for each modality are important for image-text retrieval but almost neglected. As a result, such previous works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans simultaneously pay attention to the entire sample and regional elements to understand the semantic content. To this end, a Token-Guided Dual Transformer (TGDT) architecture which consists of two homogeneous branches for image and text modalities, respectively, is proposed for image-text retrieval. The TGDT incorporates both coarse- and fine-grained retrievals into a unified framework and beneficially leverages the advantages of both retrieval approaches. A novel training objective called Consistent Multimodal Contrastive (CMC) loss is proposed accordingly to ensure the intra- and inter-modal semantic consistencies between images and texts in the common embedding space. Equipped with a two-stage inference method based on the mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performances with extremely low inference time when compared with representative recent approaches. Code is publicly available: github.com/LCFractal/TGDT
650		4	\|a Journal Article
700	1		\|a Zhang, Yuqi \|e verfasserin \|4 aut
700	1		\|a Wang, Hongsong \|e verfasserin \|4 aut
700	1		\|a Chen, Weihua \|e verfasserin \|4 aut
700	1		\|a Wang, Fan \|e verfasserin \|4 aut
700	1		\|a Huang, Yan \|e verfasserin \|4 aut
700	1		\|a Shen, Yi-Dong \|e verfasserin \|4 aut
700	1		\|a Wang, Liang \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society \|d 1992 \|g 32(2023) vom: 07., Seite 3622-3633 \|w (DE-627)NLM09821456X \|x 1941-0042 \|7 nnas
773	1	8	\|g volume:32 \|g year:2023 \|g day:07 \|g pages:3622-3633
856	4	0	\|u http://dx.doi.org/10.1109/TIP.2023.3286710 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d 32 \|j 2023 \|b 07 \|h 3622-3633
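
Note on the abstract (field 520): the two-stage inference it mentions, a cheap global ranking followed by a finer rerank on a mixed global and local similarity, can be illustrated with a minimal sketch. This is not the TGDT code from the linked repository. The function names, the mean-of-max token score used for the local similarity, the shortlist size `k`, and the mixing weight `alpha` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_similarity(query_tokens, cand_tokens):
    """Assumed local score: average, over query tokens, of the best-matching candidate token."""
    best = [max(cosine(q, c) for c in cand_tokens) for q in query_tokens]
    return sum(best) / len(best)


def two_stage_retrieve(query, candidates, k=2, alpha=0.5):
    """Return candidate indices, best first.

    query and each candidate are dicts with a 'global' vector and a
    'tokens' list of vectors (a hypothetical layout for this sketch).
    """
    # Stage 1: rank every candidate by global similarity only and keep the top k.
    global_scores = [cosine(query["global"], c["global"]) for c in candidates]
    shortlist = sorted(
        range(len(candidates)), key=lambda i: global_scores[i], reverse=True
    )[:k]

    # Stage 2: rerank only the shortlist with a mix of global and local similarity.
    def mixed(i):
        local = local_similarity(query["tokens"], candidates[i]["tokens"])
        return alpha * global_scores[i] + (1 - alpha) * local

    return sorted(shortlist, key=mixed, reverse=True)
```

The point of the design is that the token-level comparison, which is the expensive part, runs on only `k` candidates rather than on the whole collection.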