|
|
|
|
LEADER |
01000naa a22002652 4500 |
001 |
NLM35840536X |
003 |
DE-627 |
005 |
20231226074653.0 |
007 |
cr uuu---uuuuu |
008 |
231226s2023 xx |||||o 00| ||eng c |
024 |
7 |
|
|a 10.1109/TIP.2023.3286710
|2 doi
|
028 |
5 |
2 |
|a pubmed24n1194.xml
|
035 |
|
|
|a (DE-627)NLM35840536X
|
035 |
|
|
|a (NLM)37339023
|
040 |
|
|
|a DE-627
|b ger
|c DE-627
|e rakwb
|
041 |
|
|
|a eng
|
100 |
1 |
|
|a Liu, Chong
|e verfasserin
|4 aut
|
245 |
1 |
0 |
|a Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training
|
264 |
|
1 |
|c 2023
|
336 |
|
|
|a Text
|b txt
|2 rdacontent
|
337 |
|
|
|a Computermedien
|b c
|2 rdamedia
|
338 |
|
|
|a Online-Ressource
|b cr
|2 rdacarrier
|
500 |
|
|
|a Date Completed 04.07.2023
|
500 |
|
|
|a Date Revised 04.07.2023
|
500 |
|
|
|a published: Print-Electronic
|
500 |
|
|
|a Citation Status PubMed-not-MEDLINE
|
520 |
|
|
|a Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various vision and language tasks. Most previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations for each modality are important for image-text retrieval but have been largely neglected. As a result, such previous works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans simultaneously pay attention to the entire sample and regional elements to understand the semantic content. To this end, a Token-Guided Dual Transformer (TGDT) architecture, which consists of two homogeneous branches for the image and text modalities, respectively, is proposed for image-text retrieval. The TGDT incorporates both coarse- and fine-grained retrieval into a unified framework and leverages the advantages of both retrieval approaches. A novel training objective called Consistent Multimodal Contrastive (CMC) loss is proposed accordingly to ensure the intra- and inter-modal semantic consistency between images and texts in the common embedding space. Equipped with a two-stage inference method based on the mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performance with extremely low inference time when compared with representative recent approaches. Code is publicly available: github.com/LCFractal/TGDT
|
650 |
|
4 |
|a Journal Article
|
700 |
1 |
|
|a Zhang, Yuqi
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Wang, Hongsong
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Chen, Weihua
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Wang, Fan
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Huang, Yan
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Shen, Yi-Dong
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Wang, Liang
|e verfasserin
|4 aut
|
773 |
0 |
8 |
|i Enthalten in
|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
|d 1992
|g 32(2023) vom: 20., Seite 3622-3633
|w (DE-627)NLM09821456X
|x 1941-0042
|7 nnns
|
773 |
1 |
8 |
|g volume:32
|g year:2023
|g day:20
|g pages:3622-3633
|
856 |
4 |
0 |
|u http://dx.doi.org/10.1109/TIP.2023.3286710
|3 Volltext
|
912 |
|
|
|a GBV_USEFLAG_A
|
912 |
|
|
|a SYSFLAG_A
|
912 |
|
|
|a GBV_NLM
|
912 |
|
|
|a GBV_ILN_350
|
951 |
|
|
|a AR
|
952 |
|
|
|d 32
|j 2023
|b 20
|h 3622-3633
|