Unpaired Image-text Matching via Multimodal Aligned Conceptual Knowledge

Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. In contrast, human brains can match images with texts well using their stored multimodal knowledge. Inspired by this, this paper studies a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we propose a simple yet effective method, namely Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit certain datasets, we refine it using unpaired images and texts in a self-supervised manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on the knowledge, we represent parsed words in the texts by prototypical region representations and compute region-word similarity scores. Finally, the scores are aggregated by bidirectional similarity pooling into an image-text similarity score, which can be directly used for unpaired image-text matching. The proposed MACK is complementary to existing models and can be easily extended as a re-ranking method to substantially improve their performance on zero-shot and cross-dataset image-text matching.
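As a reading aid, the pipeline the abstract describes (prototypical region representations per word, then region-word similarities aggregated by bidirectional pooling) can be sketched roughly as below. This is a minimal illustration under assumptions, not the authors' released code: the feature shapes, the averaging used to form prototypes, the exact pooling scheme, and all function names here are hypothetical.

import numpy as np

def build_prototypes(word_to_region_feats):
    # Pretrained general knowledge: average the region features collected
    # for each word into one prototypical region representation.
    return {w: np.mean(feats, axis=0) for w, feats in word_to_region_feats.items()}

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mack_score(region_feats, words, prototypes):
    # Represent parsed words by their prototypical region representations,
    # compute region-word similarities, and aggregate them with a
    # bidirectional max-then-mean pooling into one image-text score.
    word_feats = np.stack([prototypes[w] for w in words if w in prototypes])
    sim = cosine_sim(region_feats, word_feats)   # (n_regions, n_words)
    image_to_text = sim.max(axis=1).mean()       # best word per region
    text_to_image = sim.max(axis=0).mean()       # best region per word
    return 0.5 * (image_to_text + text_to_image)

A score computed this way needs no paired supervision at matching time, which is why, per the abstract, it can also serve as a re-ranking signal on top of an existing model's ranked candidates.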

Detailed Description

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - PP(2024), 23 July
Main Author: Huang, Yan (Author)
Other Authors: Wang, Yuming, Zeng, Yunan, Huang, Junshi, Chai, Zhenhua, Wang, Liang
Format: Online Article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000naa a22002652 4500
001 NLM37530116X
003 DE-627
005 20240724234031.0
007 cr uuu---uuuuu
008 240724s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2024.3432552  |2 doi 
028 5 2 |a pubmed24n1480.xml 
035 |a (DE-627)NLM37530116X 
035 |a (NLM)39042537 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Huang, Yan  |e verfasserin  |4 aut 
245 1 0 |a Unpaired Image-text Matching via Multimodal Aligned Conceptual Knowledge 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 23.07.2024 
500 |a published: Print-Electronic 
500 |a Citation Status Publisher 
520 |a Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. In contrast, human brains can match images with texts well using their stored multimodal knowledge. Inspired by this, this paper studies a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we propose a simple yet effective method, namely Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit certain datasets, we refine it using unpaired images and texts in a self-supervised manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on the knowledge, we represent parsed words in the texts by prototypical region representations and compute region-word similarity scores. Finally, the scores are aggregated by bidirectional similarity pooling into an image-text similarity score, which can be directly used for unpaired image-text matching. The proposed MACK is complementary to existing models and can be easily extended as a re-ranking method to substantially improve their performance on zero-shot and cross-dataset image-text matching.
650 4 |a Journal Article 
700 1 |a Wang, Yuming  |e verfasserin  |4 aut 
700 1 |a Zeng, Yunan  |e verfasserin  |4 aut 
700 1 |a Huang, Junshi  |e verfasserin  |4 aut 
700 1 |a Chai, Zhenhua  |e verfasserin  |4 aut 
700 1 |a Wang, Liang  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g PP(2024) vom: 23. Juli  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:PP  |g year:2024  |g day:23  |g month:07 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2024.3432552  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d PP  |j 2024  |b 23  |c 07