Image-Text Embedding Learning via Visual and Textual Semantic Reasoning

As a bridge between the language and vision domains, cross-modal retrieval between images and texts has been a hot research topic in recent years. It remains challenging because current image representations usually lack the semantic concepts found in the corresponding sentence captions. To address this issue, we i...

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), no. 1, 15 Jan., pages 641-656
Main Author: Li, Kunpeng (Author)
Other Authors: Zhang, Yulun, Li, Kai, Li, Yuanyuan, Fu, Yun
Format: Online Article
Language: English
Published: 2023
Access to parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000caa a22002652c 4500
001 NLM336626045
003 DE-627
005 20250303001412.0
007 cr uuu---uuuuu
008 231225s2023 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2022.3148470  |2 doi 
028 5 2 |a pubmed25n1121.xml 
035 |a (DE-627)NLM336626045 
035 |a (NLM)35130144 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Li, Kunpeng  |e verfasserin  |4 aut 
245 1 0 |a Image-Text Embedding Learning via Visual and Textual Semantic Reasoning 
264 1 |c 2023 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 05.04.2023 
500 |a Date Revised 05.04.2023 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a As a bridge between the language and vision domains, cross-modal retrieval between images and texts has been a hot research topic in recent years. It remains challenging because current image representations usually lack the semantic concepts found in the corresponding sentence captions. To address this issue, we introduce an intuitive and interpretable model to learn a common embedding space for alignments between images and text descriptions. Specifically, our model first incorporates semantic relationship information into visual and textual features by performing region or word relationship reasoning. Then it uses a gate and memory mechanism to perform global semantic reasoning on these relationship-enhanced features, select the discriminative information, and gradually grow representations for the whole scene. Through alignment learning, the learned visual representations capture the key objects and semantic concepts of a scene, as in the corresponding text caption. Experiments on the MS-COCO [1] and Flickr30K [2] datasets validate that our method surpasses many recent state-of-the-art methods by a clear margin. In addition to being effective, our methods are also very efficient at the inference stage. Thanks to the effective overall representation learning with visual semantic reasoning, our methods already achieve very strong performance by relying only on a simple inner product to obtain similarity scores between images and captions. Experiments validate that the proposed methods are more than 30-75 times faster than many recent methods with publicly available code. Instead of following the recent trend of using complex local matching strategies [3], [4], [5], [6] to pursue good performance while sacrificing efficiency, we show that a simple global matching strategy can still be very effective and efficient, and can achieve even better performance based on our framework 
650 4 |a Journal Article 
700 1 |a Zhang, Yulun  |e verfasserin  |4 aut 
700 1 |a Li, Kai  |e verfasserin  |4 aut 
700 1 |a Li, Yuanyuan  |e verfasserin  |4 aut 
700 1 |a Fu, Yun  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 45(2023), 1 vom: 15. Jan., Seite 641-656  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnas 
773 1 8 |g volume:45  |g year:2023  |g number:1  |g day:15  |g month:01  |g pages:641-656 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2022.3148470  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 45  |j 2023  |e 1  |b 15  |c 01  |h 641-656