MemBridge : Video-Language Pre-Training With Memory-Augmented Inter-Modality Bridge

Video-language pre-training has attracted considerable attention recently for its promising performance on various downstream tasks. Most existing methods utilize the modality-specific or modality-joint representation architectures for the cross-modality pre-training. Different from previous methods...

Bibliographic details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 32(2023), day: 20, pages 4073-4087
Main author: Yang, Jiahao (Author)
Other authors: Li, Xiangyang, Zheng, Mao, Wang, Zihan, Zhu, Yongqing, Guo, Xiaoqian, Yuan, Yuchen, Chai, Zifeng, Jiang, Shuqiang
Format: Online article
Language: English
Published: 2023
Access to the parent work: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000naa a22002652 4500
001 NLM359376991
003 DE-627
005 20231226080736.0
007 cr uuu---uuuuu
008 231226s2023 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2023.3283916  |2 doi 
028 5 2 |a pubmed24n1197.xml 
035 |a (DE-627)NLM359376991 
035 |a (NLM)37436853 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Yang, Jiahao  |e verfasserin  |4 aut 
245 1 0 |a MemBridge  |b Video-Language Pre-Training With Memory-Augmented Inter-Modality Bridge 
264 1 |c 2023 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 19.07.2023 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Video-language pre-training has attracted considerable attention recently for its promising performance on various downstream tasks. Most existing methods utilize the modality-specific or modality-joint representation architectures for the cross-modality pre-training. Different from previous methods, this paper presents a novel architecture named Memory-augmented Inter-Modality Bridge (MemBridge), which uses the learnable intermediate modality representations as the bridge for the interaction between videos and language. Specifically, in the transformer-based cross-modality encoder, we introduce the learnable bridge tokens as the interaction approach, which means the video and language tokens can only perceive information from bridge tokens and themselves. Moreover, a memory bank is proposed to store abundant modality interaction information for adaptively generating bridge tokens according to different cases, enhancing the capacity and robustness of the inter-modality bridge. Through pre-training, MemBridge explicitly models the representations for more sufficient inter-modality interaction. Comprehensive experiments show that our approach achieves competitive performance with previous methods on various downstream tasks including video-text retrieval, video captioning, and video question answering on multiple datasets, demonstrating the effectiveness of the proposed method. The code is available at https://github.com/jahhaoyang/MemBridge 
650 4 |a Journal Article 
700 1 |a Li, Xiangyang  |e verfasserin  |4 aut 
700 1 |a Zheng, Mao  |e verfasserin  |4 aut 
700 1 |a Wang, Zihan  |e verfasserin  |4 aut 
700 1 |a Zhu, Yongqing  |e verfasserin  |4 aut 
700 1 |a Guo, Xiaoqian  |e verfasserin  |4 aut 
700 1 |a Yuan, Yuchen  |e verfasserin  |4 aut 
700 1 |a Chai, Zifeng  |e verfasserin  |4 aut 
700 1 |a Jiang, Shuqiang  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 32(2023) vom: 20., Seite 4073-4087  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnns 
773 1 8 |g volume:32  |g year:2023  |g day:20  |g pages:4073-4087 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2023.3283916  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 32  |j 2023  |b 20  |h 4073-4087
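
Note on the abstract (field 520): the statement that video and language tokens "can only perceive information from bridge tokens and themselves" can be read as a restriction on the attention mask of the cross-modality encoder. The following is a minimal sketch of such a mask, not the authors' implementation. The function name, the token ordering (video, bridge, text), the reading of "themselves" as "tokens of the same modality", and the choice to let bridge tokens attend to every token are all assumptions made for illustration; the abstract does not specify them.

```python
def bridge_attention_mask(n_video: int, n_bridge: int, n_text: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Hypothetical helper. Assumed token order: video, then bridge, then text.
    """
    # Label every sequence position with its modality.
    kinds = ["video"] * n_video + ["bridge"] * n_bridge + ["text"] * n_text

    mask = []
    for q in kinds:
        row = []
        for k in kinds:
            if q == "bridge":
                # Assumption: bridge tokens gather information from all tokens.
                allowed = True
            else:
                # Video/text tokens see only their own modality and the bridge,
                # so the two modalities never attend to each other directly.
                allowed = (k == q) or (k == "bridge")
            row.append(allowed)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two video tokens, one bridge token, two text tokens.
    for row in bridge_attention_mask(2, 1, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

With two video tokens, one bridge token, and two text tokens, the printed rows are 11100, 11100, 11111, 00111, 00111: the zero blocks in the corners show that video and text exchange information only through the bridge position. The memory bank that generates the bridge tokens is not modeled here, because the abstract gives no detail on how it is read or updated.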