MemBridge : Video-Language Pre-Training With Memory-Augmented Inter-Modality Bridge

Video-language pre-training has attracted considerable attention recently for its promising performance on various downstream tasks. Most existing methods utilize the modality-specific or modality-joint representation architectures for the cross-modality pre-training. Different from previous methods...


Bibliographic details
Published in:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 32(2023), day 20, pages 4073-4087
Main author:	Yang, Jiahao (author)
Other authors:	Li, Xiangyang, Zheng, Mao, Wang, Zihan, Zhu, Yongqing, Guo, Xiaoqian, Yuan, Yuchen, Chai, Zifeng, Jiang, Shuqiang
Format:	Online article
Language:	English
Published:	2023
Access to the parent work:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects:	Journal Article


LEADER	01000naa a22002652 4500
001	NLM359376991
003	DE-627
005	20231226080736.0
007	cr uuu---uuuuu
008	231226s2023 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TIP.2023.3283916 \|2 doi
028	5	2	\|a pubmed24n1197.xml
035			\|a (DE-627)NLM359376991
035			\|a (NLM)37436853
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Yang, Jiahao \|e verfasserin \|4 aut
245	1	0	\|a MemBridge \|b Video-Language Pre-Training With Memory-Augmented Inter-Modality Bridge
264		1	\|c 2023
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Revised 19.07.2023
500			\|a published: Print-Electronic
500			\|a Citation Status PubMed-not-MEDLINE
520			\|a Video-language pre-training has attracted considerable attention recently for its promising performance on various downstream tasks. Most existing methods utilize the modality-specific or modality-joint representation architectures for the cross-modality pre-training. Different from previous methods, this paper presents a novel architecture named Memory-augmented Inter-Modality Bridge (MemBridge), which uses the learnable intermediate modality representations as the bridge for the interaction between videos and language. Specifically, in the transformer-based cross-modality encoder, we introduce the learnable bridge tokens as the interaction approach, which means the video and language tokens can only perceive information from bridge tokens and themselves. Moreover, a memory bank is proposed to store abundant modality interaction information for adaptively generating bridge tokens according to different cases, enhancing the capacity and robustness of the inter-modality bridge. Through pre-training, MemBridge explicitly models the representations for more sufficient inter-modality interaction. Comprehensive experiments show that our approach achieves competitive performance with previous methods on various downstream tasks including video-text retrieval, video captioning, and video question answering on multiple datasets, demonstrating the effectiveness of the proposed method. The code is available at https://github.com/jahhaoyang/MemBridge
650		4	\|a Journal Article
700	1		\|a Li, Xiangyang \|e verfasserin \|4 aut
700	1		\|a Zheng, Mao \|e verfasserin \|4 aut
700	1		\|a Wang, Zihan \|e verfasserin \|4 aut
700	1		\|a Zhu, Yongqing \|e verfasserin \|4 aut
700	1		\|a Guo, Xiaoqian \|e verfasserin \|4 aut
700	1		\|a Yuan, Yuchen \|e verfasserin \|4 aut
700	1		\|a Chai, Zifeng \|e verfasserin \|4 aut
700	1		\|a Jiang, Shuqiang \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society \|d 1992 \|g 32(2023) vom: 20., Seite 4073-4087 \|w (DE-627)NLM09821456X \|x 1941-0042 \|7 nnns
773	1	8	\|g volume:32 \|g year:2023 \|g day:20 \|g pages:4073-4087
856	4	0	\|u http://dx.doi.org/10.1109/TIP.2023.3283916 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d 32 \|j 2023 \|b 20 \|h 4073-4087