Graph-Based Multi-Interaction Network for Video Question Answering

Video question answering is an important task combining both Natural Language Processing and Computer Vision, which requires a machine to obtain a thorough understanding of the video. Most existing approaches simply capture spatio-temporal information in videos by using a combination of recurrent an...

Ausführliche Beschreibung

Bibliographische Detailangaben
Veröffentlicht in:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 30(2021) vom: 21., Seite 2758-2770
1. Verfasser:	Gu, Mao (VerfasserIn)
Weitere Verfasser:	Zhao, Zhou, Jin, Weike, Hong, Richang, Wu, Fei
Format:	Online-Aufsatz
Sprache:	English
Veröffentlicht:	2021
Zugriff auf das übergeordnete Werk:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Schlagworte:	Journal Article


LEADER	01000naa a22002652 4500
001	NLM320385752
003	DE-627
005	20231225173447.0
007	cr uuu---uuuuu
008	231225s2021 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TIP.2021.3051756 \|2 doi
028	5	2	\|a pubmed24n1067.xml
035			\|a (DE-627)NLM320385752
035			\|a (NLM)33476268
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Gu, Mao \|e verfasserin \|4 aut
245	1	0	\|a Graph-Based Multi-Interaction Network for Video Question Answering
264		1	\|c 2021
336			\|a Text \|b txt \|2 rdacontent
337			\|a ƒaComputermedien \|b c \|2 rdamedia
338			\|a ƒa Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Revised 15.02.2021
500			\|a published: Print-Electronic
500			\|a Citation Status PubMed-not-MEDLINE
520			\|a Video question answering is an important task combining both Natural Language Processing and Computer Vision, which requires a machine to obtain a thorough understanding of the video. Most existing approaches simply capture spatio-temporal information in videos by using a combination of recurrent and convolutional neural networks. Nonetheless, most previous work focus on only salient frames or regions, which normally lacks some significant details, such as potential location and action relations. In this paper, we propose a new method called Graph-based Multi-interaction Network for video question answering. In our model, a new attention mechanism named multi-interaction is designed to capture both element-wise and segment-wise sequence interactions simultaneously, which can be found between and inside the multi-modal inputs. Moreover, we propose a graph-based relation-aware neural network to explore a more fine-grained visual representation, which could explore the relationships and dependencies between objects spatially and temporally. We evaluate our method on TGIF-QA and other two video QA datasets. The qualitative and quantitative experimental results show the effectiveness of our model, which achieves state-of-the-art performance
650		4	\|a Journal Article
700	1		\|a Zhao, Zhou \|e verfasserin \|4 aut
700	1		\|a Jin, Weike \|e verfasserin \|4 aut
700	1		\|a Hong, Richang \|e verfasserin \|4 aut
700	1		\|a Wu, Fei \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society \|d 1992 \|g 30(2021) vom: 21., Seite 2758-2770 \|w (DE-627)NLM09821456X \|x 1941-0042 \|7 nnns
773	1	8	\|g volume:30 \|g year:2021 \|g day:21 \|g pages:2758-2770
856	4	0	\|u http://dx.doi.org/10.1109/TIP.2021.3051756 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d 30 \|j 2021 \|b 21 \|h 2758-2770