Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering

Recently, integrating vision and language for in-depth video understanding, e.g., video captioning and video question answering, has become a promising direction for artificial intelligence. However, due to the complexity of video information, it is challenging to extract a video feature that can wel...


Bibliographic details
Published in:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 31(2022), day: 01, pages 202-215
Main author:	Gao, Lianli (Author)
Other authors:	Lei, Yu; Zeng, Pengpeng; Song, Jingkuan; Wang, Meng; Shen, Heng Tao
Format:	Online article
Language:	English
Published:	2022
Collection access:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects:	Journal Article


LEADER	01000caa a22002652c 4500
001	NLM332483657
003	DE-627
005	20250302151533.0
007	cr uuu---uuuuu
008	231225s2022 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TIP.2021.3120867 \|2 doi
028	5	2	\|a pubmed25n1108.xml
035			\|a (DE-627)NLM332483657
035			\|a (NLM)34710043
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Gao, Lianli \|e verfasserin \|4 aut
245	1	0	\|a Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering
264		1	\|c 2022
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Revised 07.12.2021
500			\|a published: Print-Electronic
500			\|a Citation Status PubMed-not-MEDLINE
520			\|a Recently, integrating vision and language for in-depth video understanding, e.g., video captioning and video question answering, has become a promising direction for artificial intelligence. However, due to the complexity of video information, it is challenging to extract a video feature that can well represent multiple levels of concepts, i.e., objects, actions and events. Meanwhile, content completeness and syntactic consistency play an important role in high-quality language-related video understanding. Motivated by these, we propose a novel framework, named Hierarchical Representation Network with Auxiliary Tasks (HRNAT), for learning multi-level representations and obtaining syntax-aware video captions. Specifically, the Cross-modality Matching Task enables the learning of hierarchical representations of videos, guided by the three-level representation of languages. The Syntax-guiding Task and the Vision-assist Task contribute to generating descriptions that are not only globally similar to the video content, but also syntactically consistent with the ground-truth description. The key components of our model are general and can be readily applied to both video captioning and video question answering tasks. Performance on the above tasks on several benchmark datasets validates the effectiveness and superiority of our proposed method compared with state-of-the-art methods. Code and models are also released at https://github.com/riesling00/HRNAT
650		4	\|a Journal Article
700	1		\|a Lei, Yu \|e verfasserin \|4 aut
700	1		\|a Zeng, Pengpeng \|e verfasserin \|4 aut
700	1		\|a Song, Jingkuan \|e verfasserin \|4 aut
700	1		\|a Wang, Meng \|e verfasserin \|4 aut
700	1		\|a Shen, Heng Tao \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society \|d 1992 \|g 31(2022) vom: 01., Seite 202-215 \|w (DE-627)NLM09821456X \|x 1941-0042 \|7 nnas
773	1	8	\|g volume:31 \|g year:2022 \|g day:01 \|g pages:202-215
856	4	0	\|u http://dx.doi.org/10.1109/TIP.2021.3120867 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d 31 \|j 2022 \|b 01 \|h 202-215