Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering

Recently, integrating vision and language for in-depth video understanding, e.g., video captioning and video question answering, has become a promising direction for artificial intelligence. However, due to the complexity of video information, it is challenging to extract a video feature that can wel...

Bibliographic details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 31(2022) from: 01., pages 202-215
Main author: Gao, Lianli (Author)
Other authors: Lei, Yu, Zeng, Pengpeng, Song, Jingkuan, Wang, Meng, Shen, Heng Tao
Format: Online article
Language: English
Published: 2022
Collection access: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000caa a22002652c 4500
001 NLM332483657
003 DE-627
005 20250302151533.0
007 cr uuu---uuuuu
008 231225s2022 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2021.3120867  |2 doi 
028 5 2 |a pubmed25n1108.xml 
035 |a (DE-627)NLM332483657 
035 |a (NLM)34710043 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Gao, Lianli  |e verfasserin  |4 aut 
245 1 0 |a Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering 
264 1 |c 2022 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 07.12.2021 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Recently, integrating vision and language for in-depth video understanding, e.g., video captioning and video question answering, has become a promising direction for artificial intelligence. However, due to the complexity of video information, it is challenging to extract a video feature that represents multiple levels of concepts well, i.e., objects, actions and events. Meanwhile, content completeness and syntactic consistency play an important role in high-quality language-related video understanding. Motivated by these observations, we propose a novel framework, named Hierarchical Representation Network with Auxiliary Tasks (HRNAT), for learning multi-level representations and obtaining syntax-aware video captions. Specifically, the Cross-modality Matching Task enables the learning of a hierarchical representation of videos, guided by the three-level representation of languages. The Syntax-guiding Task and the Vision-assist Task contribute to generating descriptions that are not only globally similar to the video content, but also syntactically consistent with the ground-truth description. The key components of our model are general and can be readily applied to both video captioning and video question answering tasks. Results on several benchmark datasets for both tasks validate the effectiveness and superiority of our proposed method compared with state-of-the-art methods. Code and models are also released at https://github.com/riesling00/HRNAT 
650 4 |a Journal Article 
700 1 |a Lei, Yu  |e verfasserin  |4 aut 
700 1 |a Zeng, Pengpeng  |e verfasserin  |4 aut 
700 1 |a Song, Jingkuan  |e verfasserin  |4 aut 
700 1 |a Wang, Meng  |e verfasserin  |4 aut 
700 1 |a Shen, Heng Tao  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 31(2022) vom: 01., Seite 202-215  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnas 
773 1 8 |g volume:31  |g year:2022  |g day:01  |g pages:202-215 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2021.3120867  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 31  |j 2022  |b 01  |h 202-215