End-to-End Temporal Action Detection With Transformer

Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
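The core component named in the abstract, temporal deformable attention, is a 1D analogue of the deformable attention in Deformable DETR: each action query predicts a small set of fractional sampling offsets along the time axis and attends only to the snippet features at those locations, rather than to all snippets. The following is a minimal single-scale PyTorch sketch of that idea; the class name, the linear-interpolation sampling, and all hyperparameters are illustrative assumptions, not the authors' implementation (see the linked repository for the reference code).

    # Illustrative sketch of single-scale temporal deformable attention.
    # Not the TadTR reference code; shapes and defaults are assumptions.
    import torch
    import torch.nn as nn

    class TemporalDeformableAttention(nn.Module):
        def __init__(self, dim=256, num_heads=8, num_points=4):
            super().__init__()
            self.num_heads = num_heads
            self.num_points = num_points
            self.head_dim = dim // num_heads
            # Each head predicts K scalar temporal offsets and K attention
            # weights per query, so attention touches only H*K snippets.
            self.offsets = nn.Linear(dim, num_heads * num_points)
            self.weights = nn.Linear(dim, num_heads * num_points)
            self.value_proj = nn.Linear(dim, dim)
            self.out_proj = nn.Linear(dim, dim)

        def forward(self, query, ref_points, memory):
            # query:      (B, Q, dim)  action queries
            # ref_points: (B, Q)       normalized temporal references in [0, 1]
            # memory:     (B, T, dim)  snippet features from the video encoder
            B, Q, _ = query.shape
            T = memory.shape[1]
            value = self.value_proj(memory).view(B, T, self.num_heads, self.head_dim)

            offsets = self.offsets(query).view(B, Q, self.num_heads, self.num_points)
            attn = self.weights(query).view(B, Q, self.num_heads, self.num_points).softmax(-1)

            # Sampling locations: reference point plus learned offsets (in snippets).
            loc = (ref_points[:, :, None, None] * (T - 1) + offsets).clamp(0, T - 1)

            # Linear interpolation between the two nearest integer snippets.
            lo = loc.floor().long()
            hi = (lo + 1).clamp(max=T - 1)
            frac = (loc - lo.float()).unsqueeze(-1)          # (B, Q, H, K, 1)

            v = value.permute(0, 2, 1, 3)                    # (B, H, T, head_dim)

            def gather(idx):
                # idx: (B, Q, H, K) -> sampled values (B, Q, H, K, head_dim)
                idx = idx.permute(0, 2, 1, 3).reshape(B, self.num_heads, -1)
                idx = idx.unsqueeze(-1).expand(-1, -1, -1, self.head_dim)
                out = torch.gather(v, 2, idx)
                return out.view(B, self.num_heads, Q, self.num_points,
                                self.head_dim).permute(0, 2, 1, 3, 4)

            sampled = gather(lo) * (1.0 - frac) + gather(hi) * frac
            # Weighted sum over the K sampled snippets per head.
            out = (attn.unsqueeze(-1) * sampled).sum(dim=3)  # (B, Q, H, head_dim)
            return self.out_proj(out.reshape(B, Q, -1))

For example, with query of shape (2, 30, 256), ref_points of shape (2, 30) in [0, 1], and memory of shape (2, 100, 256), the module returns a (2, 30, 256) tensor. Because each query samples only num_heads * num_points snippets instead of all T, the cost per query is independent of video length, which is what gives the attention its locality awareness and the detector its lower computation cost.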

Bibliographic details

Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 31(2022), dated: 10., pages 5427-5441
Main author: Liu, Xiaolong (Author)
Other authors: Wang, Qimeng, Hu, Yao, Tang, Xu, Zhang, Shiwei, Bai, Song, Bai, Xiang
Format: Online article
Language: English
Published: 2022
In collection: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000caa a22002652c 4500
001 NLM344669343
003 DE-627
005 20250303164318.0
007 cr uuu---uuuuu
008 231226s2022 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2022.3195321  |2 doi 
028 5 2 |a pubmed25n1148.xml 
035 |a (DE-627)NLM344669343 
035 |a (NLM)35947570 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Liu, Xiaolong  |e verfasserin  |4 aut 
245 1 0 |a End-to-End Temporal Action Detection With Transformer 
264 1 |c 2022 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 18.08.2022 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR. 
650 4 |a Journal Article 
700 1 |a Wang, Qimeng  |e verfasserin  |4 aut 
700 1 |a Hu, Yao  |e verfasserin  |4 aut 
700 1 |a Tang, Xu  |e verfasserin  |4 aut 
700 1 |a Zhang, Shiwei  |e verfasserin  |4 aut 
700 1 |a Bai, Song  |e verfasserin  |4 aut 
700 1 |a Bai, Xiang  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 31(2022) vom: 10., Seite 5427-5441  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnas 
773 1 8 |g volume:31  |g year:2022  |g day:10  |g pages:5427-5441 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2022.3195321  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 31  |j 2022  |b 10  |h 5427-5441