Learnable Feature Augmentation Framework for Temporal Action Localization

Temporal action localization (TAL) has drawn much attention in recent years; however, the performance of previous methods is still far from satisfactory due to the lack of annotated untrimmed video data. To deal with this issue, we propose to improve the utilization of current data through feature a...

Bibliographic details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 33(2024), day 02, pages 4002-4015
Main author: Tang, Yepeng (Author)
Other authors: Wang, Weining, Zhang, Chunjie, Liu, Jing, Zhao, Yao
Format: Online article
Language: English
Published: 2024
Access to the collection: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000caa a22002652c 4500
001 NLM373770472
003 DE-627
005 20250306074743.0
007 cr uuu---uuuuu
008 240619s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2024.3413599  |2 doi 
028 5 2 |a pubmed25n1245.xml 
035 |a (DE-627)NLM373770472 
035 |a (NLM)38889016 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Tang, Yepeng  |e verfasserin  |4 aut 
245 1 0 |a Learnable Feature Augmentation Framework for Temporal Action Localization 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 01.07.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Temporal action localization (TAL) has drawn much attention in recent years; however, the performance of previous methods is still far from satisfactory due to the lack of annotated untrimmed video data. To deal with this issue, we propose to improve the utilization of current data through feature augmentation. Given an input video, we first extract video features with pre-trained video encoders, and then randomly mask various semantic contents of video features to consider different views of video features. To avoid damaging important action-related semantic information, we further develop a learnable feature augmentation framework to generate better views of videos. In particular, a Mask-based Feature Augmentation Module (MFAM) is proposed. The MFAM has three advantages: 1) it captures the temporal and semantic relationships of original video features, 2) it generates masked features with indispensable action-related information, and 3) it randomly recycles some masked information to ensure diversity. Finally, we input the masked features and the original features into shared action detectors respectively, and perform action classification and localization jointly for model learning. The proposed framework can improve the robustness and generalization of action detectors by learning more and better views of videos. In the testing stage, the MFAM can be removed, so it does not bring extra computational costs. Extensive experiments are conducted on four TAL benchmark datasets. Our proposed framework significantly improves different TAL models and achieves state-of-the-art performance. 
650 4 |a Journal Article 
700 1 |a Wang, Weining  |e verfasserin  |4 aut 
700 1 |a Zhang, Chunjie  |e verfasserin  |4 aut 
700 1 |a Liu, Jing  |e verfasserin  |4 aut 
700 1 |a Zhao, Yao  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 33(2024) vom: 02., Seite 4002-4015  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnas 
773 1 8 |g volume:33  |g year:2024  |g day:02  |g pages:4002-4015 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2024.3413599  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 33  |j 2024  |b 02  |h 4002-4015