ContextLoc++ : A Unified Context Model for Temporal Action Localization


Detailed Description

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence. - 1979. - Vol. 45 (2023), No. 8, 17 Aug., pages 9504-9519
First Author: Zhu, Zixin (Author)
Other Authors: Wang, Le; Tang, Wei; Zheng, Nanning; Hua, Gang
Format: Online Article
Language: English
Published: 2023
Access to the parent work: IEEE Transactions on Pattern Analysis and Machine Intelligence
Keywords: Journal Article
Description
Abstract: Effectively tackling the problem of temporal action localization (TAL) necessitates a visual representation that jointly pursues two confounding goals, i.e., fine-grained discrimination for temporal localization and sufficient visual invariance for action classification. We address this challenge by enriching the local, global, and multi-scale contexts in the popular two-stage temporal localization framework. Our proposed model, dubbed ContextLoc++, can be divided into three sub-networks: L-Net, G-Net, and M-Net. L-Net enriches the local context via fine-grained modeling of snippet-level features, which is formulated as a query-and-retrieval process. Furthermore, the spatial and temporal snippet-level features, functioning as keys and values, are fused by temporal gating. G-Net enriches the global context via higher-level modeling of the video-level representation. In addition, we introduce a novel context adaptation module to adapt the global context to different proposals. M-Net further fuses the local and global contexts with multi-scale proposal features. Specifically, proposal-level features from multi-scale video snippets can focus on different action characteristics: short-term snippets with fewer frames attend to action details, while long-term snippets with more frames capture action variations. Experiments on the THUMOS14 and ActivityNet v1.3 datasets validate the efficacy of our method against existing state-of-the-art TAL algorithms.
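
The abstract describes the three sub-networks only at a high level. The following is a minimal runnable sketch of how that data flow could compose; the module names L-Net, G-Net, and M-Net follow the paper, but every layer shape, the gating formula, the mean-pooling used for the video-level representation, and the fusion layers are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class TemporalGate(nn.Module):
    # One plausible reading of the "temporal gating" that fuses the
    # spatial and temporal snippet-level streams (assumed form).
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, spatial, temporal):
        g = self.gate(torch.cat([spatial, temporal], dim=-1))
        return g * spatial + (1.0 - g) * temporal

class LNet(nn.Module):
    # Local context as a query-and-retrieval (attention-style) step
    # over the gated snippet-level features, which serve as keys/values.
    def __init__(self, dim):
        super().__init__()
        self.gate = TemporalGate(dim)
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, spatial, temporal):            # (T, dim) each
        snip = self.gate(spatial, temporal)
        attn = torch.softmax(
            self.q(snip) @ self.k(snip).t() / snip.shape[-1] ** 0.5, dim=-1)
        return attn @ self.v(snip)                   # (T, dim) local context

class GNet(nn.Module):
    # Global context: pool a video-level representation, then adapt it
    # to each proposal (the context adaptation module is assumed here
    # to be a single linear layer over the concatenated features).
    def __init__(self, dim):
        super().__init__()
        self.adapt = nn.Linear(2 * dim, dim)

    def forward(self, snippets, proposal):           # (T, dim), (dim,)
        video = snippets.mean(dim=0)                 # video-level feature
        return self.adapt(torch.cat([video, proposal], dim=-1))

class MNet(nn.Module):
    # Fuse local and global contexts with proposal features drawn from
    # snippets at multiple temporal scales.
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, local_ctx, global_ctx, proposals):
        pooled = local_ctx.mean(dim=0)
        return torch.stack([
            self.fuse(torch.cat([pooled, global_ctx, p], dim=-1))
            for p in proposals])                     # one feature per scale

# Toy usage: 16 snippets, 128-dim features, proposals at 3 temporal scales.
T, D = 16, 128
spatial, temporal = torch.randn(T, D), torch.randn(T, D)
l_net, g_net, m_net = LNet(D), GNet(D), MNet(D)
local_ctx = l_net(spatial, temporal)
proposal = torch.randn(D)
global_ctx = g_net(local_ctx, proposal)
fused = m_net(local_ctx, global_ctx, [torch.randn(D) for _ in range(3)])
print(fused.shape)  # torch.Size([3, 128])

The sketch only fixes the composition the abstract names (gated snippet fusion, query-and-retrieval, per-proposal adaptation, multi-scale fusion); the paper's actual retrieval, adaptation, and fusion steps are more elaborate.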
Description: Date Completed 07.07.2023
Date Revised 07.07.2023
Published: Print-Electronic
Citation Status: PubMed-not-MEDLINE
ISSN: 1939-3539
DOI: 10.1109/TPAMI.2023.3237597