LEADER |
01000naa a22002652 4500 |
001 |
NLM335854974 |
003 |
DE-627 |
005 |
20231225230816.0 |
007 |
cr uuu---uuuuu |
008 |
231225s2022 xx |||||o 00| ||eng c |
024 |
7 |
|
|a 10.1109/TIP.2021.3137649
|2 doi
|
028 |
5 |
2 |
|a pubmed24n1119.xml
|
035 |
|
|
|a (DE-627)NLM335854974
|
035 |
|
|
|a (NLM)35050854
|
040 |
|
|
|a DE-627
|b ger
|c DE-627
|e rakwb
|
041 |
|
|
|a eng
|
100 |
1 |
|
|a Huang, Linjiang
|e verfasserin
|4 aut
|
245 |
1 |
0 |
|a Multi-Modality Self-Distillation for Weakly Supervised Temporal Action Localization
|
264 |
|
1 |
|c 2022
|
336 |
|
|
|a Text
|b txt
|2 rdacontent
|
337 |
|
|
|a Computermedien
|b c
|2 rdamedia
|
338 |
|
|
|a Online-Ressource
|b cr
|2 rdacarrier
|
500 |
|
|
|a Date Revised 31.01.2022
|
500 |
|
|
|a published: Print-Electronic
|
500 |
|
|
|a Citation Status PubMed-not-MEDLINE
|
520 |
|
|
|a As a challenging task in high-level video understanding, Weakly-supervised Temporal Action Localization (WTAL) has attracted increasing attention in recent years. However, given only the weak supervision of whole-video classification labels, it is difficult to accurately determine action instance boundaries. To address this issue, pseudo-label-based methods [Alwassel et al. (2019), Luo et al. (2020), and Zhai et al. (2020)] were proposed to generate snippet-level pseudo labels from classification results. Despite their promising performance, these methods hardly take full advantage of the multiple modalities, i.e., RGB and optical flow sequences, to generate high-quality pseudo labels, and most of them ignore how to mitigate the label noise, which hinders the network's ability to learn discriminative feature representations. To address these challenges, we propose a Multi-Modality Self-Distillation (MMSD) framework, which contains two single-modal streams and a fused-modal stream to perform multi-modality knowledge distillation and multi-modality self-voting. On the one hand, multi-modality knowledge distillation improves snippet-level classification performance by transferring knowledge between the single-modal streams and the fused-modal stream. On the other hand, multi-modality self-voting mitigates the label noise in a modality-voting manner according to the reliability and complementarity of the streams. Experimental results on the THUMOS14 and ActivityNet1.3 datasets demonstrate the effectiveness of our method and its superior performance over state-of-the-art approaches. Our code is available at https://github.com/LeonHLJ/MMSD
|
650 |
|
4 |
|a Journal Article
|
700 |
1 |
|
|a Wang, Liang
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Li, Hongsheng
|e verfasserin
|4 aut
|
773 |
0 |
8 |
|i Enthalten in
|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
|d 1992
|g 31(2022) vom: 20., Seite 1504-1519
|w (DE-627)NLM09821456X
|x 1941-0042
|7 nnns
|
773 |
1 |
8 |
|g volume:31
|g year:2022
|g day:20
|g pages:1504-1519
|
856 |
4 |
0 |
|u http://dx.doi.org/10.1109/TIP.2021.3137649
|3 Volltext
|
912 |
|
|
|a GBV_USEFLAG_A
|
912 |
|
|
|a SYSFLAG_A
|
912 |
|
|
|a GBV_NLM
|
912 |
|
|
|a GBV_ILN_350
|
951 |
|
|
|a AR
|
952 |
|
|
|d 31
|j 2022
|b 20
|h 1504-1519
|