Self-Supervised Video Representation Learning by Uncovering Spatio-Temporal Statistics

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions about rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts
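The abstract's core idea is that pretext labels are simple statistics, e.g. which region of a spatial partition contains the largest motion over time. The following is a minimal sketch of how such a label could be derived, assuming grayscale frames, a plain frame-difference motion proxy, and a uniform grid partition; the function name `largest_motion_block` and the grid choice are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def largest_motion_block(frames, grid=(4, 4)):
    """Return the (row, col) of the grid cell with the largest motion.

    frames: array of shape (T, H, W), grayscale video frames.
    The cell's "motion" is the summed absolute difference between
    consecutive frames inside it -- a toy stand-in for the paper's
    motion-statistics labels.
    """
    frames = np.asarray(frames, dtype=np.float32)
    # Motion proxy: absolute difference between consecutive frames.
    motion = np.abs(np.diff(frames, axis=0)).sum(axis=0)  # (H, W)
    rows, cols = grid
    h, w = motion.shape
    # Crop so the map divides evenly, then sum motion per grid cell.
    cell = motion[: h - h % rows, : w - w % cols]
    cell = cell.reshape(rows, h // rows, cols, w // cols).sum(axis=(1, 3))
    idx = np.unravel_index(np.argmax(cell), cell.shape)
    return tuple(int(i) for i in idx)

# Example: a synthetic clip with a dot moving only inside cell (0, 0).
T, H, W = 8, 64, 64
clip = np.zeros((T, H, W), dtype=np.float32)
for t in range(T):
    clip[t, 2 + t, 2 + t] = 1.0
print(largest_motion_block(clip))  # -> (0, 0)
```

A network trained on such coarse cell indices (rather than exact Cartesian coordinates) faces an easier classification-style target, which is the motivation the abstract gives for using spatial partitioning patterns.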

Detailed Description

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 44(2022), 7, 02 July, pages 3791-3806
Main author: Wang, Jiangliu (author)
Other authors: Jiao, Jianbo, Bao, Linchao, He, Shengfeng, Liu, Wei, Liu, Yun-Hui
Format: Online article
Language: English
Published: 2022
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article; Research Support, Non-U.S. Gov't
LEADER 01000naa a22002652 4500
001 NLM321268741
003 DE-627
005 20231225175438.0
007 cr uuu---uuuuu
008 231225s2022 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2021.3057833  |2 doi 
028 5 2 |a pubmed24n1070.xml 
035 |a (DE-627)NLM321268741 
035 |a (NLM)33566757 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Wang, Jiangliu  |e verfasserin  |4 aut 
245 1 0 |a Self-Supervised Video Representation Learning by Uncovering Spatio-Temporal Statistics 
264 1 |c 2022 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 08.06.2022 
500 |a Date Revised 09.07.2022 
500 |a published: Print-Electronic 
500 |a Citation Status MEDLINE 
520 |a This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions about rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts 
650 4 |a Journal Article 
650 4 |a Research Support, Non-U.S. Gov't 
700 1 |a Jiao, Jianbo  |e verfasserin  |4 aut 
700 1 |a Bao, Linchao  |e verfasserin  |4 aut 
700 1 |a He, Shengfeng  |e verfasserin  |4 aut 
700 1 |a Liu, Wei  |e verfasserin  |4 aut 
700 1 |a Liu, Yun-Hui  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 44(2022), 7 vom: 02. Juli, Seite 3791-3806  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:44  |g year:2022  |g number:7  |g day:02  |g month:07  |g pages:3791-3806 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2021.3057833  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 44  |j 2022  |e 7  |b 02  |c 07  |h 3791-3806