Self-Supervised Video Representation Learning by Uncovering Spatio-Temporal Statistics

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions about rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts
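The abstract's core idea is that pretext labels are simple statistics, e.g. which region of a spatial partition contains the largest motion over time. The following is a minimal sketch of how such a label could be derived, assuming grayscale frames, a plain frame-difference motion proxy, and a uniform grid partition; the function name `largest_motion_block` and the grid choice are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def largest_motion_block(frames, grid=(4, 4)):
    """Return the (row, col) of the grid cell with the largest motion.

    frames: array of shape (T, H, W), grayscale video frames.
    The cell's "motion" is the summed absolute difference between
    consecutive frames inside it -- a toy stand-in for the paper's
    motion-statistics labels.
    """
    frames = np.asarray(frames, dtype=np.float32)
    # Motion proxy: absolute difference between consecutive frames.
    motion = np.abs(np.diff(frames, axis=0)).sum(axis=0)  # (H, W)
    rows, cols = grid
    h, w = motion.shape
    # Crop so the map divides evenly, then sum motion per grid cell.
    cell = motion[: h - h % rows, : w - w % cols]
    cell = cell.reshape(rows, h // rows, cols, w // cols).sum(axis=(1, 3))
    idx = np.unravel_index(np.argmax(cell), cell.shape)
    return tuple(int(i) for i in idx)

# Example: a synthetic clip with a dot moving only inside cell (0, 0).
T, H, W = 8, 64, 64
clip = np.zeros((T, H, W), dtype=np.float32)
for t in range(T):
    clip[t, 2 + t, 2 + t] = 1.0
print(largest_motion_block(clip))  # -> (0, 0)
```

A network trained on such coarse cell indices (rather than exact Cartesian coordinates) faces an easier classification-style target, which is the motivation the abstract gives for using spatial partitioning patterns.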

Detailed Description

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 44(2022), 7, 02 July, pages 3791-3806
Main author: Wang, Jiangliu (author)
Other authors: Jiao, Jianbo, Bao, Linchao, He, Shengfeng, Liu, Wei, Liu, Yun-Hui
Format: Online article
Language: English
Published: 2022
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article; Research Support, Non-U.S. Gov't
LEADER 01000naa a22002652 4500
001 NLM321268741
003 DE-627
005 20231225175438.0
007 cr uuu---uuuuu
008 231225s2022 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2021.3057833  |2 doi 
028 5 2 |a pubmed24n1070.xml 
035 |a (DE-627)NLM321268741 
035 |a (NLM)33566757 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Wang, Jiangliu  |e verfasserin  |4 aut 
245 1 0 |a Self-Supervised Video Representation Learning by Uncovering Spatio-Temporal Statistics 
264 1 |c 2022 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 08.06.2022 
500 |a Date Revised 09.07.2022 
500 |a published: Print-Electronic 
500 |a Citation Status MEDLINE 
520 |a This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions about rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts 
650 4 |a Journal Article 
650 4 |a Research Support, Non-U.S. Gov't 
700 1 |a Jiao, Jianbo  |e verfasserin  |4 aut 
700 1 |a Bao, Linchao  |e verfasserin  |4 aut 
700 1 |a He, Shengfeng  |e verfasserin  |4 aut 
700 1 |a Liu, Wei  |e verfasserin  |4 aut 
700 1 |a Liu, Yun-Hui  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 44(2022), 7 vom: 02. Juli, Seite 3791-3806  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:44  |g year:2022  |g number:7  |g day:02  |g month:07  |g pages:3791-3806 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2021.3057833  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 44  |j 2022  |e 7  |b 02  |c 07  |h 3791-3806