Contrast-Reconstruction Representation Learning for Self-Supervised Skeleton-Based Action Recognition

Skeleton-based action recognition is widely used in varied areas, e.g., surveillance and human-machine interaction. Existing models are mainly learned in a supervised manner, thus heavily depending on large-scale labeled data, which could be infeasible when labels are prohibitively expensive. In thi...

Ausführliche Beschreibung

Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 31(2022) vom: 23., Seite 6224-6238
1. Verfasser: Wang, Peng (VerfasserIn)
Weitere Verfasser: Wen, Jun, Si, Chenyang, Qian, Yuntao, Wang, Liang
Format: Online-Aufsatz
Sprache:English
Veröffentlicht: 2022
Zugriff auf das übergeordnete Werk:IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Schlagworte:Journal Article
Beschreibung
Zusammenfassung:Skeleton-based action recognition is widely used in varied areas, e.g., surveillance and human-machine interaction. Existing models are mainly learned in a supervised manner, thus heavily depending on large-scale labeled data, which could be infeasible when labels are prohibitively expensive. In this paper, we propose a novel Contrast-Reconstruction Representation Learning network (CRRL) that simultaneously captures postures and motion dynamics for unsupervised skeleton-based action recognition. It consists of three parts: Sequence Reconstructor (SER), Contrastive Motion Learner (CML), and Information Fuser (INF). SER learns representation from skeleton coordinate sequence via reconstruction. However the learned representation tends to focus on trivial postural coordinates and be hesitant in motion learning. To enhance the learning of motions, CML performs contrastive learning between the representation learned from coordinate sequences and additional velocity sequences, respectively. Finally, in the INF module, we explore varied strategies to combine SER and CML, and propose to couple postures and motions via a knowledge-distillation based fusion strategy which transfers the motion learning from CML to SER. Experimental results on several benchmarks, i.e., NTU RGB+D 60/120, PKU-MMD, CMU, and NW-UCLA, demonstrate the promise of the our method by outperforming state-of-the-art approaches
Beschreibung:Date Completed 30.09.2022
Date Revised 30.09.2022
published: Print-Electronic
Citation Status MEDLINE
ISSN:1941-0042
DOI:10.1109/TIP.2022.3207577