Temporal Representation Learning on Monocular Videos for 3D Human Pose Estimation

In this article we propose an unsupervised feature extraction method to capture temporal information in monocular videos, where we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally distant ones as negative pairs, as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying the contrastive loss only to the time-variant features, encouraging a gradual transition on them between nearby and distant frames, and also reconstructing the input extracts rich temporal features well-suited for human pose estimation. Our approach reduces error by about 50% compared to standard CSS strategies, outperforms other unsupervised single-view methods, and matches the performance of multi-view techniques. When 2D pose is available, our approach can extract even richer latent features and improve 3D pose estimation accuracy, outperforming other state-of-the-art weakly supervised methods.
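The abstract's core idea — splitting each latent vector into time-variant and time-invariant parts and applying a contrastive loss only to the time-variant part, with a gradual transition between nearby and distant frames — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names (`split_latent`, `temporal_contrastive_loss`) and the Gaussian soft-target weighting over temporal distance are assumptions made for this sketch.

```python
import numpy as np

def split_latent(z, k):
    """Disentangle each latent vector into a time-variant part
    (first k dims, hypothetical layout) and a time-invariant remainder."""
    return z[..., :k], z[..., k:]

def temporal_contrastive_loss(z_var, frame_idx, tau=0.1, sigma=2.0):
    """InfoNCE-style loss applied to the time-variant features only.
    Soft targets decay with temporal distance, so nearby frames act as
    positives and distant ones as negatives, with a gradual transition
    in between (a Gaussian weighting is assumed here for illustration)."""
    z = z_var / np.linalg.norm(z_var, axis=1, keepdims=True)
    sim = (z @ z.T) / tau                        # temperature-scaled cosine sims
    dist = np.abs(frame_idx[:, None] - frame_idx[None, :]).astype(float)
    target = np.exp(-dist**2 / (2 * sigma**2))   # soft positives by time gap
    np.fill_diagonal(target, 0.0)                # exclude self-pairs
    target /= target.sum(axis=1, keepdims=True)
    np.fill_diagonal(sim, -np.inf)               # self-pairs never selected
    m = sim.max(axis=1, keepdims=True)
    log_softmax = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    # Mask zero-target entries so 0 * (-inf) does not produce NaN.
    per_pair = np.where(target > 0, target * log_softmax, 0.0)
    return float(-per_pair.sum() / len(frame_idx))
```

In this sketch the time-invariant half of the latent is left untouched by the contrastive term; per the abstract, it is constrained only through input reconstruction.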

Detailed Description

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), 5, 17 May, pages 6415-6427
First author: Honari, Sina (Author)
Other authors: Constantin, Victor, Rhodin, Helge, Salzmann, Mathieu, Fua, Pascal
Format: Online article
Language: English
Published: 2023
Access to parent work: IEEE transactions on pattern analysis and machine intelligence
Keywords: Journal Article, Research Support, Non-U.S. Gov't
LEADER 01000naa a22002652 4500
001 NLM347668127
003 DE-627
005 20231226034359.0
007 cr uuu---uuuuu
008 231226s2023 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2022.3215307  |2 doi 
028 5 2 |a pubmed24n1158.xml 
035 |a (DE-627)NLM347668127 
035 |a (NLM)36251908 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Honari, Sina  |e verfasserin  |4 aut 
245 1 0 |a Temporal Representation Learning on Monocular Videos for 3D Human Pose Estimation 
264 1 |c 2023 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 11.04.2023 
500 |a Date Revised 05.05.2023 
500 |a published: Print-Electronic 
500 |a Citation Status MEDLINE 
520 |a In this article we propose an unsupervised feature extraction method to capture temporal information in monocular videos, where we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally distant ones as negative pairs, as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying the contrastive loss only to the time-variant features, encouraging a gradual transition on them between nearby and distant frames, and also reconstructing the input extracts rich temporal features well-suited for human pose estimation. Our approach reduces error by about 50% compared to standard CSS strategies, outperforms other unsupervised single-view methods, and matches the performance of multi-view techniques. When 2D pose is available, our approach can extract even richer latent features and improve 3D pose estimation accuracy, outperforming other state-of-the-art weakly supervised methods.
650 4 |a Journal Article 
650 4 |a Research Support, Non-U.S. Gov't 
700 1 |a Constantin, Victor  |e verfasserin  |4 aut 
700 1 |a Rhodin, Helge  |e verfasserin  |4 aut 
700 1 |a Salzmann, Mathieu  |e verfasserin  |4 aut 
700 1 |a Fua, Pascal  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 45(2023), 5 vom: 17. Mai, Seite 6415-6427  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:45  |g year:2023  |g number:5  |g day:17  |g month:05  |g pages:6415-6427 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2022.3215307  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 45  |j 2023  |e 5  |b 17  |c 05  |h 6415-6427