Adaptive Multi-View and Temporal Fusing Transformer for 3D Human Pose Estimation

This article proposes a unified framework dubbed Multi-view and Temporal Fusing Transformer (MTF-Transformer) to adaptively handle varying view numbers and video lengths without camera calibration in 3D Human Pose Estimation (HPE). It consists of a Feature Extractor, a Multi-view Fusing Transformer (MFT), and a Temporal Fusing Transformer (TFT).
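The abstract describes fusing pose features from a varying number of uncalibrated views with attention. As a rough illustration only, the Python sketch below shows how plain multi-head self-attention over the view axis handles an arbitrary view count with a single set of weights; it is not the authors' MTF-Transformer or their Relative-Attention block, and the names (ViewFusion, dim, heads) are hypothetical.

# Minimal sketch of attention-based fusion over a varying number of views.
# Illustrates the general idea only; NOT the paper's implementation.
import torch
import torch.nn as nn


class ViewFusion(nn.Module):
    """Fuses per-view pose features with self-attention.

    Because attention treats the view axis as a set, the same weights
    accept any number of input views and need no camera calibration.
    """

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_views, dim); n_views may differ between calls.
        fused, _ = self.attn(feats, feats, feats)  # each view attends to all views
        return self.norm(feats + fused)            # residual connection


if __name__ == "__main__":
    fuse = ViewFusion()
    for n_views in (2, 4, 7):                      # arbitrary view counts
        x = torch.randn(1, n_views, 256)
        print(fuse(x).shape)                       # torch.Size([1, n_views, 256])

The same set-style attention applies along the time axis, which is how a transformer can likewise aggregate sequences of arbitrary length, as the TFT described above does.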

Detailed Description

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), 4, dated 05 Apr., pages 4122-4135
Main Author: Shuai, Hui (Author)
Other Authors: Wu, Lele, Liu, Qingshan
Format: Online Article
Language: English
Published: 2023
Access to parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000naa a22002652 4500
001 NLM343091372
003 DE-627
005 20231226015627.0
007 cr uuu---uuuuu
008 231226s2023 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2022.3188716  |2 doi 
028 5 2 |a pubmed24n1143.xml 
035 |a (DE-627)NLM343091372 
035 |a (NLM)35788463 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Shuai, Hui  |e verfasserin  |4 aut 
245 1 0 |a Adaptive Multi-View and Temporal Fusing Transformer for 3D Human Pose Estimation 
264 1 |c 2023 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 10.04.2023 
500 |a Date Revised 11.04.2023 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a This article proposes a unified framework dubbed Multi-view and Temporal Fusing Transformer (MTF-Transformer) to adaptively handle varying view numbers and video lengths without camera calibration in 3D Human Pose Estimation (HPE). It consists of a Feature Extractor, a Multi-view Fusing Transformer (MFT), and a Temporal Fusing Transformer (TFT). The Feature Extractor estimates the 2D pose from each image and fuses the predictions according to their confidence. It provides pose-focused feature embeddings and makes the subsequent modules computationally lightweight. MFT fuses the features of a varying number of views with a novel Relative-Attention block. It adaptively measures the implicit relative relationship between each pair of views and reconstructs more informative features. TFT aggregates the features of the whole sequence and predicts the 3D pose via a transformer. It adaptively deals with videos of arbitrary length and fully utilizes the temporal information. The migration of transformers enables our model to learn spatial geometry better and preserves robustness across varying application scenarios. We report quantitative and qualitative results on the Human3.6M, TotalCapture, and KTH Multiview Football II datasets. Compared with state-of-the-art methods that use camera parameters, MTF-Transformer obtains competitive results and generalizes well to dynamic capture with an arbitrary number of unseen views.
650 4 |a Journal Article 
700 1 |a Wu, Lele  |e verfasserin  |4 aut 
700 1 |a Liu, Qingshan  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 45(2023), 4 vom: 05. Apr., Seite 4122-4135  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:45  |g year:2023  |g number:4  |g day:05  |g month:04  |g pages:4122-4135 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2022.3188716  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 45  |j 2023  |e 4  |b 05  |c 04  |h 4122-4135