Audio2Gestures: Generating Diverse Gestures From Audio

People may perform diverse gestures affected by various mental and physical factors when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume a one-to-one mapping and thus tend to predict the average of all possible target motions, easily resulting in plain/boring motions during inference. We therefore propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated with the audio, while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial training losses/strategies, including a relaxed motion loss, a bicycle constraint, and a diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, both quantitatively and qualitatively. Moreover, our formulation is compatible with discrete cosine transform (DCT) modeling and other popular backbones (i.e., RNN, Transformer). As for motion losses and quantitative motion evaluation, we find that structured losses/metrics (e.g., STFT) that consider temporal and/or spatial context complement the most commonly used point-wise losses (e.g., PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can readily be used to generate motion sequences with user-specified motion clips on the timeline.
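The abstract is concrete enough to sketch the central mechanism in code. Below is a minimal, illustrative PyTorch sketch of a split-latent VAE, assuming GRU encoders/decoders and simplified diversity and STFT losses; all module names, dimensions, and loss terms are assumptions of this sketch, not the authors' released implementation (the relaxed motion loss and bicycle constraint are omitted for brevity).

    # Minimal sketch (NOT the authors' code): a VAE whose motion latent is
    # split into an audio-correlated shared code and an audio-independent
    # motion-specific code. All dimensions below are illustrative.
    import torch
    import torch.nn as nn

    class SplitLatentVAE(nn.Module):
        def __init__(self, audio_dim=128, motion_dim=64,
                     shared_dim=32, specific_dim=32):
            super().__init__()
            self.specific_dim = specific_dim
            # Audio encoder predicts only the shared code.
            self.audio_enc = nn.GRU(audio_dim, shared_dim, batch_first=True)
            # Motion encoder predicts the shared code plus the mean and
            # log-variance of the motion-specific code.
            self.motion_enc = nn.GRU(motion_dim, shared_dim + 2 * specific_dim,
                                     batch_first=True)
            # Decoder reconstructs motion from the concatenated codes.
            self.decoder = nn.GRU(shared_dim + specific_dim, motion_dim,
                                  batch_first=True)

        def forward(self, audio, motion):
            shared_a, _ = self.audio_enc(audio)            # (B, T, shared)
            h, _ = self.motion_enc(motion)                 # (B, T, shared+2*spec)
            shared_m = h[..., :-2 * self.specific_dim]
            mu, logvar = h[..., -2 * self.specific_dim:].chunk(2, dim=-1)
            # Reparameterization trick: sample the motion-specific code.
            z_spec = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            recon, _ = self.decoder(torch.cat([shared_m, z_spec], dim=-1))
            return recon, shared_a, shared_m, mu, logvar

        def sample(self, audio):
            # At inference the motion-specific code is drawn from the prior,
            # so one audio clip can produce many distinct motions.
            shared_a, _ = self.audio_enc(audio)
            z_spec = torch.randn(*shared_a.shape[:2], self.specific_dim,
                                 device=shared_a.device)
            motion, _ = self.decoder(torch.cat([shared_a, z_spec], dim=-1))
            return motion

    def diversity_loss(model, audio):
        # Simplified diversity term: push two samples drawn for the same
        # audio apart (minimizing the negative distance maximizes diversity).
        return -(model.sample(audio) - model.sample(audio)).abs().mean()

    def stft_motion_loss(pred, target, n_fft=32):
        # Simplified structured loss: compare joint trajectories in the
        # frequency domain so temporal dynamics are penalized, not just
        # per-frame poses. pred/target: (B, T, D) with T >= n_fft.
        flat = lambda m: m.transpose(1, 2).reshape(-1, m.shape[1])
        spec = lambda m: torch.stft(flat(m), n_fft=n_fft,
                                    window=torch.hann_window(n_fft),
                                    return_complex=True).abs()
        return (spec(pred) - spec(target)).abs().mean()

    if __name__ == "__main__":
        model = SplitLatentVAE()
        audio = torch.randn(2, 64, 128)   # (batch, frames, audio features)
        motion = torch.randn(2, 64, 64)   # (batch, frames, pose features)
        recon, *_, mu, logvar = model(audio, motion)
        loss = ((recon - motion) ** 2).mean() \
               + stft_motion_loss(recon, motion) \
               + diversity_loss(model, audio)
        print(recon.shape, float(loss))

In this formulation, sampling different motion-specific codes at inference is what yields diverse gestures for the same audio; the paper's bicycle constraint and relaxed motion loss would be added on top of this skeleton during training.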

Detailed Description

Bibliographic Details
Published in: IEEE transactions on visualization and computer graphics. - 1996. - 30(2024), 8, 01 July, pages 4752-4766
Main Author: Li, Jing (Author)
Other Authors: Kang, Di; Pei, Wenjie; Zhe, Xuefei; Zhang, Ying; Bao, Linchao; He, Zhenyu
Format: Online Article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on visualization and computer graphics
Subjects: Journal Article
LEADER 01000caa a22002652 4500
001 NLM356984133
003 DE-627
005 20240703232234.0
007 cr uuu---uuuuu
008 231226s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TVCG.2023.3276973  |2 doi 
028 5 2 |a pubmed24n1458.xml 
035 |a (DE-627)NLM356984133 
035 |a (NLM)37195841 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Li, Jing  |e verfasserin  |4 aut 
245 1 0 |a Audio2Gestures  |b Generating Diverse Gestures From Audio 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 02.07.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a People may perform diverse gestures affected by various mental and physical factors when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume a one-to-one mapping and thus tend to predict the average of all possible target motions, easily resulting in plain/boring motions during inference. We therefore propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated with the audio, while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial training losses/strategies, including a relaxed motion loss, a bicycle constraint, and a diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, both quantitatively and qualitatively. Moreover, our formulation is compatible with discrete cosine transform (DCT) modeling and other popular backbones (i.e., RNN, Transformer). As for motion losses and quantitative motion evaluation, we find that structured losses/metrics (e.g., STFT) that consider temporal and/or spatial context complement the most commonly used point-wise losses (e.g., PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can readily be used to generate motion sequences with user-specified motion clips on the timeline.
650 4 |a Journal Article 
700 1 |a Kang, Di  |e verfasserin  |4 aut 
700 1 |a Pei, Wenjie  |e verfasserin  |4 aut 
700 1 |a Zhe, Xuefei  |e verfasserin  |4 aut 
700 1 |a Zhang, Ying  |e verfasserin  |4 aut 
700 1 |a Bao, Linchao  |e verfasserin  |4 aut 
700 1 |a He, Zhenyu  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on visualization and computer graphics  |d 1996  |g 30(2024), 8 vom: 01. Juli, Seite 4752-4766  |w (DE-627)NLM098269445  |x 1941-0506  |7 nnns 
773 1 8 |g volume:30  |g year:2024  |g number:8  |g day:01  |g month:07  |g pages:4752-4766 
856 4 0 |u http://dx.doi.org/10.1109/TVCG.2023.3276973  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 30  |j 2024  |e 8  |b 01  |c 07  |h 4752-4766