Personalized Audio-Driven 3D Facial Animation via Style-Content Disentanglement

We present a learning-based approach for generating 3D facial animations with the motion style of a specific subject from arbitrary audio inputs. The subject style is learned from a video clip (1-2 minutes) either downloaded from the Internet or captured through an ordinary camera. Traditional methods often require many hours of the subject's video to learn a robust audio-driven model and are thus unsuitable for this task. Recent research efforts aim to train a model from video collections of a few subjects but ignore the discrimination between the subject style and underlying speech content within facial motions, leading to inaccurate style or articulation. To solve the problem, we propose a novel framework that disentangles subject-specific style and speech content from facial motions. The disentanglement is enabled by two novel training mechanisms. One is two-pass style swapping between two random subjects, and the other is joint training of the decomposition network and audio-to-motion network with a shared decoder. After training, the disentangled style is combined with arbitrary audio inputs to generate stylized audio-driven 3D facial animations. Compared with state-of-the-art methods, our approach achieves better results qualitatively and quantitatively, especially in difficult cases like bilabial plosive and bilabial nasal phonemes.
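
The abstract describes a decomposition network and an audio-to-motion network trained jointly with a shared decoder, plus a two-pass style swap between two random subjects. The PyTorch sketch below is only a hypothetical illustration of how such a style-content disentanglement could be wired together; every module name, feature dimension, and loss term is an assumption for illustration and none of it is the authors' implementation.

import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    # Summarizes a facial-motion clip (B, T, motion_dim) into one style code.
    def __init__(self, motion_dim=64, style_dim=16):
        super().__init__()
        self.gru = nn.GRU(motion_dim, 64, batch_first=True)
        self.fc = nn.Linear(64, style_dim)

    def forward(self, motion):
        _, h = self.gru(motion)                  # h: (1, B, 64)
        return self.fc(h[-1])                    # (B, style_dim)


class ContentEncoder(nn.Module):
    # Extracts a per-frame content sequence from motion or audio features.
    def __init__(self, in_dim, content_dim=32):
        super().__init__()
        self.gru = nn.GRU(in_dim, content_dim, batch_first=True)

    def forward(self, x):
        out, _ = self.gru(x)                     # (B, T, content_dim)
        return out


class SharedDecoder(nn.Module):
    # Shared decoder: (content sequence, style code) -> facial motion.
    def __init__(self, content_dim=32, style_dim=16, motion_dim=64):
        super().__init__()
        self.gru = nn.GRU(content_dim + style_dim, 64, batch_first=True)
        self.fc = nn.Linear(64, motion_dim)

    def forward(self, content, style):
        style_seq = style.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.gru(torch.cat([content, style_seq], dim=-1))
        return self.fc(out)                      # (B, T, motion_dim)


motion_content = ContentEncoder(in_dim=64)       # decomposition branch (from motion)
audio_content = ContentEncoder(in_dim=80)        # audio branch (e.g. 80-dim mel features)
style_enc = StyleEncoder()
decoder = SharedDecoder()                        # shared by both branches

motion_a, motion_b = torch.randn(1, 100, 64), torch.randn(1, 100, 64)
audio_a = torch.randn(1, 100, 80)

# Two-pass style swap between subjects A and B: pass 1 renders A's content
# with B's style, pass 2 swaps the style back, so the final output should
# reconstruct A's original motion (a cycle-style consistency loss).
swapped = decoder(motion_content(motion_a), style_enc(motion_b))
recovered = decoder(motion_content(swapped), style_enc(motion_a))
swap_loss = nn.functional.mse_loss(recovered, motion_a)

# Audio-driven branch reuses the same decoder with the disentangled style of A.
pred_from_audio = decoder(audio_content(audio_a), style_enc(motion_a))
audio_loss = nn.functional.mse_loss(pred_from_audio, motion_a)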

Bibliographic Details
Published in: IEEE transactions on visualization and computer graphics. - 1996. - 30(2024), issue 3, 19 March, pages 1803-1820
Main Author: Chai, Yujin (Author)
Other Authors: Shao, Tianjia, Weng, Yanlin, Zhou, Kun
Format: Online Article
Language: English
Published: 2024
In Collection: IEEE transactions on visualization and computer graphics
Subjects: Journal Article; Research Support, U.S. Gov't, Non-P.H.S.; Research Support, Non-U.S. Gov't
LEADER 01000caa a22002652c 4500
001 NLM355201968
003 DE-627
005 20250304150739.0
007 cr uuu---uuuuu
008 231226s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TVCG.2022.3230541  |2 doi 
028 5 2 |a pubmed25n1183.xml 
035 |a (DE-627)NLM355201968 
035 |a (NLM)37015450 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Chai, Yujin  |e verfasserin  |4 aut 
245 1 0 |a Personalized Audio-Driven 3D Facial Animation via Style-Content Disentanglement 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 31.01.2024 
500 |a Date Revised 06.01.2025 
500 |a published: Print-Electronic 
500 |a Citation Status MEDLINE 
520 |a We present a learning-based approach for generating 3D facial animations with the motion style of a specific subject from arbitrary audio inputs. The subject style is learned from a video clip (1-2 minutes) either downloaded from the Internet or captured through an ordinary camera. Traditional methods often require many hours of the subject's video to learn a robust audio-driven model and are thus unsuitable for this task. Recent research efforts aim to train a model from video collections of a few subjects but ignore the discrimination between the subject style and underlying speech content within facial motions, leading to inaccurate style or articulation. To solve the problem, we propose a novel framework that disentangles subject-specific style and speech content from facial motions. The disentanglement is enabled by two novel training mechanisms. One is two-pass style swapping between two random subjects, and the other is joint training of the decomposition network and audio-to-motion network with a shared decoder. After training, the disentangled style is combined with arbitrary audio inputs to generate stylized audio-driven 3D facial animations. Compared with state-of-the-art methods, our approach achieves better results qualitatively and quantitatively, especially in difficult cases like bilabial plosive and bilabial nasal phonemes
650 4 |a Journal Article 
650 4 |a Research Support, U.S. Gov't, Non-P.H.S. 
650 4 |a Research Support, Non-U.S. Gov't 
700 1 |a Shao, Tianjia  |e verfasserin  |4 aut 
700 1 |a Weng, Yanlin  |e verfasserin  |4 aut 
700 1 |a Zhou, Kun  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on visualization and computer graphics  |d 1996  |g 30(2024), 3 vom: 19. März, Seite 1803-1820  |w (DE-627)NLM098269445  |x 1941-0506  |7 nnas 
773 1 8 |g volume:30  |g year:2024  |g number:3  |g day:19  |g month:03  |g pages:1803-1820 
856 4 0 |u http://dx.doi.org/10.1109/TVCG.2022.3230541  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 30  |j 2024  |e 3  |b 19  |c 03  |h 1803-1820