Personalized Audio-Driven 3D Facial Animation via Style-Content Disentanglement

Bibliographic details
Published in: IEEE Transactions on Visualization and Computer Graphics. - 1996. - Vol. 30 (2024), no. 3, 19 March, pp. 1803-1820
Main author: Chai, Yujin (Author)
Other authors: Shao, Tianjia; Weng, Yanlin; Zhou, Kun
Format: Online article
Language: English
Published: 2024
Collection: IEEE Transactions on Visualization and Computer Graphics
Subjects: Journal Article; Research Support, U.S. Gov't, Non-P.H.S.; Research Support, Non-U.S. Gov't
Description
Abstract: We present a learning-based approach for generating 3D facial animations with the motion style of a specific subject from arbitrary audio inputs. The subject style is learned from a video clip (1-2 minutes) either downloaded from the Internet or captured through an ordinary camera. Traditional methods often require many hours of the subject's video to learn a robust audio-driven model and are thus unsuitable for this task. Recent research efforts aim to train a model from video collections of a few subjects but ignore the discrimination between the subject style and underlying speech content within facial motions, leading to inaccurate style or articulation. To solve the problem, we propose a novel framework that disentangles subject-specific style and speech content from facial motions. The disentanglement is enabled by two novel training mechanisms. One is two-pass style swapping between two random subjects, and the other is joint training of the decomposition network and audio-to-motion network with a shared decoder. After training, the disentangled style is combined with arbitrary audio inputs to generate stylized audio-driven 3D facial animations. Compared with state-of-the-art methods, our approach achieves better results qualitatively and quantitatively, especially in difficult cases like bilabial plosive and bilabial nasal phonemes.
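
The abstract outlines the core mechanism: style and content encoders feeding a shared decoder, trained with two-pass style swapping between two random subjects. Below is a minimal, illustrative Python (PyTorch) sketch of that idea only; the module names (StyleEncoder, ContentEncoder, SharedDecoder), layer choices, dimensions, and the exact loss are assumptions made for illustration and are not taken from the paper.

# Illustrative sketch only: module names, shapes, and the loss are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Maps a motion sequence to a single subject-style code (hypothetical design)."""
    def __init__(self, motion_dim=64, style_dim=128):
        super().__init__()
        self.gru = nn.GRU(motion_dim, style_dim, batch_first=True)

    def forward(self, motion):                   # motion: (B, T, motion_dim)
        _, h = self.gru(motion)
        return h[-1]                             # (B, style_dim)

class ContentEncoder(nn.Module):
    """Maps a motion sequence to per-frame speech-content codes (hypothetical design)."""
    def __init__(self, motion_dim=64, content_dim=128):
        super().__init__()
        self.gru = nn.GRU(motion_dim, content_dim, batch_first=True)

    def forward(self, motion):
        out, _ = self.gru(motion)
        return out                               # (B, T, content_dim)

class SharedDecoder(nn.Module):
    """Reconstructs motion from (content, style); the same decoder would be shared
    with the audio-to-motion branch during joint training."""
    def __init__(self, content_dim=128, style_dim=128, motion_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(content_dim + style_dim, 256), nn.ReLU(),
            nn.Linear(256, motion_dim))

    def forward(self, content, style):           # content: (B, T, C), style: (B, S)
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.mlp(torch.cat([content, style], dim=-1))

def two_pass_style_swap_loss(enc_s, enc_c, dec, motion_a, motion_b):
    """Swap styles between two subjects, then swap back; the second pass should
    recover the original motions if style and content are truly disentangled."""
    sa, sb = enc_s(motion_a), enc_s(motion_b)
    ca, cb = enc_c(motion_a), enc_c(motion_b)
    # Pass 1: combine each subject's content with the other subject's style.
    fake_ab = dec(ca, sb)
    fake_ba = dec(cb, sa)
    # Pass 2: re-encode the swapped motions and swap the styles back.
    rec_a = dec(enc_c(fake_ab), enc_s(fake_ba))
    rec_b = dec(enc_c(fake_ba), enc_s(fake_ab))
    return nn.functional.l1_loss(rec_a, motion_a) + nn.functional.l1_loss(rec_b, motion_b)

In the joint-training setup described above, an audio encoder producing content codes from speech features would presumably feed the same SharedDecoder, so the audio-to-motion branch and the decomposition branch share decoder weights; at inference, a style code extracted from the short reference clip is paired with content derived from arbitrary audio.
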
Description: Date Completed 31.01.2024
Date Revised 06.01.2025
published: Print-Electronic
Citation Status MEDLINE
ISSN: 1941-0506
DOI: 10.1109/TVCG.2022.3230541