LEADER |
01000caa a22002652c 4500 |
001 |
NLM390868779 |
003 |
DE-627 |
005 |
20251004231918.0 |
007 |
cr uuu---uuuuu |
008 |
250809s2025 xx |||||o 00| ||eng c |
024 |
7 |
|
|a 10.1109/TPAMI.2025.3597267
|2 doi
|
028 |
5 |
2 |
|a pubmed25n1589.xml
|
035 |
|
|
|a (DE-627)NLM390868779
|
035 |
|
|
|a (NLM)40779382
|
040 |
|
|
|a DE-627
|b ger
|c DE-627
|e rakwb
|
041 |
|
|
|a eng
|
100 |
1 |
|
|a Li, Liang
|e verfasserin
|4 aut
|
245 |
1 |
0 |
|a Dubbing Movies via Hierarchical Phoneme Modeling and Acoustic Diffusion Denoising
|
264 |
|
1 |
|c 2025
|
336 |
|
|
|a Text
|b txt
|2 rdacontent
|
337 |
|
|
|a Computermedien
|b c
|2 rdamedia
|
338 |
|
|
|a Online-Ressource
|b cr
|2 rdacarrier
|
500 |
|
|
|a Date Completed 03.10.2025
|
500 |
|
|
|a Date Revised 03.10.2025
|
500 |
|
|
|a published: Print
|
500 |
|
|
|a Citation Status MEDLINE
|
520 |
|
|
|a Given a piece of text, a video clip, and reference audio, the movie dubbing task (also known as Visual Voice Cloning, V2C) aims to generate speech that clones the reference voice and aligns well with the video in both emotion and lip movement, which is more challenging than conventional text-to-speech synthesis. To align the generated speech with the inherent lip motion of a given silent video, most existing works use each video frame to query textual phonemes. However, this attention operation usually produces mumbled speech, because video frames are finer-grained than phonemes, so the frames corresponding to a single phoneme end up fusing different phonemes. To address this issue, we propose a diffusion-based movie dubbing architecture that improves pronunciation via Hierarchical Phoneme Modeling (HPM) and generates better mel-spectrograms through Acoustic Diffusion Denoising (ADD). We term our model HD-Dubber. Specifically, HPM bridges visual information and the corresponding speech prosody in three ways: (1) aligning lip movement with speech duration at the level of each phoneme unit via contrastive learning; (2) conveying facial expressions to phoneme-level energy and pitch; and (3) injecting global emotions captured from video scenes into the prosody. In turn, ADD exploits a denoising diffusion framework to transform a noise signal into a mel-spectrogram via a parameterized Markov chain conditioned on the textual phonemes and reference audio. ADD introduces two novel denoisers, the Style-adaptive Residual Denoiser (SRD) and the Phoneme-enhanced U-net Denoiser (PUD), to enhance speaker similarity and improve pronunciation quality. Extensive experimental results on three benchmark datasets demonstrate the state-of-the-art performance of the proposed method. The source code and trained models will be made publicly available.
|
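[Editor's note on the 520 abstract: the "parameterized Markov chain conditioned on textual phonemes and reference audio" is not formulated in this record; a minimal sketch, assuming the standard DDPM formulation over mel-spectrogram frames x_t, with c an assumed symbol for the phoneme/reference-audio conditioning, reads:
  q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)                       (forward noising)
  p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2 \mathbf{I}\right)          (learned reverse step)
Starting from Gaussian noise x_T \sim \mathcal{N}(0, \mathbf{I}) and iterating the reverse step for t = T, \dots, 1 yields the mel-spectrogram x_0; per the abstract, the SRD and PUD denoisers would parameterize \mu_\theta.]
|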
650 |
|
4 |
|a Journal Article
|
700 |
1 |
|
|a Cong, Gaoxiang
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Qi, Yuankai
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Zha, Zheng-Jun
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Wu, Qi
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Sheng, Quan Z
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Huang, Qingming
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Yang, Ming-Hsuan
|e verfasserin
|4 aut
|
773 |
0 |
8 |
|i Enthalten in
|t IEEE transactions on pattern analysis and machine intelligence
|d 1979
|g 47(2025), 11 vom: 02. Okt., Seite 10361-10377
|w (DE-627)NLM098212257
|x 1939-3539
|7 nnas
|
773 |
1 |
8 |
|g volume:47
|g year:2025
|g number:11
|g day:02
|g month:10
|g pages:10361-10377
|
856 |
4 |
0 |
|u http://dx.doi.org/10.1109/TPAMI.2025.3597267
|3 Volltext
|
912 |
|
|
|a GBV_USEFLAG_A
|
912 |
|
|
|a SYSFLAG_A
|
912 |
|
|
|a GBV_NLM
|
912 |
|
|
|a GBV_ILN_350
|
951 |
|
|
|a AR
|
952 |
|
|
|d 47
|j 2025
|e 11
|b 02
|c 10
|h 10361-10377
|