NaturalSpeech : End-to-End Text-to-Speech Synthesis With Human-Level Quality

Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining th...

Full description

Bibliographic details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 46(2024), 6, 22 June, pages 4234-4245
Main author: Tan, Xu (Author)
Other authors: Chen, Jiawei, Liu, Haohe, Cong, Jian, Zhang, Chen, Liu, Yanqing, Wang, Xi, Leng, Yichong, Yi, Yuanhao, He, Lei, Zhao, Sheng, Qin, Tao, Soong, Frank, Liu, Tie-Yan
Format: Online article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, Non-P.H.S.
LEADER 01000caa a22002652 4500
001 NLM367316056
003 DE-627
005 20250103231833.0
007 cr uuu---uuuuu
008 240120s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2024.3356232  |2 doi 
028 5 2 |a pubmed24n1650.xml 
035 |a (DE-627)NLM367316056 
035 |a (NLM)38241115 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Tan, Xu  |e verfasserin  |4 aut 
245 1 0 |a NaturalSpeech  |b End-to-End Text-to-Speech Synthesis With Human-Level Quality 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 07.05.2024 
500 |a Date Revised 03.01.2025 
500 |a published: Print-Electronic 
500 |a Citation Status MEDLINE 
520 |a Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time. 
650 4 |a Journal Article 
650 4 |a Research Support, Non-U.S. Gov't 
650 4 |a Research Support, U.S. Gov't, Non-P.H.S. 
700 1 |a Chen, Jiawei  |e verfasserin  |4 aut 
700 1 |a Liu, Haohe  |e verfasserin  |4 aut 
700 1 |a Cong, Jian  |e verfasserin  |4 aut 
700 1 |a Zhang, Chen  |e verfasserin  |4 aut 
700 1 |a Liu, Yanqing  |e verfasserin  |4 aut 
700 1 |a Wang, Xi  |e verfasserin  |4 aut 
700 1 |a Leng, Yichong  |e verfasserin  |4 aut 
700 1 |a Yi, Yuanhao  |e verfasserin  |4 aut 
700 1 |a He, Lei  |e verfasserin  |4 aut 
700 1 |a Zhao, Sheng  |e verfasserin  |4 aut 
700 1 |a Qin, Tao  |e verfasserin  |4 aut 
700 1 |a Soong, Frank  |e verfasserin  |4 aut 
700 1 |a Liu, Tie-Yan  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 46(2024), 6 vom: 22. Juni, Seite 4234-4245  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:46  |g year:2024  |g number:6  |g day:22  |g month:06  |g pages:4234-4245 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2024.3356232  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 46  |j 2024  |e 6  |b 22  |c 06  |h 4234-4245