NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality

Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining th...


Bibliographic Details
Published in:	IEEE transactions on pattern analysis and machine intelligence. - 1979. - 46(2024), 6 (22 June), pages 4234-4245
Main Author:	Tan, Xu (Author)
Other Authors:	Chen, Jiawei, Liu, Haohe, Cong, Jian, Zhang, Chen, Liu, Yanqing, Wang, Xi, Leng, Yichong, Yi, Yuanhao, He, Lei, Zhao, Sheng, Qin, Tao, Soong, Frank, Liu, Tie-Yan
Format:	Online Article
Language:	English
Published:	2024
Access to the parent work:	IEEE transactions on pattern analysis and machine intelligence
Subjects:	Journal Article; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, Non-P.H.S.


LEADER	01000caa a22002652 4500
001	NLM367316056
003	DE-627
005	20250103231833.0
007	cr uuu---uuuuu
008	240120s2024 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TPAMI.2024.3356232 \|2 doi
028	5	2	\|a pubmed24n1650.xml
035			\|a (DE-627)NLM367316056
035			\|a (NLM)38241115
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Tan, Xu \|e verfasserin \|4 aut
245	1	0	\|a NaturalSpeech \|b End-to-End Text-to-Speech Synthesis With Human-Level Quality
264		1	\|c 2024
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Completed 07.05.2024
500			\|a Date Revised 03.01.2025
500			\|a published: Print-Electronic
500			\|a Citation Status MEDLINE
520			\|a Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test at p-level p >> 0.05, which demonstrates, for the first time, no statistically significant difference from human recordings.
650		4	\|a Journal Article
650		4	\|a Research Support, Non-U.S. Gov't
650		4	\|a Research Support, U.S. Gov't, Non-P.H.S.
700	1		\|a Chen, Jiawei \|e verfasserin \|4 aut
700	1		\|a Liu, Haohe \|e verfasserin \|4 aut
700	1		\|a Cong, Jian \|e verfasserin \|4 aut
700	1		\|a Zhang, Chen \|e verfasserin \|4 aut
700	1		\|a Liu, Yanqing \|e verfasserin \|4 aut
700	1		\|a Wang, Xi \|e verfasserin \|4 aut
700	1		\|a Leng, Yichong \|e verfasserin \|4 aut
700	1		\|a Yi, Yuanhao \|e verfasserin \|4 aut
700	1		\|a He, Lei \|e verfasserin \|4 aut
700	1		\|a Zhao, Sheng \|e verfasserin \|4 aut
700	1		\|a Qin, Tao \|e verfasserin \|4 aut
700	1		\|a Soong, Frank \|e verfasserin \|4 aut
700	1		\|a Liu, Tie-Yan \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on pattern analysis and machine intelligence \|d 1979 \|g 46(2024), 6 vom: 22. Juni, Seite 4234-4245 \|w (DE-627)NLM098212257 \|x 1939-3539 \|7 nnns
773	1	8	\|g volume:46 \|g year:2024 \|g number:6 \|g day:22 \|g month:06 \|g pages:4234-4245
856	4	0	\|u http://dx.doi.org/10.1109/TPAMI.2024.3356232 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d 46 \|j 2024 \|e 6 \|b 22 \|c 06 \|h 4234-4245