Please use this identifier to cite or link to this item: http://hdl.handle.net/10119/11514

Title: A Hybrid TTS between Unit Selection and HMM-based TTS under limited data conditions
Authors: Phung, Trung-Nghia; Luong, Chi Mai; Akagi, Masato
Keywords: TTS; unit selection; HMM-based; Temporal Decomposition; HTT
Issue Date: 2013-09-02
Publisher: International Speech Communication Association
Magazine name: Proceedings of 8th ISCA Speech Synthesis Workshop
Start page: 279
End page: 284
Abstract: The intelligibility of HMM-based TTS can reach that of the original speech, but its output is far from natural. In contrast, unit selection TTS is currently the most natural-sounding approach, but its intelligibility and its naturalness in segmental duration and timing are not stable. Unit selection also needs to store a huge amount of data for concatenation. Recently, hybrid approaches combining the two, such as HMM trajectory tiling TTS (HTT), have been studied to take advantage of both unit selection and HMM-based TTS. However, such methods still require a huge amount of data for rendering. This paper proposes HTD, a hybrid TTS that combines unit selection, HMM-based TTS, and the Modified Restricted Temporal Decomposition (MRTD), with the aim of obtaining the advantages of both unit selection and HMM-based TTS under limited data conditions. Here, temporal decomposition (TD) is a sparse representation of speech that decomposes a spectral or prosodic sequence into two mutually independent components: static event targets and the corresponding dynamic event functions. MRTD is a compact but efficient version of TD. Previous studies show that the dynamic event functions of MRTD are related to the perception of speech intelligibility, a core part of the linguistic (content) information, while the static event targets of MRTD convey non-linguistic (style) information. Therefore, HTD borrows the concepts of unit selection to render the event targets of the spectral sequence, and directly borrows the prosodic sequences and the dynamic event functions of the spectral sequence generated by HMM-based TTS. In this way, the proposed HTD can reach the naturalness of unit selection and the intelligibility of HMM-based TTS.
Because the event functions of MRTD are smooth, appropriate smoothness in the synthesized speech can still be ensured when it is rendered from a small amount of data, which makes the proposed HTD usable under limited data conditions. Experimental results on a small Vietnamese dataset, used to simulate a “limited data condition”, show that the proposed HTD outperformed HMM-based TTS, unit selection, and HTT under a limited data condition.
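The decomposition described in the abstract can be illustrated with a small sketch. It assumes the usual temporal decomposition model, in which each frame is a weighted sum of event targets, y(n) ≈ Σ_k a_k φ_k(n), with at most two adjacent event functions overlapping and summing to one. This is not the authors' implementation: the piecewise-linear event functions stand in for functions that MRTD would estimate from the HMM-generated spectral sequence, the nearest-neighbour lookup stands in for unit-selection costs, and all function names and toy values are hypothetical.

```python
def event_functions(event_frames, n_frames):
    """Piecewise-linear event functions (an assumed stand-in for MRTD output).

    Returns a K x N nested list phi; between two adjacent events only those
    two functions are non-zero, and they sum to one at every frame.
    """
    n_events = len(event_frames)
    phi = [[0.0] * n_frames for _ in range(n_events)]
    for n in range(n_frames):
        if n <= event_frames[0]:
            phi[0][n] = 1.0
        elif n >= event_frames[-1]:
            phi[-1][n] = 1.0
        else:
            for k in range(n_events - 1):
                left, right = event_frames[k], event_frames[k + 1]
                if left <= n < right:
                    w = (n - left) / (right - left)
                    phi[k][n] = 1.0 - w
                    phi[k + 1][n] = w
                    break
    return phi


def reconstruct(targets, phi):
    """Rebuild the N x D frame sequence as y(n) = sum_k a_k * phi_k(n)."""
    n_frames = len(phi[0])
    dim = len(targets[0])
    return [
        [sum(targets[k][d] * phi[k][n] for k in range(len(targets)))
         for d in range(dim)]
        for n in range(n_frames)
    ]


def nearest_target(target, inventory):
    """Pick the inventory vector closest in squared Euclidean distance."""
    return min(inventory,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(target, c)))


def hybrid_render(hmm_targets, phi, inventory):
    """Keep the HMM-side event functions, swap in targets from natural speech."""
    natural_targets = [nearest_target(t, inventory) for t in hmm_targets]
    return reconstruct(natural_targets, phi)


# Toy example: 3 events over 9 frames, 2-dimensional "spectral" vectors.
event_frames = [0, 4, 8]
phi = event_functions(event_frames, 9)
hmm_targets = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]      # from the HMM side
inventory = [[0.1, -0.1], [1.2, 0.9], [-0.1, 2.2]]      # small natural inventory
frames = hybrid_render(hmm_targets, phi, inventory)
```

In this sketch the frames at the event locations equal the selected natural targets, and the frames in between are smooth interpolations governed only by the event functions, which is the property the abstract relies on when the inventory is small.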
Rights: Copyright (C) 2013 International Speech Communication Association. Trung-Nghia Phung, Chi Mai Luong, Masato Akagi, Proceedings of 8th ISCA Speech Synthesis Workshop, 2013, pp.279-284.
URI: http://hdl.handle.net/10119/11514
Material Type: publisher
Appears in Collections: b11-1. 会議発表論文・発表資料 (Conference Papers)

Files in This Item:

File: ssw8_PS3-6_Phung.pdf
Size: 1342 Kb
Format: Adobe PDF

All items in DSpace are protected by copyright, with all rights reserved.