JAIST Repository >
School of Information Science >
Articles >
Journal Articles >

Please use this identifier to cite or link to this item: http://hdl.handle.net/10119/11535

Title: Improving Naturalness of HMM-Based TTS Trained with Limited Data by Temporal Decomposition
Authors: PHUNG, Trung-Nghia
PHAN, Thanh-Son
VU, Thang Tat
LUONG, Mai Chi
AKAGI, Masato
Keywords: text to speech
HMM-based TTS
hybrid TTS
limited data
temporal decomposition
Issue Date: 2013-11-01
Publisher: 電子情報通信学会
Magazine name: IEICE TRANSACTIONS on Information and Systems
Volume: E96-D
Number: 11
Start page: 2417
End page: 2426
Abstract: The most important advantage of HMM-based TTS is its highly intelligible. However, speech synthesized by HMM-based TTS is muffled and far from natural, especially under limited data conditions, which is mainly caused by its over-smoothness. Therefore, the motivation for this paper is to improve the naturalness of HMM-based TTS trained under limited data conditions while preserving its intelligibility. To achieve this motivation, a hybrid TTS between HMM-based TTS and the modified restricted Temporal Decomposition (MRTD), named HTD in this paper, was proposed. Here, TD is an interpolation model of decomposing a spectral or prosodic sequence of speech into sparse event targets and dynamic event functions, and MRTD is one simplified version of TD. With a determination of event functions close to the concept of co-articulation in speech, MRTD can synthesize smooth speech and the smoothness in synthesized speech can be adjusted by manipulating event targets of MRTD. Previous studies have also found that event functions of MRTD can represent linguistic information of speech, which is important to perceive speech intelligibility, while sparse event targets can convey the non-linguistics information, which is important to perceive the naturalness of speech. Therefore, prosodic trajectories and MRTD event functions of the spectral trajectory generated by HMM-based TTS were kept unchanged to preserve the high and stable intelligibility of HMM-based TTS. Whereas MRTD event targets of the spectral trajectory generated by HMM-based TTS were rendered with an original speech database to enhance the naturalness of synthesized speech. Experimental results with small Vietnamese datasets revealed that the proposed HTD was equivalent to HMM-based TTS in terms of intelligibility but was superior to it in terms of naturalness. Further discussions show that HTD had a small footprint. Therefore, the proposed HTD showed its strong efficiency under limited data conditions.
Rights: Copyright (C)2013 IEICE. Trung-Nghia PHUNG, Thanh-Son PHAN, Thang Tat VU, Mai Chi LUONG, and Masato AKAGI, IEICE TRANSACTIONS on Information and Systems, E96-D(11), 2013, 2417-2426. http://www.ieice.org/jpn/trans_online/
URI: http://hdl.handle.net/10119/11535
Material Type: publisher
Appears in Collections:b10-1. 雑誌掲載論文 (Journal Articles)

Files in This Item:

File Description SizeFormat
IEICE-D2013_Nghia.pdf2048KbAdobe PDFView/Open

All items in DSpace are protected by copyright, with all rights reserved.

 


Contact : Library Information Section, Japan Advanced Institute of Science and Technology