Please use this identifier to cite or link to this item: http://hdl.handle.net/10119/11514

Title: A Hybrid TTS between Unit Selection and HMM-based TTS under limited data conditions
Authors: Phung, Trung-Nghia
Luong, Chi Mai
Akagi, Masato
Keywords: TTS
unit selection
Temporal Decomposition
Issue Date: 2013-09-02
Publisher: International Speech Communication Association
Publication Title: Proceedings of 8th ISCA Speech Synthesis Workshop
Start Page: 279
End Page: 284
Abstract: The intelligibility of HMM-based TTS can reach that of the original speech, but HMM-based TTS is far from natural. In contrast, unit selection TTS is currently the most natural-sounding TTS, but its intelligibility and its naturalness in segmental duration and timing are not stable. In addition, unit selection needs to store a huge amount of data for concatenation. Recently, hybrid approaches combining these two TTS methods, such as the HMM trajectory tiling TTS (HTT), have been studied to take advantage of both unit selection and HMM-based TTS. However, such methods still require a huge amount of data for rendering. In this paper, a hybrid TTS combining unit selection, HMM-based TTS, and the Modified Restricted Temporal Decomposition (MRTD), named HTD, is proposed with the aim of taking advantage of both unit selection and HMM-based TTS under limited data conditions. Here, temporal decomposition (TD) is a sparse representation of speech that decomposes a spectral or prosodic sequence into two mutually independent components: static event targets and corresponding dynamic event functions. MRTD is a compact but efficient version of TD. Previous studies show that the dynamic event functions of MRTD are related to the perception of speech intelligibility, which is core linguistic or content information, while the static event targets of MRTD convey non-linguistic or style information. Therefore, by borrowing the concepts of unit selection to render the event targets of the spectral sequence, and directly borrowing the prosodic sequences and the dynamic event functions of the spectral sequence generated by HMM-based TTS, the proposed HTD can reach the naturalness of unit selection and the intelligibility of HMM-based TTS.
Because the event functions of MRTD are smooth, appropriate smoothness in the synthesized speech can still be ensured when it is rendered with a small amount of data, which makes the proposed HTD usable under limited data conditions. Experimental results with a small Vietnamese dataset, used to simulate a "limited data condition", show that the proposed HTD outperformed all of HMM-based TTS, unit selection, and HTT under a limited data condition.
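Note: the decomposition described in the abstract can be pictured with a small sketch. The Python below is illustrative only and is not the authors' implementation. It assumes the usual TD model, in which each frame is a weighted sum of event targets, y(n) ≈ Σ_k a_k φ_k(n), and it assumes a restricted second-order form in which only two adjacent event functions overlap and they sum to one. The linear cross-fade shape, the function names `event_functions` and `reconstruct`, and all numeric values are hypothetical.

```python
# Illustrative sketch only: TD-style reconstruction y(n) ~= sum_k a_k * phi_k(n).
# The linear event-function shape and every name and value below are
# assumptions made for illustration; none are taken from the paper.

def event_functions(event_frames, n_frames):
    """Build one event function per event over n_frames frames.

    Assumption: between two adjacent events only those two functions are
    non-zero and they sum to one. A linear cross-fade stands in for the
    estimated shapes. event_frames must be strictly increasing.
    """
    phi = [[0.0] * n_frames for _ in event_frames]
    for n in range(n_frames):
        if n <= event_frames[0]:
            phi[0][n] = 1.0
        elif n >= event_frames[-1]:
            phi[-1][n] = 1.0
        else:
            for k in range(len(event_frames) - 1):
                left, right = event_frames[k], event_frames[k + 1]
                if left <= n <= right:
                    w = (n - left) / (right - left)
                    phi[k][n] = 1.0 - w
                    phi[k + 1][n] = w
                    break
    return phi


def reconstruct(targets, phi):
    """Recombine event targets (one vector per event) with event functions."""
    n_frames = len(phi[0])
    dim = len(targets[0])
    return [
        [sum(targets[k][d] * phi[k][n] for k in range(len(targets)))
         for d in range(dim)]
        for n in range(n_frames)
    ]


# Hypothetical inputs: event timing as if taken from an HMM-generated
# spectral trajectory, and event targets as if chosen from a small unit
# database by a unit-selection-style search.
hmm_event_frames = [0, 4, 8]
selected_targets = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

phi = event_functions(hmm_event_frames, n_frames=9)
trajectory = reconstruct(selected_targets, phi)
```

In this sketch the trajectory passes exactly through each selected target at its event frame and moves smoothly between targets in between, which mirrors the abstract's point that the timing component and the target component can come from different sources.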
Rights: Copyright (C) 2013 International Speech Communication Association. Trung-Nghia Phung, Chi Mai Luong, Masato Akagi, Proceedings of 8th ISCA Speech Synthesis Workshop, 2013, pp.279-284.
URI: http://hdl.handle.net/10119/11514
Resource Type: publisher
Appears in Collections: b11-1. Conference Papers


File: ssw8_PS3-6_Phung.pdf (1342Kb, Adobe PDF)



Contact: Japan Advanced Institute of Science and Technology, Research Promotion Division, Library Information Section (ir-sys[at]ml.jaist.ac.jp)