Please use the following identifier to cite this item:
http://hdl.handle.net/10119/11535
|
Title: | Improving Naturalness of HMM-Based TTS Trained with Limited Data by Temporal Decomposition |
Authors: | PHUNG, Trung-Nghia; PHAN, Thanh-Son; VU, Thang Tat; LUONG, Mai Chi; AKAGI, Masato |
Keywords: | text to speech; HMM-based TTS; hybrid TTS; limited data; temporal decomposition |
Issue Date: | 2013-11-01 |
Publisher: | The Institute of Electronics, Information and Communication Engineers (IEICE) |
Journal: | IEICE TRANSACTIONS on Information and Systems |
Volume: | E96-D |
Number: | 11 |
Start Page: | 2417 |
End Page: | 2426 |
Abstract: | The most important advantage of HMM-based TTS is its high intelligibility. However, speech synthesized by HMM-based TTS is muffled and far from natural, especially under limited data conditions, mainly because of its over-smoothness. The goal of this paper is therefore to improve the naturalness of HMM-based TTS trained under limited data conditions while preserving its intelligibility. To achieve this goal, a hybrid TTS combining HMM-based TTS and the modified restricted Temporal Decomposition (MRTD), named HTD in this paper, was proposed. Here, TD is an interpolation model that decomposes a spectral or prosodic sequence of speech into sparse event targets and dynamic event functions, and MRTD is a simplified version of TD. Because its event functions are determined in a way close to the concept of co-articulation in speech, MRTD can synthesize smooth speech, and the smoothness of the synthesized speech can be adjusted by manipulating the MRTD event targets. Previous studies have also found that the event functions of MRTD can represent the linguistic information of speech, which is important for perceived intelligibility, while the sparse event targets can convey the non-linguistic information, which is important for perceived naturalness. Therefore, the prosodic trajectories and the MRTD event functions of the spectral trajectory generated by HMM-based TTS were kept unchanged to preserve the high and stable intelligibility of HMM-based TTS, whereas the MRTD event targets of the spectral trajectory generated by HMM-based TTS were rendered with an original speech database to enhance the naturalness of the synthesized speech. Experimental results with small Vietnamese datasets revealed that the proposed HTD was equivalent to HMM-based TTS in terms of intelligibility but superior to it in terms of naturalness. Further discussion shows that HTD has a small footprint. The proposed HTD therefore showed strong efficiency under limited data conditions. |
Rights: | Copyright (C)2013 IEICE. Trung-Nghia PHUNG, Thanh-Son PHAN, Thang Tat VU, Mai Chi LUONG, and Masato AKAGI, IEICE TRANSACTIONS on Information and Systems, E96-D(11), 2013, 2417-2426. http://www.ieice.org/jpn/trans_online/ |
URI: | http://hdl.handle.net/10119/11535 |
Type: | publisher |
Appears in Collections: | b10-1. Journal Articles
|
Files in This Item:
File | Description | Size | Format |
IEICE-D2013_Nghia.pdf | | 2048Kb | Adobe PDF | View/Open |
|
All items stored in this system are protected by copyright.
|