Title: Phoneme-based Spectral Voice Conversion Using Temporal Decomposition and Gaussian Mixture Model
Author(s): Nguyen, Binh Phu
Akagi, Masato
Keywords: spectral voice conversion
temporal decomposition
Gaussian mixture model (GMM)
Issue date: 2008-06
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Journal title: Second International Conference on Communications and Electronics, 2008 (ICCE 2008)
Start page: 224
End page: 229
Abstract: Among state-of-the-art voice conversion systems, GMM-based methods are regarded as some of the best. However, the quality of the converted speech is still far from natural. There are three main reasons for this degradation: (i) modeling the distribution of acoustic features often uses unstable frames, which degrades the precision of the GMM parameters; (ii) the transformation function may generate discontinuous features if frames are processed independently; and (iii) an over-smoothing effect occurs in each converted frame. This paper presents a new spectral voice conversion method that addresses the first two drawbacks of standard spectral modification methods: insufficient precision of the GMM parameters and insufficient smoothness of the converted spectra between frames. A speech analysis technique called temporal decomposition (TD), which decomposes speech into event targets and event functions, is used to model the spectral evolution effectively. To improve the estimation of the GMM parameters, we use phoneme-based features of event targets as spectral vectors in the training procedure, which takes into account the relations between spectral parameters within each phoneme and avoids using spectral parameters from transition parts. To enhance the continuity of the speech spectra, we convert only the event targets instead of converting source features to target features frame by frame, and the smoothness of the converted speech is ensured by the shape of the event functions. Experimental results show that the proposed method improves both the speech quality and the speaker individuality of the converted speech.
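
The idea of converting only event targets and letting the event functions supply the smoothness can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the event functions are assumed to be piecewise-linear (in the actual method they are estimated from the speech signal), an affine map stands in for the trained GMM transformation, and all names (`triangular_event_functions`, `convert_targets`, `reconstruct`) and numeric values are hypothetical.

```python
import numpy as np

def triangular_event_functions(event_frames, n_frames):
    """ASSUMPTION: piecewise-linear event functions. phi[k, n] peaks at
    event_frames[k]; only adjacent functions overlap, and they sum to 1
    at every frame."""
    n_events = len(event_frames)
    frames = np.arange(n_frames)
    phi = np.zeros((n_events, n_frames))
    for k in range(n_events):
        unit = np.zeros(n_events)
        unit[k] = 1.0
        # Linear interpolation of a one-hot vector over the event locations.
        phi[k] = np.interp(frames, event_frames, unit)
    return phi

def reconstruct(targets, phi):
    """TD synthesis: frame n is the phi-weighted sum of event targets.
    targets: (K, D), phi: (K, N) -> frames: (N, D)."""
    return phi.T @ targets

def convert_targets(targets, A, b):
    """HYPOTHETICAL stand-in for a trained GMM mapping: a fixed affine map
    applied to each event target."""
    return targets @ A.T + b

# Toy data: 4 events over 41 frames, 3-dimensional spectral vectors.
rng = np.random.default_rng(0)
event_frames = np.array([0, 10, 25, 40])
n_frames = 41
src_targets = rng.normal(size=(4, 3))

phi = triangular_event_functions(event_frames, n_frames)
A = 0.9 * np.eye(3)
b = 0.1 * np.ones(3)

# Only the 4 event targets are converted; the 41 frames are then
# interpolated with the unchanged event functions.
converted_targets = convert_targets(src_targets, A, b)
converted_frames = reconstruct(converted_targets, phi)
```

Because every output frame is a weighted blend of neighbouring converted targets, the trajectory between events cannot jump, regardless of how the targets themselves were mapped.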
