
Please use this identifier to cite or link to this item: http://hdl.handle.net/10119/17027

Title: Non-parallel Voice Conversion based on Hierarchical Latent Embedding Vector Quantized Variational Autoencoder
Authors: Ho, Tuan Vu
Akagi, Masato
Keywords: Voice Conversion Challenge 2020
cross-lingual
variational autoencoder
hierarchical structure
Issue Date: 2020-10-30
Publisher: International Speech Communication Association
Magazine name: Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020
Start page: 140
End page: 144
DOI: 10.21437/VCC_BC.2020-20
Abstract: This paper proposes a hierarchical latent embedding structure for the Vector Quantized Variational Autoencoder (VQVAE) to improve the performance of non-parallel voice conversion (NPVC) models. Previous studies on NPVC based on the vanilla VQVAE use a single codebook to encode the linguistic information at a fixed temporal scale. However, linguistic structure contains different semantic levels (e.g., phoneme, syllable, word) that span various temporal scales. As a result, the converted speech may contain unnatural pronunciations, which can degrade the naturalness of the speech. To tackle this problem, we propose a hierarchical latent embedding structure comprising several vector quantization blocks that operate at different temporal scales. When trained with a multi-speaker database, our proposed model can encode the voice characteristics into a speaker embedding vector, which can be used in one-shot learning settings. Results from objective and subjective tests indicate that our proposed model outperforms the conventional VQVAE-based model in both intra-lingual and cross-lingual conversion tasks. The official results from Voice Conversion Challenge 2020 reveal that our proposed model achieved the highest naturalness performance among autoencoder-based models in both tasks. Our implementation is being made available at https://github.com/tuanvu92/VCC2020.
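Illustrative note: the idea of quantizing one sequence at several temporal scales, each with its own codebook, can be sketched as below. This is a minimal toy sketch and not the authors' implementation. The function names (`quantize`, `downsample`, `hierarchical_codes`), the use of plain averaging in place of a learned encoder, and the example codebooks are all assumptions made for illustration only.

```python
from typing import List, Sequence

Vector = List[float]


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector (squared L2)."""
    def dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        for frame in frames
    ]


def downsample(frames: Sequence[Vector], factor: int) -> List[Vector]:
    """Average non-overlapping groups of `factor` frames.

    Assumption: averaging stands in for a learned, strided encoder layer.
    Trailing frames that do not fill a whole group are dropped.
    """
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        pooled.append([sum(column) / factor for column in zip(*group)])
    return pooled


def hierarchical_codes(
    frames: Sequence[Vector],
    codebooks: Sequence[Sequence[Vector]],
    factors: Sequence[int],
) -> List[List[int]]:
    """Quantize the same sequence at several temporal scales, one codebook per scale."""
    return [
        quantize(downsample(frames, factor), codebook)
        for codebook, factor in zip(codebooks, factors)
    ]


# Hypothetical 1-dimensional features: four frames, two scales.
frames = [[0.0], [0.2], [1.0], [0.8]]
codebooks = [[[0.0], [1.0]], [[0.1], [0.9]]]  # one toy codebook per scale
print(hierarchical_codes(frames, codebooks, factors=[1, 2]))  # [[0, 0, 1, 1], [0, 1]]
```

The fine scale keeps one code per frame, while the coarse scale yields one code per two frames, which is the sense in which separate quantization blocks can capture units of different duration.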
Rights: Copyright (C) 2020 International Speech Communication Association. Ho, T.V., Akagi, M. (2020) Non-parallel Voice Conversion based on Hierarchical Latent Embedding Vector Quantized Variational Autoencoder. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp.140-144, DOI: 10.21437/VCC_BC.2020-20. http://dx.doi.org/10.21437/VCC_BC.2020-20
URI: http://hdl.handle.net/10119/17027
Material Type: publisher
Appears in Collections: b11-1. 会議発表論文・発表資料 (Conference Papers)

Files in This Item:

File: 3400.pdf
Size: 667Kb
Format: Adobe PDF


 

