JAIST Repository

Please use this identifier to cite or link to this item: http://hdl.handle.net/10119/12208

Title: Toward relaying emotional state for speech-to-speech translator: Estimation of emotional state for synthesizing speech with emotion
Authors: Akagi, Masato
Elbarougy, Reda
Issue Date: 2014-07
Publisher: International Institute of Acoustics and Vibration (IIAV)
Proceedings name: Proceedings of the 21st International Congress on Sound and Vibration (ICSV21)
Start page: 1
End page: 8
Abstract: Most previous studies on Speech-to-Speech Translation (S2ST) focused on processing the linguistic content, directly translating the spoken utterance from the source language to the target language without taking into account paralinguistic and non-linguistic information, such as the emotional state conveyed by the speaker. However, for clear communication it is important to capture the emotional state in the source language and transmit it to the target language. In order to synthesize target speech that conveys the emotional state of the source, a speech emotion recognition system is required to detect the emotional state of the source utterance. The S2ST system should allow the source and target languages to be used interchangeably, i.e. it should be able to detect the emotional state of the source regardless of the language used. This paper proposes a Bilingual Speech Emotion Recognition (BSER) system for detecting the emotional state of the source language in an S2ST system. In natural speech, humans can detect emotional states regardless of the language used; this study therefore demonstrates the feasibility of constructing a global BSER system that recognizes universal emotions. The paper introduces a three-layer model: emotion dimensions in the top layer, semantic primitives in the middle layer, and acoustic features in the bottom layer. Experimental results reveal that the proposed system precisely estimates emotion dimensions cross-lingually, working with Japanese and German. The most important outcome is that, using the proposed normalization method for acoustic features, emotion recognition is found to be language-independent. The system can therefore be extended to estimate the emotional state conveyed in the source language of an S2ST system for several language pairs.
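The three-layer architecture described in the abstract can be sketched in code. This is an illustrative assumption only: the two inter-layer mappings below use simple least-squares regression as a stand-in for whatever estimators the paper actually employs, and the per-corpus z-score normalization is a plausible reading of "the proposed normalization method for acoustic features", not the authors' exact procedure. All names and dimensionalities are hypothetical.

```python
# Hedged sketch of a three-layer model: acoustic features (bottom)
# -> semantic primitives (middle) -> emotion dimensions (top).
# Mappings and normalization are illustrative assumptions, not the
# authors' published method.

import numpy as np

def normalize_per_corpus(features):
    """Z-score each acoustic feature within its own corpus, removing
    corpus/language-specific offsets before the bottom-layer mapping."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + 1e-8)

def fit_linear_map(X, Y):
    """Least-squares linear map X -> Y with a bias term
    (a stand-in for the paper's inter-layer estimator)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def apply_map(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W

# Toy data: 50 utterances, 10 acoustic features, 5 semantic primitives,
# 2 emotion dimensions (e.g. valence and arousal). Sizes are arbitrary.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(50, 10))
primitives = rng.normal(size=(50, 5))
dimensions = rng.normal(size=(50, 2))

X = normalize_per_corpus(acoustic)
W1 = fit_linear_map(X, primitives)            # bottom -> middle layer
W2 = fit_linear_map(primitives, dimensions)   # middle -> top layer

est_dimensions = apply_map(W2, apply_map(W1, X))
print(est_dimensions.shape)  # (50, 2)
```

Because the two mappings are fitted independently, a cross-lingual test would fit them on one language's corpus and, after normalizing the other language's acoustic features within its own corpus, reuse the same mappings unchanged.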
Rights: Copyright (C) 2014 International Institute of Acoustics and Vibration (IIAV). Masato Akagi and Reda Elbarougy, Proceedings of the 21st International Congress on Sound and Vibration (ICSV21), 2014, pp.1-8. This paper is based on one first published in the proceedings of the 21st International Congress on Sound and Vibration, July 2014, and is published here with permission of the International Institute of Acoustics and Vibration (IIAV).
URI: http://hdl.handle.net/10119/12208
Material Type: publisher
Appears in Collections: b11-1. Conference Papers and Presentation Materials

Files in This Item:

File: ICSV2014_Akagi_Reda.pdf
Size: 437 Kb
Format: Adobe PDF

All items in DSpace are protected by copyright, with all rights reserved.

Contact : Library Information Section, Japan Advanced Institute of Science and Technology