Audio-Visual Speech Emotion Recognition by Disentangling Emotion and Identity Attributes
(3 minutes introduction)

Koichiro Ito (Hitachi, Japan), Takuya Fujioka (Hitachi, Japan), Qinghua Sun (Hitachi, Japan), Kenji Nagamatsu (Hitachi, Japan)

In this paper, we propose an audio-visual speech emotion recognition (AV-SER) method that suppresses interference from the identity attribute by disentangling the emotion attribute from the identity attribute. Our model first disentangles the two attributes within each modality. To achieve this, we introduce a co-attention module: the model extracts the emotion attribute by feeding the identity attribute to the module as conditional features and, conversely, obtains the identity attribute with the emotion attribute as the condition. The model then predicts each attribute from these disentangled features, taking both modalities into account. In addition, to ensure the model's disentanglement capacity, we alternate training between a speech emotion recognition (SER) task as the primary task and an identification task as the auxiliary task, updating only the parameters responsible for each task. Experimental results on the in-the-wild CMU-MOSEI dataset show the effectiveness of our method.
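
The alternating training scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter groups `emotion` and `identity`, the strict even/odd alternation, the quadratic toy loss, and the function names `active_task`, `toy_gradients`, and `train` are all assumptions made for illustration. The abstract does not say which parameters belong to each task or how often the tasks switch. The sketch only shows the mechanism of computing gradients for the whole model while applying the update to the active task's parameters.

```python
def active_task(step):
    """Assumed schedule: even steps train the primary SER task,
    odd steps train the auxiliary identification task."""
    return "emotion" if step % 2 == 0 else "identity"


def toy_gradients(params, targets):
    """Gradient of a toy loss sum((w - t)^2) for every parameter group.
    This stands in for backpropagation through a real model."""
    return {
        name: [2.0 * (w - t) for w, t in zip(weights, targets[name])]
        for name, weights in params.items()
    }


def train(params, targets, num_steps, lr=0.1):
    """Alternate between tasks; update only the active task's group."""
    history = []
    for step in range(num_steps):
        task = active_task(step)
        grads = toy_gradients(params, targets)
        # Gradients exist for both groups, but the inactive group stays frozen.
        for i, g in enumerate(grads[task]):
            params[task][i] -= lr * g
        history.append(task)
    return params, history


if __name__ == "__main__":
    # Hypothetical parameter groups and targets, chosen only for the demo.
    params = {"emotion": [0.0, 0.0], "identity": [0.0]}
    targets = {"emotion": [1.0, -1.0], "identity": [0.5]}
    params, history = train(params, targets, num_steps=4)
    print(history)
    print(params)
```

In a deep learning framework, the same effect is usually obtained by giving each task its own optimizer over its own parameter subset, or by disabling gradients for the frozen part during each step.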

Related Recordings

Metric Learning Based Feature Representation With Gated Fusion Model For Speech Emotion Recognition
(3 minutes introduction)

Yuan Gao, Jiaxing Liu, Longbiao Wang, Jianwu Dang

Speech Emotion Recognition with Multi-task Learning
(3 minutes introduction)

Xingyu Cai, Jiahong Yuan, Renjie Zheng, Liang Huang, Kenneth Church

InterSpeech 2021
