Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning
Emotion recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as healthcare or road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically, embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that training was more robust when it did not start from scratch and the tasks were similar. Regarding the facial emotion recognizers, we propose a framework that consists of a Spatial Transformer Network pre-trained on saliency maps and facial images, followed by a bi-LSTM with an attention mechanism. The error analysis showed that frame-based systems can present problems when used directly to solve a video-based task despite domain adaptation, which opens a new line of research into ways to correct this mismatch and take advantage of the embedded knowledge of these pre-trained models. Finally, by combining these two modalities with a late-fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset under a subject-wise 5-fold cross-validation, classifying eight emotions. The results revealed that both modalities carry relevant information to detect users' emotional state and that their combination improves system performance.
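The late-fusion strategy mentioned in the abstract combines the speech and facial modalities at the decision level. A minimal sketch of one common variant, weighted averaging of the two models' per-class probabilities, is shown below; the weights and example scores are hypothetical and this is not the authors' implementation:

```python
import numpy as np

# The eight emotion classes of the RAVDESS dataset.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def late_fusion(speech_probs, face_probs, alpha=0.5):
    """Fuse per-class probabilities from two unimodal classifiers
    by weighted averaging (one simple late-fusion variant).

    alpha weights the speech modality; (1 - alpha) the facial one.
    Returns the predicted label and the fused probability vector.
    """
    speech_probs = np.asarray(speech_probs, dtype=float)
    face_probs = np.asarray(face_probs, dtype=float)
    fused = alpha * speech_probs + (1.0 - alpha) * face_probs
    return EMOTIONS[int(np.argmax(fused))], fused

# Hypothetical softmax outputs of the two unimodal recognizers
# for a single clip: speech leans "angry", face leans "happy".
speech = [0.05, 0.05, 0.10, 0.05, 0.55, 0.10, 0.05, 0.05]
face   = [0.05, 0.05, 0.40, 0.05, 0.30, 0.05, 0.05, 0.05]
label, fused = late_fusion(speech, face, alpha=0.5)
print(label)  # the fused scores favor "angry" here
```

Because fusion happens after each model produces its own prediction, the two recognizers can be trained and tuned independently, which is one reason late fusion is a popular baseline for multimodal systems.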
Saved in: DOAJ
Main authors: | Cristina Luna-Jiménez, David Griol, Zoraida Callejas, Ricardo Kleinlein, Juan M. Montero, Fernando Fernández-Martínez |
---|---|
Format: | article |
Language: | EN |
Published: | MDPI AG, 2021 |
Subjects: | audio–visual emotion recognition; human–computer interaction; computational paralinguistics; spatial transformers; transfer learning; speech emotion recognition; Chemical technology (TP1-1185) |
Online access: | https://doaj.org/article/0967e0d9064646e7a61ad7169dbcadac |
id | oai:doaj.org-article:0967e0d9064646e7a61ad7169dbcadac
---|---|
DOI | 10.3390/s21227665
ISSN | 1424-8220
Full text | https://www.mdpi.com/1424-8220/21/22/7665
Journal | Sensors, Vol 21, Iss 22, p 7665 (2021)