Multimodal Emotion Recognition Based on Facial Expressions, Speech, and EEG

Jiahui Pan; Weijie Fang; Zhihang Zhang; Bingzhi Chen; Zheng Zhang; Shuihua Wang

doi:10.1109/OJEMB.2023.3240280

Corresponding author: Bingzhi Chen

Research output: Contribution to journal › Article › peer-review

31 Citations (Scopus)

Abstract

Goal: Emotion recognition is an essential human-machine interaction task and has become an active research area in recent decades. Although previous approaches to emotion classification have achieved high performance, several challenges remain open: 1) how to effectively recognize emotions using different modalities, and 2) given the increasing computing power required for deep learning, how to provide real-time detection and improve the robustness of deep neural networks. Method: In this paper, we propose a deep learning-based multimodal emotion recognition (MER) framework called Deep-Emotion, which adaptively integrates the most discriminative features from facial expressions, speech, and electroencephalogram (EEG) signals to improve MER performance. The Deep-Emotion framework consists of three branches: a facial branch, a speech branch, and an EEG branch. The facial branch uses an improved GhostNet neural network, proposed in this paper, for feature extraction; it alleviates overfitting during training and improves classification accuracy compared with the original GhostNet. For the speech branch, we propose a lightweight fully convolutional neural network (LFCNN) for efficient extraction of speech emotion features. For the EEG branch, we propose a tree-like LSTM (tLSTM) model capable of fusing multi-stage features for EEG emotion feature extraction. Finally, we adopt a decision-level fusion strategy to integrate the recognition results of the three modalities, yielding more comprehensive and accurate performance. Result and Conclusions: Extensive experiments on the CK+, EMO-DB, and MAHNOB-HCI datasets demonstrate the advanced performance of the proposed Deep-Emotion method, as well as the feasibility and superiority of the MER approach.
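The abstract names decision-level fusion of the three branches but does not state the fusion rule. As a minimal sketch only, the Python below assumes each branch outputs a probability vector over the same emotion classes and combines them by a weighted average. The function names, the weights, and the example probabilities are hypothetical illustrations, not the authors' implementation or results.

```python
from typing import Dict, List, Sequence


def fuse_decisions(
    branch_probs: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of per-branch class probabilities.

    Assumption: weighted averaging is one common decision-level fusion
    rule; the abstract does not say which rule Deep-Emotion uses.
    """
    if not branch_probs:
        raise ValueError("at least one branch is required")
    n_classes = len(next(iter(branch_probs.values())))
    total_weight = sum(weights[name] for name in branch_probs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    fused = [0.0] * n_classes
    for name, probs in branch_probs.items():
        if len(probs) != n_classes:
            raise ValueError(f"branch {name!r} has a different class count")
        w = weights[name] / total_weight  # normalize so weights sum to 1
        for i, p in enumerate(probs):
            fused[i] += w * p
    return fused


def predict(fused: Sequence[float]) -> int:
    """Return the index of the class with the highest fused probability."""
    return max(range(len(fused)), key=lambda i: fused[i])


# Hypothetical per-branch outputs over three emotion classes.
example_probs = {
    "facial": [0.6, 0.3, 0.1],
    "speech": [0.2, 0.5, 0.3],
    "eeg": [0.3, 0.4, 0.3],
}
equal_weights = {"facial": 1.0, "speech": 1.0, "eeg": 1.0}
face_heavy_weights = {"facial": 0.6, "speech": 0.2, "eeg": 0.2}

equal_prediction = predict(fuse_decisions(example_probs, equal_weights))
face_heavy_prediction = predict(fuse_decisions(example_probs, face_heavy_weights))
```

With equal weights the speech and EEG branches outvote the facial branch, while a larger facial weight flips the decision. This shows how the choice of weights shapes a decision-level fusion result independently of how each branch extracts its features.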

Original language	English
Pages (from-to)	396-403
Number of pages	8
Journal	IEEE Open Journal of Engineering in Medicine and Biology
Volume	5
DOIs	https://doi.org/10.1109/OJEMB.2023.3240280
Publication status	Published - 2024
Externally published	Yes

Keywords

Multimodal emotion recognition
electroencephalogram
facial expressions
speech

Cite this

@article{0cda6191606541c899e2bf1416802f5e,

title = "Multimodal Emotion Recognition Based on Facial Expressions, Speech, and EEG",

keywords = "Multimodal emotion recognition, electroencephalogram, facial expressions, speech",

author = "Jiahui Pan and Weijie Fang and Zhihang Zhang and Bingzhi Chen and Zheng Zhang and Shuihua Wang",

note = "Publisher Copyright: {\textcopyright} 2020 IEEE.",

year = "2024",

doi = "10.1109/OJEMB.2023.3240280",

language = "English",

volume = "5",

pages = "396--403",

journal = "IEEE Open Journal of Engineering in Medicine and Biology",

issn = "2644-1276",

}

TY - JOUR

T1 - Multimodal Emotion Recognition Based on Facial Expressions, Speech, and EEG

AU - Pan, Jiahui

AU - Fang, Weijie

AU - Zhang, Zhihang

AU - Chen, Bingzhi

AU - Zhang, Zheng

AU - Wang, Shuihua

PY - 2024

Y1 - 2024

KW - Multimodal emotion recognition

KW - electroencephalogram

KW - facial expressions

KW - speech

UR - http://www.scopus.com/inward/record.url?scp=85148460513&partnerID=8YFLogxK

U2 - 10.1109/OJEMB.2023.3240280

DO - 10.1109/OJEMB.2023.3240280

M3 - Article

AN - SCOPUS:85148460513

SN - 2644-1276

VL - 5

SP - 396

EP - 403

JO - IEEE Open Journal of Engineering in Medicine and Biology

JF - IEEE Open Journal of Engineering in Medicine and Biology

ER -
