Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers

Kah Liang Ong; Chin Poo Lee; Heng Siong Lim; Kian Ming Lim; Ali Alqahtani

doi:10.1109/ACCESS.2023.3321122

Kah Liang Ong, Chin Poo Lee^*, Heng Siong Lim, Kian Ming Lim, Ali Alqahtani

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

17 Citations (Scopus)

Abstract

Speech emotion recognition aims to automatically identify and classify emotions from speech signals. It plays a crucial role in various applications such as human-computer interaction, affective computing, and social robotics. Over the years, researchers have proposed different approaches for speech emotion recognition, leveraging various classifiers and features. However, despite the advancements, existing methods in speech emotion recognition still have certain limitations. Some approaches rely on handcrafted features that may not capture the full complexity of emotional information present in speech signals, while others may suffer from a lack of robustness and generalization when applied to different datasets. To address these challenges, this paper proposes a speech emotion recognition method that combines Mel spectrogram with Short-Time Fourier Transform (Mel-STFT) and the Improved Multiscale Vision Transformers (MViTv2). The Mel-STFT spectrograms capture both the frequency and temporal information of speech signals, providing a more comprehensive representation of the emotional content. The MViTv2 classifier introduces multi-scale visual modeling with different stages and pooling attention mechanisms. MViTv2 incorporates relative positional embeddings and a residual pooling connection to effectively model the interactions between tokens in the space-time structure, preserve essential information, and improve the efficiency of the model. Experimental results demonstrate that the proposed method generalizes well on different datasets, achieving an accuracy of 91.51% on the Emo-DB dataset, 81.75% on the RAVDESS dataset, and 64.03% on the IEMOCAP dataset.
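
The Mel-STFT front end described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the window length, hop size, number of mel bands, the Hann window, and the HTK-style mel formula are all assumptions chosen for illustration, since this record does not state the paper's settings. The function names are hypothetical, and the MViTv2 classifier is not covered.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel scale (an assumed convention; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def stft_power(signal, n_fft=512, hop=128):
    """Power spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Shape: (n_frames, n_fft // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, mid, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def mel_spectrogram_db(signal, sr, n_fft=512, hop=128, n_mels=64):
    """Log-mel spectrogram in decibels, shape (n_frames, n_mels)."""
    power = stft_power(signal, n_fft, hop)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return 10.0 * np.log10(mel + 1e-10)  # small offset avoids log(0)
```

With these assumed defaults, one second of 16 kHz audio yields a 122 x 64 array (time frames by mel bands), which could then be treated as a single-channel image for a vision-transformer classifier.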

Original language	English
Pages (from-to)	108571-108579
Number of pages	9
Journal	IEEE Access
Volume	11
DOIs	https://doi.org/10.1109/ACCESS.2023.3321122
Publication status	Published - 2023
Externally published	Yes

Keywords

Emo-DB
IEMOCAP
improved multiscale vision transformers
mel spectrogram
mel spectrogram with short-time Fourier transform
RAVDESS
spectrogram
Speech
speech emotion
speech emotion recognition
vision transformer

Access to Document

10.1109/ACCESS.2023.3321122

Cite this

@article{b5c4895f5a7340ed82068252b6dc1909,
title = "Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers",
abstract = "Speech emotion recognition aims to automatically identify and classify emotions from speech signals. It plays a crucial role in various applications such as human-computer interaction, affective computing, and social robotics. Over the years, researchers have proposed different approaches for speech emotion recognition, leveraging various classifiers and features. However, despite the advancements, existing methods in speech emotion recognition still have certain limitations. Some approaches rely on handcrafted features that may not capture the full complexity of emotional information present in speech signals, while others may suffer from a lack of robustness and generalization when applied to different datasets. To address these challenges, this paper proposes a speech emotion recognition method that combines Mel spectrogram with Short-Time Fourier Transform (Mel-STFT) and the Improved Multiscale Vision Transformers (MViTv2). The Mel-STFT spectrograms capture both the frequency and temporal information of speech signals, providing a more comprehensive representation of the emotional content. The MViTv2 classifier introduces multi-scale visual modeling with different stages and pooling attention mechanisms. MViTv2 incorporates relative positional embeddings and a residual pooling connection to effectively model the interactions between tokens in the space-time structure, preserve essential information, and improve the efficiency of the model. Experimental results demonstrate that the proposed method generalizes well on different datasets, achieving an accuracy of 91.51% on the Emo-DB dataset, 81.75% on the RAVDESS dataset, and 64.03% on the IEMOCAP dataset.",
keywords = "Emo-DB, IEMOCAP, improved multiscale vision transformers, mel spectrogram, mel spectrogram with short-time Fourier transform, RAVDESS, spectrogram, Speech, speech emotion, speech emotion recognition, vision transformer",
author = "Ong, {Kah Liang} and Lee, {Chin Poo} and Lim, {Heng Siong} and Lim, {Kian Ming} and Alqahtani, Ali",
note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",
year = "2023",
doi = "10.1109/ACCESS.2023.3321122",
language = "English",
volume = "11",
pages = "108571--108579",
journal = "IEEE Access",
issn = "2169-3536",
}

TY - JOUR
T1 - Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers
AU - Ong, Kah Liang
AU - Lee, Chin Poo
AU - Lim, Heng Siong
AU - Lim, Kian Ming
AU - Alqahtani, Ali
PY - 2023
Y1 - 2023
N2 - Speech emotion recognition aims to automatically identify and classify emotions from speech signals. It plays a crucial role in various applications such as human-computer interaction, affective computing, and social robotics. Over the years, researchers have proposed different approaches for speech emotion recognition, leveraging various classifiers and features. However, despite the advancements, existing methods in speech emotion recognition still have certain limitations. Some approaches rely on handcrafted features that may not capture the full complexity of emotional information present in speech signals, while others may suffer from a lack of robustness and generalization when applied to different datasets. To address these challenges, this paper proposes a speech emotion recognition method that combines Mel spectrogram with Short-Time Fourier Transform (Mel-STFT) and the Improved Multiscale Vision Transformers (MViTv2). The Mel-STFT spectrograms capture both the frequency and temporal information of speech signals, providing a more comprehensive representation of the emotional content. The MViTv2 classifier introduces multi-scale visual modeling with different stages and pooling attention mechanisms. MViTv2 incorporates relative positional embeddings and a residual pooling connection to effectively model the interactions between tokens in the space-time structure, preserve essential information, and improve the efficiency of the model. Experimental results demonstrate that the proposed method generalizes well on different datasets, achieving an accuracy of 91.51% on the Emo-DB dataset, 81.75% on the RAVDESS dataset, and 64.03% on the IEMOCAP dataset.
AB - Speech emotion recognition aims to automatically identify and classify emotions from speech signals. It plays a crucial role in various applications such as human-computer interaction, affective computing, and social robotics. Over the years, researchers have proposed different approaches for speech emotion recognition, leveraging various classifiers and features. However, despite the advancements, existing methods in speech emotion recognition still have certain limitations. Some approaches rely on handcrafted features that may not capture the full complexity of emotional information present in speech signals, while others may suffer from a lack of robustness and generalization when applied to different datasets. To address these challenges, this paper proposes a speech emotion recognition method that combines Mel spectrogram with Short-Time Fourier Transform (Mel-STFT) and the Improved Multiscale Vision Transformers (MViTv2). The Mel-STFT spectrograms capture both the frequency and temporal information of speech signals, providing a more comprehensive representation of the emotional content. The MViTv2 classifier introduces multi-scale visual modeling with different stages and pooling attention mechanisms. MViTv2 incorporates relative positional embeddings and a residual pooling connection to effectively model the interactions between tokens in the space-time structure, preserve essential information, and improve the efficiency of the model. Experimental results demonstrate that the proposed method generalizes well on different datasets, achieving an accuracy of 91.51% on the Emo-DB dataset, 81.75% on the RAVDESS dataset, and 64.03% on the IEMOCAP dataset.
KW - Emo-DB
KW - IEMOCAP
KW - improved multiscale vision transformers
KW - mel spectrogram
KW - mel spectrogram with short-time Fourier transform
KW - RAVDESS
KW - spectrogram
KW - Speech
KW - speech emotion
KW - speech emotion recognition
KW - vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85174800658&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3321122
DO - 10.1109/ACCESS.2023.3321122
M3 - Article
AN - SCOPUS:85174800658
SN - 2169-3536
VL - 11
SP - 108571
EP - 108579
JO - IEEE Access
JF - IEEE Access
ER -
