MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

Kah Liang Ong; Chin Poo Lee; Heng Siong Lim; Kian Ming Lim; Ali Alqahtani

doi:10.1109/ACCESS.2024.3360483

Kah Liang Ong, Chin Poo Lee^*, Heng Siong Lim, Kian Ming Lim, Ali Alqahtani

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the 'MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP).' The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
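The fusion step described in the abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the two transformer branches are replaced by random stand-in feature vectors, "integrated" is assumed to mean concatenation, and every name and size (`CQT_DIM`, `MEL_DIM`, `HIDDEN`, `NUM_CLASSES`, `mlp_head`) is hypothetical. The MLP weights are random and untrained, so the predicted class carries no meaning; the sketch only shows how two branch embeddings could flow into one classifier.

```python
import math
import random


def concat_features(cqt_feat, mel_feat):
    """Late fusion, assumed here to be plain concatenation of branch embeddings."""
    return list(cqt_feat) + list(mel_feat)


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random, untrained weights; a real model would learn these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_head(fused, params):
    """Two-layer MLP classifier: dense -> ReLU -> dense -> softmax."""
    (w1, b1), (w2, b2) = params
    hidden = relu(linear(fused, w1, b1))
    return softmax(linear(hidden, w2, b2))


rng = random.Random(0)

# Hypothetical sizes chosen for illustration; the abstract does not state them.
CQT_DIM, MEL_DIM, HIDDEN, NUM_CLASSES = 8, 6, 16, 7

# Stand-ins for the features a MaxViT branch (CQT input) and an
# MViTv2 branch (Mel-STFT input) would produce.
cqt_feat = [rng.random() for _ in range(CQT_DIM)]
mel_feat = [rng.random() for _ in range(MEL_DIM)]

fused = concat_features(cqt_feat, mel_feat)
params = (
    init_layer(rng, CQT_DIM + MEL_DIM, HIDDEN),
    init_layer(rng, HIDDEN, NUM_CLASSES),
)
probs = mlp_head(fused, params)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
```

In a trained system the stand-in vectors would come from the two spectrogram branches, and `probs` would be the per-emotion class probabilities.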

Original language	English
Pages (from-to)	18237-18250
Number of pages	14
Journal	IEEE Access
Volume	12
DOIs	https://doi.org/10.1109/ACCESS.2024.3360483
Publication status	Published - 2024
Externally published	Yes

Keywords

Emo-DB
ensemble learning
IEMOCAP
RAVDESS
spectrogram
Speech emotion recognition
vision transformer

Cite this

@article{99ba64adec2d46be89a121a91997b39d,

title = "MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition",

abstract = "Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the 'MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP).' The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.",

keywords = "Emo-DB, ensemble learning, IEMOCAP, RAVDESS, spectrogram, Speech emotion recognition, vision transformer",

author = "Ong, {Kah Liang} and Lee, {Chin Poo} and Lim, {Heng Siong} and Lim, {Kian Ming} and Alqahtani, Ali",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2024",

doi = "10.1109/ACCESS.2024.3360483",

language = "English",

volume = "12",

pages = "18237--18250",

journal = "IEEE Access",

issn = "2169-3536",

}

TY - JOUR

T1 - MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

AU - Ong, Kah Liang

AU - Lee, Chin Poo

AU - Lim, Heng Siong

AU - Lim, Kian Ming

AU - Alqahtani, Ali

PY - 2024

Y1 - 2024

AB - Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the 'MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP).' The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.

KW - Emo-DB

KW - ensemble learning

KW - IEMOCAP

KW - RAVDESS

KW - spectrogram

KW - Speech emotion recognition

KW - vision transformer

UR - http://www.scopus.com/inward/record.url?scp=85184318936&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2024.3360483

DO - 10.1109/ACCESS.2024.3360483

M3 - Article

AN - SCOPUS:85184318936

SN - 2169-3536

VL - 12

SP - 18237

EP - 18250

JO - IEEE Access

JF - IEEE Access

ER -
