MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

Kah Liang Ong, Chin Poo Lee*, Heng Siong Lim, Kian Ming Lim, Ali Alqahtani

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the 'MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP).' The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
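The dual-path pipeline in the abstract (CQT spectrogram → MaxViT, Mel-STFT spectrogram → MViTv2, concatenated features → MLP classifier) can be sketched as follows. This is a minimal illustration only: the two backbone functions are random stand-ins for the actual pretrained MaxViT and MViTv2 models, and the feature dimensions, hidden size, and untrained MLP weights are hypothetical values not taken from the paper.

```python
import numpy as np

# Hypothetical dimensions; the paper does not report these exact values.
CQT_FEAT_DIM = 512   # stand-in for the MaxViT feature size
MEL_FEAT_DIM = 512   # stand-in for the MViTv2 feature size
NUM_EMOTIONS = 7     # e.g. Emo-DB defines 7 emotion classes

rng = np.random.default_rng(0)

def maxvit_branch(cqt_spectrogram):
    """Stand-in for the MaxViT backbone applied to the CQT spectrogram."""
    # A real implementation would run a pretrained MaxViT here.
    return rng.standard_normal(CQT_FEAT_DIM)

def mvitv2_branch(mel_spectrogram):
    """Stand-in for the MViTv2 backbone applied to the Mel-STFT spectrogram."""
    return rng.standard_normal(MEL_FEAT_DIM)

def mlp_head(fused, hidden=256):
    """Minimal two-layer MLP classifier over the fused features (untrained)."""
    w1 = rng.standard_normal((fused.size, hidden)) * 0.01
    w2 = rng.standard_normal((hidden, NUM_EMOTIONS)) * 0.01
    h = np.maximum(fused @ w1, 0.0)      # ReLU
    logits = h @ w2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()               # softmax probabilities

# Dual-path fusion: run both branches, concatenate, then classify.
cqt_features = maxvit_branch(None)
mel_features = mvitv2_branch(None)
fused = np.concatenate([cqt_features, mel_features])
probs = mlp_head(fused)
print(probs.shape)  # (7,)
```

The key design choice mirrored here is late fusion: each spectrogram type is encoded by the backbone best suited to it, and only the resulting feature vectors are merged before classification.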

Original language: English
Pages (from-to): 18237-18250
Number of pages: 14
Journal: IEEE Access
Volume: 12
DOIs
Publication status: Published - 2024
Externally published: Yes

Keywords

  • Emo-DB
  • ensemble learning
  • IEMOCAP
  • RAVDESS
  • spectrogram
  • Speech emotion recognition
  • vision transformer
