TY - JOUR
T1 - SCQT-MaxViT: Speech Emotion Recognition With Constant-Q Transform and Multi-Axis Vision Transformer
AU - Ong, Kah Liang
AU - Lee, Chin Poo
AU - Lim, Heng Siong
AU - Lim, Kian Ming
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023
Y1 - 2023
N2 - Speech emotion recognition presents a significant challenge within the field of affective computing, requiring the analysis and detection of emotions conveyed through speech signals. However, existing approaches often rely on traditional signal processing techniques and handcrafted features, which may not effectively capture the nuanced aspects of emotional expression. In this paper, an approach named 'SCQT-MaxViT' is proposed for speech emotion recognition, combining signal processing, computer vision, and deep learning techniques. The method utilizes the Constant-Q Transform (CQT) to convert speech waveforms into spectrograms, providing high-frequency resolution and enabling the model to capture intricate emotional details. Additionally, the Multi-axis Vision Transformer (MaxViT) is employed for further representation learning and classification of the CQT spectrograms. MaxViT incorporates a multi-axis self-attention mechanism, facilitating both local and global interactions within the network and enhancing the ability of the model to learn meaningful features. Furthermore, the dataset is augmented using random time masking techniques to enhance the generalization capabilities. Achieving accuracies of 88.68% on the Emo-DB dataset, 77.54% on the RAVDESS dataset, and 62.49% on the IEMOCAP dataset, the proposed SCQT-MaxViT method exhibits promising performance in capturing and recognizing emotions in speech signals.
KW - constant-Q transform
KW - Emo-DB
KW - IEMOCAP
KW - multi-axis vision transformer
KW - RAVDESS
KW - spectrogram
KW - Speech
KW - speech emotion
KW - speech emotion recognition
KW - vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85163460658&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3288526
DO - 10.1109/ACCESS.2023.3288526
M3 - Article
AN - SCOPUS:85163460658
SN - 2169-3536
VL - 11
SP - 63081
EP - 63091
JO - IEEE Access
JF - IEEE Access
ER -