TY - JOUR
T1 - SCQT-MaxViT: Speech Emotion Recognition With Constant-Q Transform and Multi-Axis Vision Transformer
AU - Ong, Kah Liang
AU - Lee, Chin Poo
AU - Lim, Heng Siong
AU - Lim, Kian Ming
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023
Y1 - 2023
N2 - Speech emotion recognition presents a significant challenge within the field of affective computing, requiring the analysis and detection of emotions conveyed through speech signals. However, existing approaches often rely on traditional signal processing techniques and handcrafted features, which may not effectively capture the nuanced aspects of emotional expression. In this paper, an approach named 'SCQT-MaxViT' is proposed for speech emotion recognition, combining signal processing, computer vision, and deep learning techniques. The method utilizes the Constant-Q Transform (CQT) to convert speech waveforms into spectrograms, providing high-frequency resolution and enabling the model to capture intricate emotional details. Additionally, the Multi-axis Vision Transformer (MaxViT) is employed for further representation learning and classification of the CQT spectrograms. MaxViT incorporates a multi-axis self-attention mechanism, facilitating both local and global interactions within the network and enhancing the ability of the model to learn meaningful features. Furthermore, the dataset is augmented using random time masking techniques to enhance the generalization capabilities. Achieving accuracies of 88.68% on the Emo-DB dataset, 77.54% on the RAVDESS dataset, and 62.49% on the IEMOCAP dataset, the proposed SCQT-MaxViT method exhibits promising performance in capturing and recognizing emotions in speech signals.
KW - constant-Q transform
KW - Emo-DB
KW - IEMOCAP
KW - multi-axis vision transformer
KW - RAVDESS
KW - spectrogram
KW - Speech
KW - speech emotion
KW - speech emotion recognition
KW - vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85163460658&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3288526
DO - 10.1109/ACCESS.2023.3288526
M3 - Article
AN - SCOPUS:85163460658
SN - 2169-3536
VL - 11
SP - 63081
EP - 63091
JO - IEEE Access
JF - IEEE Access
ER -