ASTT: acoustic spatial-temporal transformer for short utterance speaker recognition

Xing Wu*, Ruixuan Li, Bin Deng, Ming Zhao, Xingyue Du, Jianjia Wang, Kai Ding

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)


Text-independent Short Utterance Speaker Recognition (SUSR) is important for person authentication. However, speaker recognition remains a great challenge when the utterance is short, defined here as a speech duration of less than 5 seconds. To address this problem, an Acoustic Spatial-Temporal Transformer (ASTT) method is proposed to alleviate the bottleneck of short utterance speaker recognition. The contribution of the proposed ASTT method is twofold. On the one hand, the ASTT method has a simple and elegant structure: without any convolutional layers, it is based purely on an attention mechanism that combines the temporal and spatial features of speakers, with knowledge transferred from ImageNet. On the other hand, the ASTT method performs well on text-independent short utterance speaker recognition. Extensive experiments demonstrate that the proposed ASTT method outperforms state-of-the-art methods on an audio dataset of speech clips no longer than 5 seconds, with an equal error rate (EER) of 6.93% and a minimum detection cost function (minDCF) of 0.487, relative improvements of 41.8% and 33.7%, respectively. Furthermore, qualitative and quantitative analysis confirms the effectiveness and efficiency of the proposed ASTT, which not only accelerates model convergence but also reduces the size of the training data required by 90%.
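The abstract describes a convolution-free model built purely on attention over temporal and spatial (time-frequency) speaker features. As a rough illustration of that idea, here is a minimal NumPy sketch of single-head self-attention over flattened spectrogram patches with additive temporal and spatial position embeddings; all dimensions, names, and the mean-pooled embedding are illustrative assumptions, not details from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention over all patches
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, F, d = 8, 4, 16  # hypothetical: 8 time frames x 4 frequency bands, 16-dim patches
patches = rng.standard_normal((T * F, d))  # stand-in for embedded spectrogram patches

# Hypothetical learned position embeddings: one per time step (temporal)
# plus one per frequency band (spatial), added to each patch
t_pos = rng.standard_normal((T, d))
f_pos = rng.standard_normal((F, d))
pos = (t_pos[:, None, :] + f_pos[None, :, :]).reshape(T * F, d)
X = patches + pos

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)        # (T*F, d) attended patch features
embedding = out.mean(axis=0)               # pooled fixed-size speaker embedding
print(embedding.shape)
```

A real implementation would stack multiple multi-head attention layers and initialize them from ImageNet-pretrained weights, as the abstract's mention of knowledge transfer suggests; this sketch only shows how temporal and spatial positions can be combined under a single attention mechanism.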

Original language: English
Pages (from-to): 33039-33061
Number of pages: 23
Journal: Multimedia Tools and Applications
Issue number: 21
Publication status: Published - Sept 2023
Externally published: Yes


  • Data efficiency
  • Short utterance
  • Speaker recognition
  • Text-independent
  • Transformer

