TY - JOUR
T1 - ASTT
T2 - acoustic spatial-temporal transformer for short utterance speaker recognition
AU - Wu, Xing
AU - Li, Ruixuan
AU - Deng, Bin
AU - Zhao, Ming
AU - Du, Xingyue
AU - Wang, Jianjia
AU - Ding, Kai
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2023/9
Y1 - 2023/9
N2 - Text-independent Short Utterance Speaker Recognition (SUSR) is important for person authentication. However, speaker recognition with short utterances, defined as speech shorter than 5 seconds, remains a great challenge. To address this problem, an Acoustic Spatial-Temporal Transformer (ASTT) method is proposed to alleviate the bottleneck of short utterance speaker recognition. The contributions of the proposed ASTT method are twofold. On the one hand, the ASTT method has a simple and elegant structure: without convolutional structures, it is based purely on an attention mechanism that combines the temporal and spatial features of speakers with knowledge transferred from ImageNet. On the other hand, the ASTT method performs well on text-independent short utterance speaker recognition. Extensive experiments demonstrate that the proposed ASTT method outperforms state-of-the-art methods on audio datasets with speech clips no longer than 5 seconds, achieving an equal error rate (EER) of 6.93% and a minimum detection cost function (minDCF) of 0.487, relative improvements of 41.8% and 33.7%, respectively. Furthermore, qualitative and quantitative analyses demonstrate the effectiveness and efficiency of the proposed ASTT, which not only accelerates model convergence but also reduces the size of the training data by 90%.
AB - Text-independent Short Utterance Speaker Recognition (SUSR) is important for person authentication. However, speaker recognition with short utterances, defined as speech shorter than 5 seconds, remains a great challenge. To address this problem, an Acoustic Spatial-Temporal Transformer (ASTT) method is proposed to alleviate the bottleneck of short utterance speaker recognition. The contributions of the proposed ASTT method are twofold. On the one hand, the ASTT method has a simple and elegant structure: without convolutional structures, it is based purely on an attention mechanism that combines the temporal and spatial features of speakers with knowledge transferred from ImageNet. On the other hand, the ASTT method performs well on text-independent short utterance speaker recognition. Extensive experiments demonstrate that the proposed ASTT method outperforms state-of-the-art methods on audio datasets with speech clips no longer than 5 seconds, achieving an equal error rate (EER) of 6.93% and a minimum detection cost function (minDCF) of 0.487, relative improvements of 41.8% and 33.7%, respectively. Furthermore, qualitative and quantitative analyses demonstrate the effectiveness and efficiency of the proposed ASTT, which not only accelerates model convergence but also reduces the size of the training data by 90%.
KW - Data efficiency
KW - Short utterance
KW - Speaker recognition
KW - Text-independent
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85149216417&partnerID=8YFLogxK
U2 - 10.1007/s11042-023-14657-x
DO - 10.1007/s11042-023-14657-x
M3 - Article
AN - SCOPUS:85149216417
SN - 1380-7501
VL - 82
SP - 33039
EP - 33061
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 21
ER -