TY - JOUR
T1 - ASTT
T2 - acoustic spatial-temporal transformer for short utterance speaker recognition
AU - Wu, Xing
AU - Li, Ruixuan
AU - Deng, Bin
AU - Zhao, Ming
AU - Du, Xingyue
AU - Wang, Jianjia
AU - Ding, Kai
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2023/9
Y1 - 2023/9
N2 - Text-independent Short Utterance Speaker Recognition (SUSR) is important for person authentication. However, speaker recognition with short utterances, defined as speech shorter than 5 seconds, remains a great challenge. To address this problem, an Acoustic Spatial-Temporal Transformer (ASTT) method is proposed to alleviate the bottleneck of short utterance speaker recognition. The contributions of the proposed ASTT method are twofold. On the one hand, the ASTT method has a simple and elegant structure: without convolutional structures, it is based purely on an attention mechanism that combines the temporal and spatial features of speakers with knowledge transferred from ImageNet. On the other hand, the ASTT method performs well on text-independent short utterance speaker recognition. Extensive experiments demonstrate that the proposed ASTT method outperforms state-of-the-art methods on audio datasets with speech clips no longer than 5 seconds, achieving an equal error rate (EER) of 6.93% and a minimum detection cost function (minDCF) of 0.487, relative improvements of 41.8% and 33.7%, respectively. Furthermore, qualitative and quantitative analyses demonstrate the effectiveness and efficiency of the proposed ASTT, which not only accelerates model convergence but also reduces the size of the training data by 90%.
AB - Text-independent Short Utterance Speaker Recognition (SUSR) is important for person authentication. However, speaker recognition with short utterances, defined as speech shorter than 5 seconds, remains a great challenge. To address this problem, an Acoustic Spatial-Temporal Transformer (ASTT) method is proposed to alleviate the bottleneck of short utterance speaker recognition. The contributions of the proposed ASTT method are twofold. On the one hand, the ASTT method has a simple and elegant structure: without convolutional structures, it is based purely on an attention mechanism that combines the temporal and spatial features of speakers with knowledge transferred from ImageNet. On the other hand, the ASTT method performs well on text-independent short utterance speaker recognition. Extensive experiments demonstrate that the proposed ASTT method outperforms state-of-the-art methods on audio datasets with speech clips no longer than 5 seconds, achieving an equal error rate (EER) of 6.93% and a minimum detection cost function (minDCF) of 0.487, relative improvements of 41.8% and 33.7%, respectively. Furthermore, qualitative and quantitative analyses demonstrate the effectiveness and efficiency of the proposed ASTT, which not only accelerates model convergence but also reduces the size of the training data by 90%.
KW - Data efficiency
KW - Short utterance
KW - Speaker recognition
KW - Text-independent
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85149216417&partnerID=8YFLogxK
U2 - 10.1007/s11042-023-14657-x
DO - 10.1007/s11042-023-14657-x
M3 - Article
AN - SCOPUS:85149216417
SN - 1380-7501
VL - 82
SP - 33039
EP - 33061
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 21
ER -