TY - GEN
T1 - Enhancing Open-Set Speaker Identification Through Rapid Tuning With Speaker Reciprocal Points and Negative Sample
AU - Chen, Zhiyong
AU - Ai, Zhiqi
AU - Li, Xinnuo
AU - Xu, Shugong
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - This paper introduces a novel framework for open-set speaker identification (SID) in household environments, a task that plays a crucial role in facilitating seamless human-computer interaction. Addressing the limitations of current speaker models and classification approaches, our work integrates a pretrained WavLM frontend with a few-shot, rapid-tuning neural network (NN) backend for enrollment, employing task-optimized Speaker Reciprocal Points Learning (SRPL) to enhance discrimination across multiple target speakers. Furthermore, we propose an enhanced version of SRPL (SRPL+), which incorporates negative sample learning with both speech-synthesized and real negative samples to significantly improve open-set SID accuracy. Our approach is thoroughly evaluated across various multi-language, text-dependent speaker recognition datasets, demonstrating its effectiveness in achieving high usability for complex household multi-speaker recognition scenarios. The proposed system improves open-set performance by up to 27% over direct use of the efficient WavLM base+ model. Detailed information and the open-source implementation are available on our project website.
AB - This paper introduces a novel framework for open-set speaker identification (SID) in household environments, a task that plays a crucial role in facilitating seamless human-computer interaction. Addressing the limitations of current speaker models and classification approaches, our work integrates a pretrained WavLM frontend with a few-shot, rapid-tuning neural network (NN) backend for enrollment, employing task-optimized Speaker Reciprocal Points Learning (SRPL) to enhance discrimination across multiple target speakers. Furthermore, we propose an enhanced version of SRPL (SRPL+), which incorporates negative sample learning with both speech-synthesized and real negative samples to significantly improve open-set SID accuracy. Our approach is thoroughly evaluated across various multi-language, text-dependent speaker recognition datasets, demonstrating its effectiveness in achieving high usability for complex household multi-speaker recognition scenarios. The proposed system improves open-set performance by up to 27% over direct use of the efficient WavLM base+ model. Detailed information and the open-source implementation are available on our project website.
KW - few-shot learning
KW - open-set learning
KW - Speaker identification
KW - speaker recognition
KW - speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85217432657&partnerID=8YFLogxK
U2 - 10.1109/SLT61566.2024.10832359
DO - 10.1109/SLT61566.2024.10832359
M3 - Conference Proceeding
AN - SCOPUS:85217432657
T3 - Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024
SP - 1144
EP - 1149
BT - Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE Spoken Language Technology Workshop, SLT 2024
Y2 - 2 December 2024 through 5 December 2024
ER -