TY - JOUR
T1 - Towards Robust Speaker Recognition against Intrinsic Variation with Foundation Model Few-shot Tuning and Effective Speech Synthesis
AU - Chen, Zhiyong
AU - Wu, Shuhang
AU - Li, Xinnuo
AU - Ai, Zhiqi
AU - Xu, Shugong
N1 - Publisher Copyright: © 2025 International Speech Communication Association. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Speaker recognition is essential for secure authentication and personalized voice assistants in smart home settings, but it faces challenges due to intrinsic speaker variability, such as aging and emotional fluctuations. Existing methods often rely on pretraining and require extensive data. To address these challenges, we propose a framework for time-varying and emotion-robust open-set identification (OSI) for smart home environments, utilizing few-shot foundation-model tuning at enrollment time and style-rich zero-shot text-to-speech (TTS) systems. We explore best practices for synthetic data selection and suitable open-set outlier-focused loss functions. Our proposed method improves the handling of emotional and aging variations in target speakers, enhancing robustness to intrinsic variability while maintaining resilience to unknown outliers. Experiments demonstrate strong generalization across multiple time-varying and emotionally rich benchmarks.
KW - few-shot learning
KW - robust recognition
KW - speaker identification
KW - speaker recognition
KW - speech synthesis
UR - https://www.scopus.com/pages/publications/105020032545
U2 - 10.21437/Interspeech.2025-42
DO - 10.21437/Interspeech.2025-42
M3 - Conference article
AN - SCOPUS:105020032545
SN - 2308-457X
SP - 1118
EP - 1122
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 26th Interspeech Conference 2025
Y2 - 17 August 2025 through 21 August 2025
ER -