Towards Robust Speaker Recognition against Intrinsic Variation with Foundation Model Few-shot Tuning and Effective Speech Synthesis

  • Zhiyong Chen
  • Shuhang Wu
  • Xinnuo Li
  • Zhiqi Ai
  • Shugong Xu

Research output: Contribution to journal, Conference article, peer-reviewed

Abstract

Speaker recognition is essential for secure authentication and personalized voice assistants in smart home settings, but it faces challenges due to intrinsic speaker variability, such as aging and emotional fluctuations. Existing methods often rely on pretraining and require extensive data. To address these challenges, we propose a framework for time-varying and emotion-robust open-set identification (OSI) in smart home environments, combining few-shot enrollment-time tuning of a foundation model with style-rich zero-shot text-to-speech (TTS) synthesis. We explore best practices for synthetic data selection and suitable open-set outlier-focused loss functions. Our proposed method improves the handling of emotional and aging variations in target speakers, enhancing robustness to intrinsic variability while maintaining resilience to unknown outliers. Experiments demonstrate strong generalization across multiple time-varying and emotionally rich benchmarks.

Original language: English
Pages (from-to): 1118-1122
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
Publication status: Published - 2025
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 17 Aug 2025 - 21 Aug 2025

Keywords

  • few-shot learning
  • robust recognition
  • speaker identification
  • speaker recognition
  • speech synthesis
