
Learning Relationship Between Speaker Embeddings and Descriptions of Speaker Traits

  • Xuechen Liu*
  • Junichi Yamagishi
  • Xin Wang
  • Erica Cooper

*Corresponding author for this work

Affiliations:

  • Research Organization of Information and Systems, National Institute of Informatics, Japan
  • National Institute of Information and Communications Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Speech perception research reveals important connections between audio signals and perceptual speaker characteristics. Addressing this intersection, this study explores the relationship between textual descriptions of perceivable speaker characteristics and speech representations by establishing a joint learning space. For this purpose, we construct a dataset through extensive crowd-sourced listening tests based on VoxCeleb, in which participants provided detailed evaluations of diverse speaker attributes. These evaluations are transformed into structured textual descriptions, creating paired data that captures nuanced speaker characteristics. From these data, we extract speaker and text embeddings with corresponding pre-trained encoders. Our specialized linking networks then use contrastive learning and generative transformations to align these embeddings in a unified space. We apply them to cross-modal speaker retrieval in both English and Japanese, and extend to a multilingual scenario. Experimental results highlight the value of our curated dataset of listener-perceived speaker traits.
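The contrastive alignment described above can be illustrated with a CLIP-style symmetric InfoNCE objective over paired speaker and text embeddings. This is a generic sketch of the technique, not the paper's actual linking networks; the function names, batch layout, and temperature value are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_contrastive_loss(spk_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss (illustrative, not the paper's exact objective).

    spk_emb, txt_emb: (batch, dim) arrays; row i of each is a matched
    speaker/description pair. Matched pairs are pulled together and all
    other in-batch pairings are pushed apart.
    """
    s = l2_normalize(spk_emb)
    t = l2_normalize(txt_emb)
    logits = s @ t.T / temperature  # (batch, batch) cosine-similarity matrix

    def log_softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)  # stabilize before exponentiating
        return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

    idx = np.arange(len(logits))
    # Speaker-to-text direction: each speaker should retrieve its own description.
    loss_s2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-speaker direction: each description should retrieve its own speaker.
    loss_t2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_s2t + loss_t2s) / 2
```

Once the two encoders are aligned this way, cross-modal speaker retrieval reduces to ranking the similarity matrix: given a text embedding, the retrieved speaker is the row with the highest cosine similarity.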

Original language: English
Pages (from-to): 567-580
Number of pages: 14
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 34
DOIs
Publication status: Published - 2026
Externally published: Yes

Keywords

  • cross-modal retrieval
  • speaker embeddings
  • speaker retrieval
  • Speaker traits
