
Disentangling the Prosody and Semantic Information with Pre-Trained Model for In-Context Learning Based Zero-Shot Voice Conversion

  • Zhengyang Chen
  • Shuai Wang
  • Mingyang Zhang
  • Xuechen Liu
  • Junichi Yamagishi
  • Yanmin Qian*

*Corresponding author for this work

Affiliations:

  • Shanghai Jiao Tong University
  • Shenzhen Research Institute of Big Data
  • The Chinese University of Hong Kong, Shenzhen
  • Research Organization of Information and Systems, National Institute of Informatics

Research output: Chapter in Book/Report/Conference proceeding › Conference Proceeding › peer-review

3 Citations (Scopus)

Abstract

Voice conversion (VC) aims to modify the speaker's timbre while retaining the speech content. Previous approaches have tokenized the outputs of self-supervised models into semantic tokens, facilitating the disentanglement of speech content information. Recently, in-context learning (ICL) has emerged in text-to-speech (TTS) systems for effectively modeling specific characteristics such as timbre through context conditioning. This paper proposes an ICL-capability-enhanced VC system (ICL-VC) that employs a mask-and-reconstruction training strategy based on flow-matching generative models. When augmented with semantic tokens, ICL-VC improves speaker similarity in our experiments on the LibriTTS dataset. We additionally find that k-means is a versatile tokenization method applicable to various pre-trained models. However, the ICL-VC system struggles to preserve the prosody of the source speech. To mitigate this issue, we propose incorporating prosody embeddings extracted from a pre-trained emotion recognition model into our system. Integrating these prosody embeddings notably enhances the system's ability to preserve source-speech prosody, as validated on the Emotional Speech Database.
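Two components named in the abstract can be illustrated with a short sketch: k-means tokenization of self-supervised features into semantic tokens, and a mask-and-reconstruction objective under a flow-matching generative model. The code below is a minimal sketch under assumed shapes and names; the feature dimension (768), cluster count (500), mask layout, and the `model(x_t, t, cond)` interface are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, not the paper's code: (1) k-means tokenization of
# self-supervised (SSL) features, (2) a masked conditional flow-matching loss.
import numpy as np
import torch
from sklearn.cluster import KMeans

# (1) Fit k-means on frame-level SSL features pooled over a corpus, then map
# each frame of an utterance to its nearest centroid index (semantic token).
corpus_feats = np.random.randn(10_000, 768).astype(np.float32)  # stand-in for real SSL features
kmeans = KMeans(n_clusters=500, n_init=10, random_state=0).fit(corpus_feats)
utt_feats = np.random.randn(250, 768).astype(np.float32)        # one utterance, 250 frames
semantic_tokens = kmeans.predict(utt_feats)                     # shape (250,), ints in [0, 500)

# (2) Mask-and-reconstruction with flow matching: noise x0 is moved toward the
# target mel x1 along a straight path, and a (hypothetical) network predicts
# the path's velocity on masked frames given conditioning plus unmasked context.
def masked_flow_matching_loss(model, mel, cond, mask):
    """mel: (B, T, D) target features; mask: (B, T, 1), 1 on frames to reconstruct."""
    x0 = torch.randn_like(mel)                          # noise endpoint of the path
    t = torch.rand(mel.shape[0], 1, 1, device=mel.device)
    x_t = (1 - t) * x0 + t * mel                        # point on the straight path
    v_target = mel - x0                                 # constant velocity of that path
    v_pred = model(x_t, t, cond)                        # model sees masked x_t + conditioning
    return ((v_pred - v_target) ** 2 * mask).sum() / mask.sum()
```

In the setup the abstract describes, `cond` would carry the semantic tokens and, in the prosody-augmented variant, embeddings from a pre-trained emotion recognition model.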

Original language: English
Title of host publication: Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 698-704
Number of pages: 7
ISBN (Electronic): 9798350392258
Publication status: Published - 2024
Externally published: Yes
Event: 2024 IEEE Spoken Language Technology Workshop, SLT 2024 - Macao, China
Duration: 2 Dec 2024 – 5 Dec 2024

Publication series

Name: Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024

Conference

Conference: 2024 IEEE Spoken Language Technology Workshop, SLT 2024
Country/Territory: China
City: Macao
Period: 2/12/24 – 5/12/24

Keywords

  • Emotional Speech Database
  • in-context learning
  • LibriTTS
  • prosody preservation
  • Voice conversion
