TY - GEN
T1 - Disentangling the Prosody and Semantic Information with Pre-Trained Model for In-Context Learning Based Zero-Shot Voice Conversion
AU - Chen, Zhengyang
AU - Wang, Shuai
AU - Zhang, Mingyang
AU - Liu, Xuechen
AU - Yamagishi, Junichi
AU - Qian, Yanmin
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Voice conversion (VC) aims to modify the speaker's timbre while retaining speech content. Previous approaches have tokenized the outputs from self-supervised models into semantic tokens, facilitating the disentanglement of speech content information. Recently, in-context learning (ICL) has emerged in text-to-speech (TTS) systems for effectively modeling specific characteristics such as timbre through context conditioning. This paper proposes an ICL-capability-enhanced VC system (ICL-VC) employing a mask-and-reconstruction training strategy based on flow-matching generative models. Our experiments on the LibriTTS dataset demonstrate that, when augmented with semantic tokens, ICL-VC improves speaker similarity. Additionally, we find that k-means is a versatile tokenization method applicable to various pre-trained models. However, the ICL-VC system faces challenges in preserving the prosody of the source speech. To mitigate this issue, we propose incorporating prosody embeddings extracted from a pre-trained emotion recognition model into our system. Integrating these prosody embeddings notably enhances the system's capability to preserve source speech prosody, as validated on the Emotional Speech Database.
AB - Voice conversion (VC) aims to modify the speaker's timbre while retaining speech content. Previous approaches have tokenized the outputs from self-supervised models into semantic tokens, facilitating the disentanglement of speech content information. Recently, in-context learning (ICL) has emerged in text-to-speech (TTS) systems for effectively modeling specific characteristics such as timbre through context conditioning. This paper proposes an ICL-capability-enhanced VC system (ICL-VC) employing a mask-and-reconstruction training strategy based on flow-matching generative models. Our experiments on the LibriTTS dataset demonstrate that, when augmented with semantic tokens, ICL-VC improves speaker similarity. Additionally, we find that k-means is a versatile tokenization method applicable to various pre-trained models. However, the ICL-VC system faces challenges in preserving the prosody of the source speech. To mitigate this issue, we propose incorporating prosody embeddings extracted from a pre-trained emotion recognition model into our system. Integrating these prosody embeddings notably enhances the system's capability to preserve source speech prosody, as validated on the Emotional Speech Database.
KW - Emotional Speech Database
KW - in-context learning
KW - LibriTTS
KW - prosody preservation
KW - Voice conversion
UR - https://www.scopus.com/pages/publications/85217388739
U2 - 10.1109/SLT61566.2024.10832278
DO - 10.1109/SLT61566.2024.10832278
M3 - Conference Proceeding
AN - SCOPUS:85217388739
T3 - Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024
SP - 698
EP - 704
BT - Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE Spoken Language Technology Workshop, SLT 2024
Y2 - 2 December 2024 through 5 December 2024
ER -