Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

Zhiyong Chen, Zhiqi Ai, Youxuan Ma, Xinnuo Li, Shugong Xu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

In the era of advanced text-to-speech (TTS) systems capable of generating high-fidelity, human-like speech from a reference speech sample, voice cloning (VC), also referred to as zero-shot TTS (ZS-TTS), stands out as an important subtask. A primary challenge in VC is maintaining speech quality and speaker similarity with limited reference data for a specific speaker. However, existing VC systems often rely on naive combinations of embedded speaker vectors for speaker control, which compromises the capture of speaking style, voice print, and semantic accuracy. To overcome this, we introduce the Two-branch Speaker Control Module (TSCM), a novel and highly adaptable voice cloning module designed to precisely process speaker and style control for a target speaker. Our method fuses local-level features from a Gated Convolutional Network (GCN) with utterance-level features from a gated recurrent unit (GRU) to enhance speaker control. We demonstrate the effectiveness of TSCM by integrating it into advanced TTS systems such as the FastSpeech 2 and VITS architectures, significantly improving their performance. Experimental results show that TSCM enables accurate voice cloning for a target speaker with minimal data, through either zero-shot inference or few-shot fine-tuning of pretrained TTS models. Furthermore, our TSCM-based VITS (TSCM-VITS) achieves superior performance in zero-shot scenarios compared to existing state-of-the-art VC systems, even with basic dataset configurations. Our method's superiority is validated through comprehensive subjective and objective evaluations. A demonstration of our system is available at https://great-research.github.io/tsct-tts-demo/, providing practical insights into its application and effectiveness.
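The abstract describes a two-branch fusion in which a gated convolutional branch extracts local-level (frame-wise) features from the reference speech while a GRU branch summarizes it into an utterance-level embedding, and the two are combined into a single speaker-control representation. The sketch below is a minimal, illustrative rendering of that idea, assuming PyTorch; the class names, dimensions, mean pooling, and sigmoid-gated fusion rule are assumptions made for illustration and are not the authors' published TSCM implementation.

# Minimal sketch of a two-branch speaker-control fusion (assumed design,
# not the authors' implementation): a gated 1-D convolution branch for
# local-level features and a GRU branch for an utterance-level embedding,
# combined by a learned sigmoid gate.
import torch
import torch.nn as nn


class GatedConvBranch(nn.Module):
    """Local-level branch: GLU-style gated 1-D convolution over reference frames."""

    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (batch, frames, dim)
        h = self.conv(x.transpose(1, 2))       # (batch, 2*dim, frames)
        a, b = h.chunk(2, dim=1)
        return (a * torch.sigmoid(b)).transpose(1, 2)  # (batch, frames, dim)


class UtteranceGRUBranch(nn.Module):
    """Utterance-level branch: GRU summarizing the whole reference utterance."""

    def __init__(self, dim: int):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):                      # x: (batch, frames, dim)
        _, h_n = self.gru(x)                   # h_n: (1, batch, dim)
        return h_n[-1]                         # (batch, dim)


class TwoBranchSpeakerControl(nn.Module):
    """Fuses local-level and utterance-level features into one speaker-control vector."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.local_branch = GatedConvBranch(dim)
        self.utt_branch = UtteranceGRUBranch(dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, ref_feats):              # ref_feats: (batch, frames, dim)
        local = self.local_branch(ref_feats).mean(dim=1)   # pool to (batch, dim)
        utt = self.utt_branch(ref_feats)                    # (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([local, utt], dim=-1)))
        return g * local + (1.0 - g) * utt                  # fused speaker vector


if __name__ == "__main__":
    ref = torch.randn(2, 120, 256)             # dummy reference-speech features
    print(TwoBranchSpeakerControl()(ref).shape)  # torch.Size([2, 256])

In an integration such as the one the abstract describes, the resulting speaker-control vector would condition the acoustic model (e.g., FastSpeech 2 or VITS) in place of a single naively combined speaker embedding.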

Original language: English
Article number: 28
Journal: EURASIP Journal on Audio, Speech, and Music Processing
Volume: 2024
Issue number: 1
Publication status: Published - Dec 2024
Externally published: Yes

Keywords

  • Few-shot learning
  • Speech synthesis
  • Text to speech
  • Voice cloning
  • Zero-shot learning
