StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis

Zhiyong Chen, Xinnuo Li*, Zhiqi Ai, Shugong Xu*

*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceedingConference Proceedingpeer-review

Abstract

We introduce StyleFusion-TTS, a prompt and/or audio referenced, style- and speaker-controllable, zero-shot text-to-speech (TTS) synthesis system designed to enhance the editability and naturalness of current research literature. We propose a general front-end encoder as a compact and effective module to utilize multimodal inputs-including text prompts, audio references, and speaker timbre references-in a fully zero-shot manner and produce disentangled style and speaker control embeddings. Our novel approach also leverages a hierarchical conformer structure for the fusion of style and speaker control embeddings, aiming to achieve optimal feature fusion within the current advanced TTS architecture. StyleFusion-TTS is evaluated through multiple metrics, both subjectively and objectively. The system shows promising performance across our evaluations, suggesting its potential to contribute to the advancement of the field of zero-shot text-to-speech synthesis. A project website provides detailed information for demonstration and reproduction.

Original languageEnglish
Title of host publicationPattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
EditorsZhouchen Lin, Hongbin Zha, Ming-Ming Cheng, Ran He, Cheng-Lin Liu, Kurban Ubul, Wushouer Silamu, Jie Zhou
PublisherSpringer Science and Business Media Deutschland GmbH
Pages263-277
Number of pages15
ISBN (Print)9789819787944
DOIs
Publication statusPublished - 2025
Externally publishedYes
Event7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024 - Urumqi, China
Duration: 18 Oct 202420 Oct 2024

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume15041 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Country/TerritoryChina
CityUrumqi
Period18/10/2420/10/24

Keywords

  • Multimodal learning
  • Text-to-speech synthesis
  • Voice cloning
  • Zero-shot learning

Fingerprint

Dive into the research topics of 'StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis'. Together they form a unique fingerprint.

Cite this