StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis

Zhiyong Chen; Xinnuo Li; Zhiqi Ai; Shugong Xu

doi:10.1007/978-981-97-8795-1_18

Zhiyong Chen, Xinnuo Li*, Zhiqi Ai, Shugong Xu*

* Corresponding author for this work

Shanghai University

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

Abstract

We introduce StyleFusion-TTS, a prompt- and/or audio-referenced, style- and speaker-controllable, zero-shot text-to-speech (TTS) synthesis system designed to improve on the editability and naturalness of systems in the current research literature. We propose a general front-end encoder as a compact and effective module that takes multimodal inputs (text prompts, audio references, and speaker timbre references) in a fully zero-shot manner and produces disentangled style and speaker control embeddings. Our approach also leverages a hierarchical conformer structure to fuse the style and speaker control embeddings, aiming to achieve optimal feature fusion within the current advanced TTS architecture. StyleFusion-TTS is evaluated with multiple subjective and objective metrics. The system shows promising performance across our evaluations, suggesting its potential to contribute to the advancement of zero-shot text-to-speech synthesis. A project website provides detailed information for demonstration and reproduction.

Original language	English
Title of host publication	Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
Editors	Zhouchen Lin, Hongbin Zha, Ming-Ming Cheng, Ran He, Cheng-Lin Liu, Kurban Ubul, Wushouer Silamu, Jie Zhou
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	263-277
Number of pages	15
ISBN (Print)	9789819787944
DOIs	https://doi.org/10.1007/978-981-97-8795-1_18
Publication status	Published - 2025
Externally published	Yes
Event	7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024 - Urumqi, China. Duration: 18 Oct 2024 → 20 Oct 2024

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	15041 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Country/Territory	China
City	Urumqi
Period	18/10/24 → 20/10/24

Keywords

Multimodal learning
Text-to-speech synthesis
Voice cloning
Zero-shot learning

Cite this

Chen, Z., Li, X., Ai, Z., & Xu, S. (2025). StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis. In Z. Lin, H. Zha, M.-M. Cheng, R. He, C.-L. Liu, K. Ubul, W. Silamu, & J. Zhou (Eds.), Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings (pp. 263-277). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 15041 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-97-8795-1_18

Chen, Zhiyong ; Li, Xinnuo ; Ai, Zhiqi et al. / StyleFusion TTS : Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis. Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings. editor / Zhouchen Lin ; Hongbin Zha ; Ming-Ming Cheng ; Ran He ; Cheng-Lin Liu ; Kurban Ubul ; Wushouer Silamu ; Jie Zhou. Springer Science and Business Media Deutschland GmbH, 2025. pp. 263-277 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{6162fa07734b42a785a58b691214e479,

title = "StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis",

abstract = "We introduce StyleFusion-TTS, a prompt- and/or audio-referenced, style- and speaker-controllable, zero-shot text-to-speech (TTS) synthesis system designed to improve on the editability and naturalness of systems in the current research literature. We propose a general front-end encoder as a compact and effective module that takes multimodal inputs (text prompts, audio references, and speaker timbre references) in a fully zero-shot manner and produces disentangled style and speaker control embeddings. Our approach also leverages a hierarchical conformer structure to fuse the style and speaker control embeddings, aiming to achieve optimal feature fusion within the current advanced TTS architecture. StyleFusion-TTS is evaluated with multiple subjective and objective metrics. The system shows promising performance across our evaluations, suggesting its potential to contribute to the advancement of zero-shot text-to-speech synthesis. A project website provides detailed information for demonstration and reproduction.",

keywords = "Multimodal learning, Text-to-speech synthesis, Voice cloning, Zero-shot learning",

author = "Zhiyong Chen and Xinnuo Li and Zhiqi Ai and Shugong Xu",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025; 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024; Conference date: 18-10-2024 through 20-10-2024",

year = "2025",

doi = "10.1007/978-981-97-8795-1_18",

language = "English",

isbn = "9789819787944",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "263--277",

editor = "Zhouchen Lin and Hongbin Zha and Ming-Ming Cheng and Ran He and Cheng-Lin Liu and Kurban Ubul and Wushouer Silamu and Jie Zhou",

booktitle = "Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings",

}

Chen, Z, Li, X, Ai, Z & Xu, S 2025, StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis. in Z Lin, H Zha, M-M Cheng, R He, C-L Liu, K Ubul, W Silamu & J Zhou (eds), Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 15041 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 263-277, 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024, Urumqi, China, 18/10/24. https://doi.org/10.1007/978-981-97-8795-1_18

StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis. / Chen, Zhiyong; Li, Xinnuo; Ai, Zhiqi et al.
Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings. ed. / Zhouchen Lin; Hongbin Zha; Ming-Ming Cheng; Ran He; Cheng-Lin Liu; Kurban Ubul; Wushouer Silamu; Jie Zhou. Springer Science and Business Media Deutschland GmbH, 2025. p. 263-277 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 15041 LNCS).

TY - GEN

T1 - StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis

T2 - 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024

AU - Chen, Zhiyong

AU - Li, Xinnuo

AU - Ai, Zhiqi

AU - Xu, Shugong

N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.

PY - 2025

Y1 - 2025

N2 - We introduce StyleFusion-TTS, a prompt- and/or audio-referenced, style- and speaker-controllable, zero-shot text-to-speech (TTS) synthesis system designed to improve on the editability and naturalness of systems in the current research literature. We propose a general front-end encoder as a compact and effective module that takes multimodal inputs (text prompts, audio references, and speaker timbre references) in a fully zero-shot manner and produces disentangled style and speaker control embeddings. Our approach also leverages a hierarchical conformer structure to fuse the style and speaker control embeddings, aiming to achieve optimal feature fusion within the current advanced TTS architecture. StyleFusion-TTS is evaluated with multiple subjective and objective metrics. The system shows promising performance across our evaluations, suggesting its potential to contribute to the advancement of zero-shot text-to-speech synthesis. A project website provides detailed information for demonstration and reproduction.

KW - Multimodal learning

KW - Text-to-speech synthesis

KW - Voice cloning

KW - Zero-shot learning

UR - http://www.scopus.com/inward/record.url?scp=85209385779&partnerID=8YFLogxK

U2 - 10.1007/978-981-97-8795-1_18

DO - 10.1007/978-981-97-8795-1_18

M3 - Conference Proceeding

AN - SCOPUS:85209385779

SN - 9789819787944

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 263

EP - 277

BT - Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings

A2 - Lin, Zhouchen

A2 - Zha, Hongbin

A2 - Cheng, Ming-Ming

A2 - He, Ran

A2 - Liu, Cheng-Lin

A2 - Ubul, Kurban

A2 - Silamu, Wushouer

A2 - Zhou, Jie

PB - Springer Science and Business Media Deutschland GmbH

Y2 - 18 October 2024 through 20 October 2024

ER -

Chen Z, Li X, Ai Z, Xu S. StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis. In Lin Z, Zha H, Cheng MM, He R, Liu CL, Ubul K, Silamu W, Zhou J, editors, Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings. Springer Science and Business Media Deutschland GmbH. 2025. p. 263-277. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-981-97-8795-1_18