Speech synthesis with face embeddings

Xing Wu; Sihui Ji; Jianjia Wang; Yike Guo

doi:10.1007/s10489-022-03227-7

Xing Wu^*, Sihui Ji, Jianjia Wang, Yike Guo

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

Human beings are capable of imagining a person’s voice according to his or her appearance because different people have different voice characteristics. Although researchers have made great progress in single-view speech synthesis, there are few studies on multi-view speech synthesis, especially speech synthesis using face images. On the basis of the implicit relationship between a speaker’s face image and his or her voice, we propose a multi-view speech synthesis method called SSFE (Speech Synthesis with Face Embeddings). SSFE consists of three parts: a voice encoder, a face encoder and an improved multi-speaker text-to-speech (TTS) engine. The voice encoder generates voice embeddings from the speaker’s speech, and the face encoder extracts voice features from the speaker’s face as f-voice embeddings. The multi-speaker TTS engine then synthesizes speech with the voice embeddings and f-voice embeddings. We conducted extensive experiments to evaluate SSFE on synthesized speech quality and face-voice matching degree: the Mean Opinion Score of SSFE is more than 3.7 and the matching degree is about 1.7. The experimental results show that SSFE outperforms state-of-the-art methods in both speech quality and face-voice matching degree.
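
For readers unfamiliar with the Mean Opinion Score quoted in the abstract: it is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. The sketch below shows that calculation together with a rough 95% interval. The function name `mean_opinion_score`, the normal-approximation interval, and the ratings are illustrative assumptions, not the authors' evaluation procedure or data.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale.

    The interval uses a normal approximation (z = 1.96), which is an
    assumption of this sketch and only rough for small listener panels.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for one synthesized utterance; not data from the paper.
ratings = [4, 4, 3, 5, 4, 3, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

A system-level score such as the "more than 3.7" reported above would typically be an average of this kind taken over many utterances and listeners.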

Original language	English
Pages (from-to)	14839-14852
Number of pages	14
Journal	Applied Intelligence
Volume	52
Issue number	13
DOIs	https://doi.org/10.1007/s10489-022-03227-7
Publication status	Published - Oct 2022
Externally published	Yes

Keywords

Face to voice
Multi-speaker text-to-speech
Multi-view speech synthesis
Visual-audio

Access to Document

10.1007/s10489-022-03227-7

Cite this

@article{ecdbb6e8ab4840709679310bb6f65313,

title = "Speech synthesis with face embeddings",

abstract = "Human beings are capable of imagining a person{\textquoteright}s voice according to his or her appearance because different people have different voice characteristics. Although researchers have made great progress in single-view speech synthesis, there are few studies on multi-view speech synthesis, especially the speech synthesis using face images. On the basis of implicit relationship between the speaker{\textquoteright}s face image and his or her voice, we propose a multi-view speech synthesis method called SSFE (Speech Synthesis with Face Embeddings). The proposed SSFE consists of three parts: a voice encoder, a face encoder and an improved multi-speaker text-to-speech (TTS) engine. On the one hand, the proposed voice encoder generates the voice embeddings from the speaker{\textquoteright}s speech and the proposed face encoder extracts the voice features from the speaker{\textquoteright}s face as f-voice embeddings. On the other hand, the multi-speaker TTS engine would synthesize the speech with voice embeddings and f-voice embeddings. We have conducted extensive experiments to evaluate the proposed SSFE on the synthesized speech quality and face-voice matching degree, in which the Mean Opinion Score of the SSFE is more than 3.7 and the matching degree is about 1.7. The experimental results prove that the proposed SSFE method outperforms state-of-the-art methods on the synthesized speech in terms of speech quality and face-voice matching degree.",

keywords = "Face to voice, Multi-speaker text-to-speech, Multi-view speech synthesis, Visual-audio",

author = "Xing Wu and Sihui Ji and Jianjia Wang and Yike Guo",

note = "Publisher Copyright: {\textcopyright} 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.",

year = "2022",

month = oct,

doi = "10.1007/s10489-022-03227-7",

language = "English",

volume = "52",

pages = "14839--14852",

journal = "Applied Intelligence",

issn = "0924-669X",

number = "13",

}

TY - JOUR

T1 - Speech synthesis with face embeddings

AU - Wu, Xing

AU - Ji, Sihui

AU - Wang, Jianjia

AU - Guo, Yike

PY - 2022/10

Y1 - 2022/10

N2 - Human beings are capable of imagining a person’s voice according to his or her appearance because different people have different voice characteristics. Although researchers have made great progress in single-view speech synthesis, there are few studies on multi-view speech synthesis, especially the speech synthesis using face images. On the basis of implicit relationship between the speaker’s face image and his or her voice, we propose a multi-view speech synthesis method called SSFE (Speech Synthesis with Face Embeddings). The proposed SSFE consists of three parts: a voice encoder, a face encoder and an improved multi-speaker text-to-speech (TTS) engine. On the one hand, the proposed voice encoder generates the voice embeddings from the speaker’s speech and the proposed face encoder extracts the voice features from the speaker’s face as f-voice embeddings. On the other hand, the multi-speaker TTS engine would synthesize the speech with voice embeddings and f-voice embeddings. We have conducted extensive experiments to evaluate the proposed SSFE on the synthesized speech quality and face-voice matching degree, in which the Mean Opinion Score of the SSFE is more than 3.7 and the matching degree is about 1.7. The experimental results prove that the proposed SSFE method outperforms state-of-the-art methods on the synthesized speech in terms of speech quality and face-voice matching degree.

KW - Face to voice

KW - Multi-speaker text-to-speech

KW - Multi-view speech synthesis

KW - Visual-audio

UR - http://www.scopus.com/inward/record.url?scp=85126548867&partnerID=8YFLogxK

U2 - 10.1007/s10489-022-03227-7

DO - 10.1007/s10489-022-03227-7

M3 - Article

AN - SCOPUS:85126548867

SN - 0924-669X

VL - 52

SP - 14839

EP - 14852

JO - Applied Intelligence

JF - Applied Intelligence

IS - 13

ER -
