Peking opera synthesis via duration informed attention network

Yusong Wu; Shengchen Li; Chengzhu Yu; Heng Lu; Chao Weng; Liqiang Zhang; Dong Yu

doi:10.21437/Interspeech.2020-1724

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

5 Citations (Scopus)

Abstract

Peking Opera has been the dominant form of Chinese performing art for around 200 years. A Peking Opera singer usually exhibits a very strong personal style by introducing improvisation and expressiveness on stage, which leads the actual rhythm and pitch contour to deviate significantly from the original music score. This inconsistency poses a great challenge for Peking Opera singing voice synthesis from a music score. In this work, we address this issue and synthesize expressive Peking Opera singing from the music score based on the Duration Informed Attention Network (DurIAN) framework. To tackle the rhythm mismatch, a Lagrange multiplier is used to find the optimal output phoneme duration sequence under the constraint of the given note duration from the music score. As for the pitch contour mismatch, instead of inferring it directly from the music score, we adopt a pseudo music score generated from the real singing and feed it as input during training. The experiments demonstrate that the proposed system can synthesize Peking Opera singing voice with high-quality timbre, pitch and expressiveness.
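
As a rough illustration of the rhythm-fitting step mentioned in the abstract, the sketch below assumes a simple least-squares objective: choose output durations d_i that stay as close as possible to the model's predicted phoneme durations p_i while summing exactly to the note duration T. Setting the gradient of the Lagrangian sum((d_i - p_i)^2) + lambda * (sum(d_i) - T) to zero gives d_i = p_i + (T - sum(p)) / n. The objective, the function name `fit_durations`, and the example numbers are assumptions made for illustration; the abstract does not state the exact formulation used in the paper.

```python
from typing import List


def fit_durations(predicted: List[float], note_duration: float) -> List[float]:
    """Adjust predicted phoneme durations so they sum to the note duration.

    Hypothetical objective (an assumption, not taken from the paper):
    minimize sum((d_i - p_i)^2) subject to sum(d_i) == note_duration.
    The Lagrange-multiplier solution spreads the total mismatch evenly
    across all phonemes in the note.
    """
    if not predicted:
        raise ValueError("at least one phoneme duration is required")
    n = len(predicted)
    # The stationarity condition 2 * (d_i - p_i) + lambda = 0 together with
    # the sum constraint yields the same additive shift for every phoneme.
    shift = (note_duration - sum(predicted)) / n
    return [p + shift for p in predicted]


if __name__ == "__main__":
    # Three phonemes predicted to last 0.6 s in total, sung on a 0.9 s note.
    print(fit_durations([0.1, 0.3, 0.2], 0.9))
```

This unweighted form can produce negative durations when the predicted total far exceeds the note duration; a practical system would need an extra constraint or clipping step, which the sketch leaves out.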

Original language	English
Title of host publication	Interspeech 2020
Publisher	International Speech Communication Association
Pages	1226-1230
Number of pages	5
ISBN (Print)	9781713820697
DOIs	https://doi.org/10.21437/Interspeech.2020-1724
Publication status	Published - 2020
Externally published	Yes
Event	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China Duration: 25 Oct 2020 → 29 Oct 2020

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2020-October
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/Territory	China
City	Shanghai
Period	25/10/20 → 29/10/20

Keywords

Deep learning
Expressive singing synthesis
Lagrange multiplier
Machine learning
Singing synthesis

Cite this

Wu, Y., Li, S., Yu, C., Lu, H., Weng, C., Zhang, L., & Yu, D. (2020). Peking opera synthesis via duration informed attention network. In Interspeech 2020 (pp. 1226-1230). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2020-October). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2020-1724

@inproceedings{deba7053179a481590063471a0fa0ba4,
  title = "Peking opera synthesis via duration informed attention network",
  abstract = "Peking Opera has been the most dominant form of Chinese performing art since around 200 years ago. A Peking Opera singer usually exhibits a very strong personal style via introducing improvisation and expressiveness on stage which leads the actual rhythm and pitch contour to deviate significantly from the original music score. This inconsistency poses a great challenge in Peking Opera singing voice synthesis from a music score. In this work, we propose to deal with this issue and synthesize expressive Peking Opera singing from the music score based on the Duration Informed Attention Network (DurIAN) framework. To tackle the rhythm mismatch, Lagrange multiplier is used to find the optimal output phoneme duration sequence with the constraint of the given note duration from music score. As for the pitch contour mismatch, instead of directly inferring from music score, we adopt a pseudo music score generated from the real singing and feed it as input during training. The experiments demonstrate that with the proposed system we can synthesize Peking Opera singing voice with high-quality timbre, pitch and expressiveness.",
  keywords = "Deep learning, Expressive singing synthesis, Lagrange multiplier, Machine learning, Singing synthesis",
  author = "Yusong Wu and Shengchen Li and Chengzhu Yu and Heng Lu and Chao Weng and Liqiang Zhang and Dong Yu",
  note = "Publisher Copyright: {\textcopyright} 2020 International Speech Communication Association. All rights reserved.; 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 ; Conference date: 25-10-2020 Through 29-10-2020",
  year = "2020",
  doi = "10.21437/Interspeech.2020-1724",
  language = "English",
  isbn = "9781713820697",
  series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
  publisher = "International Speech Communication Association",
  pages = "1226--1230",
  booktitle = "Interspeech 2020",
}

Wu, Y, Li, S, Yu, C, Lu, H, Weng, C, Zhang, L & Yu, D 2020, Peking opera synthesis via duration informed attention network. in Interspeech 2020. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, International Speech Communication Association, pp. 1226-1230, 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, Shanghai, China, 25/10/20. https://doi.org/10.21437/Interspeech.2020-1724

Peking opera synthesis via duration informed attention network. / Wu, Yusong; Li, Shengchen; Yu, Chengzhu et al.
Interspeech 2020. International Speech Communication Association, 2020. p. 1226-1230 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2020-October).

TY - GEN

T1 - Peking opera synthesis via duration informed attention network

AU - Wu, Yusong

AU - Li, Shengchen

AU - Yu, Chengzhu

AU - Lu, Heng

AU - Weng, Chao

AU - Zhang, Liqiang

AU - Yu, Dong

PY - 2020

Y1 - 2020

AB - Peking Opera has been the most dominant form of Chinese performing art since around 200 years ago. A Peking Opera singer usually exhibits a very strong personal style via introducing improvisation and expressiveness on stage which leads the actual rhythm and pitch contour to deviate significantly from the original music score. This inconsistency poses a great challenge in Peking Opera singing voice synthesis from a music score. In this work, we propose to deal with this issue and synthesize expressive Peking Opera singing from the music score based on the Duration Informed Attention Network (DurIAN) framework. To tackle the rhythm mismatch, Lagrange multiplier is used to find the optimal output phoneme duration sequence with the constraint of the given note duration from music score. As for the pitch contour mismatch, instead of directly inferring from music score, we adopt a pseudo music score generated from the real singing and feed it as input during training. The experiments demonstrate that with the proposed system we can synthesize Peking Opera singing voice with high-quality timbre, pitch and expressiveness.

KW - Deep learning

KW - Expressive singing synthesis

KW - Lagrange multiplier

KW - Machine learning

KW - Singing synthesis

UR - http://www.scopus.com/inward/record.url?scp=85098157014&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2020-1724

DO - 10.21437/Interspeech.2020-1724

M3 - Conference Proceeding

AN - SCOPUS:85098157014

SN - 9781713820697

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 1226

EP - 1230

BT - Interspeech 2020

PB - International Speech Communication Association

T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020

Y2 - 25 October 2020 through 29 October 2020

ER -
