Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition

Bin Liu; Shuai Nie; Shan Liang; Wenju Liu; Meng Yu; Lianwu Chen; Shouye Peng; Changliang Li

doi:10.21437/Interspeech.2019-1242

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

33 Citations (Scopus)

Abstract

Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, resulting in dramatic performance degradation in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front-end of ASR. However, enhancement modules can introduce speech distortions and mismatches with training, which sometimes degrades ASR performance. In this paper, we propose jointly adversarial enhancement training to boost the robustness of end-to-end systems. Specifically, we use a joint compositional scheme of a mask-based enhancement network, an attention-based encoder-decoder network and a discriminant network during training. The discriminator is used to distinguish between the enhanced features from the enhancement network and clean features, which guides the enhancement network to produce outputs closer to the realistic distribution. With the joint optimization of the recognition, enhancement and adversarial losses, the compositional scheme is expected to learn more robust representations for the recognition task automatically. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.
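
The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss definitions, so everything below is an assumption made for illustration: the function names, the mean-squared-error enhancement loss, the non-saturating GAN losses, and the weights `alpha` and `beta` are hypothetical and may differ from the paper's actual formulation.

```python
import math

def apply_mask(noisy, mask):
    """Mask-based enhancement: scale each noisy feature by its mask value."""
    return [m * x for m, x in zip(mask, noisy)]

def mse(a, b):
    """Assumed enhancement loss: mean squared error against clean features."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def generator_adversarial_loss(d_enhanced):
    """Assumed adversarial loss for the enhancement network.

    d_enhanced is the discriminator's probability, in (0, 1), that the
    enhanced features are clean. The loss shrinks as the discriminator
    is fooled.
    """
    return -math.log(d_enhanced)

def discriminator_loss(d_clean, d_enhanced):
    """Assumed discriminator loss: score clean high and enhanced low."""
    return -(math.log(d_clean) + math.log(1.0 - d_enhanced))

def joint_loss(recognition, enhancement, adversarial, alpha=1.0, beta=1.0):
    """Weighted sum of the three losses; alpha and beta are hypothetical."""
    return recognition + alpha * enhancement + beta * adversarial

# Toy feature vectors, not real speech data.
noisy = [1.0, 2.0, 4.0]
mask = [0.5, 1.0, 0.25]
clean = [0.5, 2.0, 1.0]

enhanced = apply_mask(noisy, mask)
total = joint_loss(
    recognition=2.0,  # placeholder for the encoder-decoder loss
    enhancement=mse(enhanced, clean),
    adversarial=generator_adversarial_loss(0.5),
    alpha=0.5,
    beta=0.1,
)
print(total)
```

In a scheme of this kind the discriminator would be updated with its own loss in alternation with the enhancement and recognition networks, which share the weighted joint loss.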

Original language	English
Title of host publication	20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Pages	491-495
Number of pages	5
Volume	2019-September
DOIs	https://doi.org/10.21437/Interspeech.2019-1242
Publication status	Published - 2019
Externally published	Yes
Event	20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria; Duration: 15 Sept 2019 → 19 Sept 2019

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN (Print)	2308-457X

Conference

Conference	20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Country/Territory	Austria
City	Graz
Period	15/09/19 → 19/09/19

Keywords

End-to-end speech recognition
Generative adversarial networks
Robust speech recognition
Speech enhancement

Access to Document

10.21437/Interspeech.2019-1242

Cite this

Liu, B., Nie, S., Liang, S., Liu, W., Yu, M., Chen, L., Peng, S., & Li, C. (2019). Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition. In 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 (Vol. 2019-September, pp. 491-495). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH). https://doi.org/10.21437/Interspeech.2019-1242

Liu, Bin ; Nie, Shuai ; Liang, Shan et al. / Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition. 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019. Vol. 2019-September 2019. pp. 491-495 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH).

@inproceedings{42963de78cdd4bd3b53f75d187f4b122,

title = "Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition",

abstract = "Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, resulting in dramatic performance degradation in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front-end of ASR. However, enhancement modules can introduce speech distortions and mismatches with training, which sometimes degrades ASR performance. In this paper, we propose jointly adversarial enhancement training to boost the robustness of end-to-end systems. Specifically, we use a joint compositional scheme of a mask-based enhancement network, an attention-based encoder-decoder network and a discriminant network during training. The discriminator is used to distinguish between the enhanced features from the enhancement network and clean features, which guides the enhancement network to produce outputs closer to the realistic distribution. With the joint optimization of the recognition, enhancement and adversarial losses, the compositional scheme is expected to learn more robust representations for the recognition task automatically. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6\% over multi-condition training.",

keywords = "End-to-end speech recognition, Generative adversarial networks, Robust speech recognition, Speech enhancement",

author = "Bin Liu and Shuai Nie and Shan Liang and Wenju Liu and Meng Yu and Lianwu Chen and Shouye Peng and Changliang Li",

note = "Publisher Copyright: {\textcopyright} 2019 ISCA; 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 ; Conference date: 15-09-2019 Through 19-09-2019",

year = "2019",

doi = "10.21437/Interspeech.2019-1242",

language = "English",

volume = "2019-September",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

pages = "491--495",

booktitle = "20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019",

}

Liu, B, Nie, S, Liang, S, Liu, W, Yu, M, Chen, L, Peng, S & Li, C 2019, Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition. in 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019. vol. 2019-September, Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 491-495, 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019, Graz, Austria, 15/09/19. https://doi.org/10.21437/Interspeech.2019-1242

Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition. / Liu, Bin; Nie, Shuai; Liang, Shan et al.
20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019. Vol. 2019-September 2019. p. 491-495 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH).

TY - GEN

T1 - Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition

AU - Liu, Bin

AU - Nie, Shuai

AU - Liang, Shan

AU - Liu, Wenju

AU - Yu, Meng

AU - Chen, Lianwu

AU - Peng, Shouye

AU - Li, Changliang

PY - 2019

Y1 - 2019

N2 - Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, resulting in dramatic performance degradation in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front-end of ASR. However, enhancement modules can introduce speech distortions and mismatches with training, which sometimes degrades ASR performance. In this paper, we propose jointly adversarial enhancement training to boost the robustness of end-to-end systems. Specifically, we use a joint compositional scheme of a mask-based enhancement network, an attention-based encoder-decoder network and a discriminant network during training. The discriminator is used to distinguish between the enhanced features from the enhancement network and clean features, which guides the enhancement network to produce outputs closer to the realistic distribution. With the joint optimization of the recognition, enhancement and adversarial losses, the compositional scheme is expected to learn more robust representations for the recognition task automatically. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.

AB - Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, resulting in dramatic performance degradation in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front-end of ASR. However, enhancement modules can introduce speech distortions and mismatches with training, which sometimes degrades ASR performance. In this paper, we propose jointly adversarial enhancement training to boost the robustness of end-to-end systems. Specifically, we use a joint compositional scheme of a mask-based enhancement network, an attention-based encoder-decoder network and a discriminant network during training. The discriminator is used to distinguish between the enhanced features from the enhancement network and clean features, which guides the enhancement network to produce outputs closer to the realistic distribution. With the joint optimization of the recognition, enhancement and adversarial losses, the compositional scheme is expected to learn more robust representations for the recognition task automatically. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.

KW - End-to-end speech recognition

KW - Generative adversarial networks

KW - Robust speech recognition

KW - Speech enhancement

UR - http://www.scopus.com/inward/record.url?scp=85074682468&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2019-1242

DO - 10.21437/Interspeech.2019-1242

M3 - Conference Proceeding

AN - SCOPUS:85074682468

VL - 2019-September

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 491

EP - 495

BT - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019

T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019

Y2 - 15 September 2019 through 19 September 2019

ER -

Liu B, Nie S, Liang S, Liu W, Yu M, Chen L et al. Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition. In 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019. Vol. 2019-September. 2019. p. 491-495. (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH). doi: 10.21437/Interspeech.2019-1242
