TY - GEN
T1 - Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition
AU - Liu, Bin
AU - Nie, Shuai
AU - Liang, Shan
AU - Liu, Wenju
AU - Yu, Meng
AU - Chen, Lianwu
AU - Peng, Shouye
AU - Li, Changliang
N1 - Publisher Copyright:
© 2019 ISCA
PY - 2019
Y1 - 2019
N2 - Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, resulting in dramatic performance degradation in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front-end of ASR. However, enhancement modules can introduce speech distortions and training mismatches, which sometimes degrade ASR performance. In this paper, we propose a jointly adversarial enhancement training to boost the robustness of end-to-end systems. Specifically, during training we use a jointly compositional scheme of a mask-based enhancement network, an attention-based encoder-decoder network and a discriminant network. The discriminator is used to distinguish the enhanced features from the enhancement network from clean features, which guides the enhancement network toward outputs matching the realistic clean distribution. With joint optimization of the recognition, enhancement and adversarial losses, the compositional scheme is expected to automatically learn more robust representations for the recognition task. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.
AB - Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, resulting in dramatic performance degradation in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front-end of ASR. However, enhancement modules can introduce speech distortions and training mismatches, which sometimes degrade ASR performance. In this paper, we propose a jointly adversarial enhancement training to boost the robustness of end-to-end systems. Specifically, during training we use a jointly compositional scheme of a mask-based enhancement network, an attention-based encoder-decoder network and a discriminant network. The discriminator is used to distinguish the enhanced features from the enhancement network from clean features, which guides the enhancement network toward outputs matching the realistic clean distribution. With joint optimization of the recognition, enhancement and adversarial losses, the compositional scheme is expected to automatically learn more robust representations for the recognition task. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.
KW - End-to-end speech recognition
KW - Generative adversarial networks
KW - Robust speech recognition
KW - Speech enhancement
UR - http://www.scopus.com/inward/record.url?scp=85074682468&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2019-1242
DO - 10.21437/Interspeech.2019-1242
M3 - Conference Proceeding
AN - SCOPUS:85074682468
VL - 2019-September
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 491
EP - 495
BT - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Y2 - 15 September 2019 through 19 September 2019
ER -