MKD: Mixup-Based Knowledge Distillation for Mandarin End-to-End Speech Recognition

Xing Wu*, Yifan Jin, Jianjia Wang, Quan Qian, Yike Guo

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)


Large-scale automatic speech recognition (ASR) models have achieved impressive performance, but training them requires huge computational resources and massive amounts of data. Knowledge distillation is a prevalent model compression method that transfers knowledge from a large teacher model to a small student model. To improve the efficiency of knowledge distillation for end-to-end speech recognition, especially in low-resource settings, a Mixup-based Knowledge Distillation (MKD) method is proposed that combines Mixup, a data-agnostic data augmentation method, with softmax-level knowledge distillation. A loss-level mixture is presented to address the problem caused by the non-linearity of the label in the KL-divergence when applying Mixup to the teacher–student framework. It is shown mathematically that optimizing the mixed loss function is equivalent to optimizing an upper bound of the original knowledge distillation loss. The proposed MKD takes advantage of Mixup and makes the model robust even with a small amount of training data. Experiments on Aishell-1 show that MKD obtains relative improvements of 15.6% and 3.3% on two student models with different parameter scales compared with existing methods. Experiments on data efficiency demonstrate that MKD achieves similar results with only half of the original dataset.
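The loss-level mixture described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, function names, and temperature, not the authors' implementation. Because the KL-divergence is non-linear in the teacher distribution, the sketch mixes the two per-source distillation losses with the Mixup coefficient λ instead of mixing the teacher labels; by convexity of the KL-divergence, this mixture upper-bounds the KD loss computed against the mixed teacher distribution.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q), summed over the class axis.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def mkd_loss(student_logits_mixed, teacher_logits_a, teacher_logits_b, lam, T=2.0):
    """Loss-level mixture (sketch): the student sees the mixed input,
    and the two KL losses against the teacher outputs for the two
    original inputs are mixed with the same Mixup coefficient `lam`.
    Names, shapes, and the temperature T are hypothetical."""
    q = softmax(student_logits_mixed / T)
    p_a = softmax(teacher_logits_a / T)
    p_b = softmax(teacher_logits_b / T)
    return lam * kl_div(p_a, q).mean() + (1 - lam) * kl_div(p_b, q).mean()

# Mixup on two batches of acoustic feature frames (hypothetical shapes).
rng = np.random.default_rng(0)
lam = rng.beta(0.5, 0.5)               # Mixup coefficient drawn from Beta
x_a = rng.normal(size=(4, 80))         # e.g. 4 frames of 80-dim features
x_b = rng.normal(size=(4, 80))
x_mixed = lam * x_a + (1 - lam) * x_b  # input fed to the student
```

In practice the teacher would be run separately on `x_a` and `x_b` to produce `teacher_logits_a` and `teacher_logits_b`, while the student is run once on `x_mixed`.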

Original language: English
Article number: 160
Issue number: 5
Publication status: Published - May 2022
Externally published: Yes


  • data efficiency
  • end-to-end speech recognition
  • knowledge distillation
  • mixup
  • model compression


