MKD: Mixup-Based Knowledge Distillation for Mandarin End-to-End Speech Recognition

Xing Wu*, Yifan Jin, Jianjia Wang, Quan Qian, Yike Guo

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)


Large-scale automatic speech recognition (ASR) models have achieved impressive performance, but training them requires huge computational resources and massive amounts of data. Knowledge distillation is a prevalent model compression method that transfers knowledge from a large teacher model to a small student model. To improve the efficiency of knowledge distillation for end-to-end speech recognition, especially in low-resource settings, a Mixup-based Knowledge Distillation (MKD) method is proposed that combines Mixup, a data-agnostic data augmentation method, with softmax-level knowledge distillation. A loss-level mixture is presented to address the problem caused by the non-linearity of the label in the KL-divergence when applying Mixup to the teacher–student framework. It is shown mathematically that optimizing the mixed loss function is equivalent to optimizing an upper bound of the original knowledge distillation loss. The proposed MKD takes advantage of Mixup and makes the model robust even with a small amount of training data. Experiments on Aishell-1 show that MKD obtains relative improvements of 15.6% and 3.3% on two student models with different parameter scales compared with existing methods. Experiments on data efficiency demonstrate that MKD achieves similar results with only half of the original dataset.
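The loss-level mixture described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, function names, and temperature, not the authors' implementation. Because the KL-divergence is non-linear in the teacher distribution, the sketch mixes the two per-source distillation losses with the Mixup coefficient λ instead of mixing the teacher labels; by convexity of the KL-divergence, this mixture upper-bounds the KD loss computed against the mixed teacher distribution.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q), summed over the class axis.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def mkd_loss(student_logits_mixed, teacher_logits_a, teacher_logits_b, lam, T=2.0):
    """Loss-level mixture (sketch): the student sees the mixed input,
    and the two KL losses against the teacher outputs for the two
    original inputs are mixed with the same Mixup coefficient `lam`.
    Names, shapes, and the temperature T are hypothetical."""
    q = softmax(student_logits_mixed / T)
    p_a = softmax(teacher_logits_a / T)
    p_b = softmax(teacher_logits_b / T)
    return lam * kl_div(p_a, q).mean() + (1 - lam) * kl_div(p_b, q).mean()

# Mixup on two batches of acoustic feature frames (hypothetical shapes).
rng = np.random.default_rng(0)
lam = rng.beta(0.5, 0.5)               # Mixup coefficient drawn from Beta
x_a = rng.normal(size=(4, 80))         # e.g. 4 frames of 80-dim features
x_b = rng.normal(size=(4, 80))
x_mixed = lam * x_a + (1 - lam) * x_b  # input fed to the student
```

In practice the teacher would be run separately on `x_a` and `x_b` to produce `teacher_logits_a` and `teacher_logits_b`, while the student is run once on `x_mixed`.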

Original language: English
Article number: 160
Issue number: 5
Publication status: Published - May 2022
Externally published: Yes


  • data efficiency
  • end-to-end speech recognition
  • knowledge distillation
  • mixup
  • model compression


