TY - GEN
T1 - Exploiting Large-Scale Teacher-Student Training for On-Device Acoustic Models
AU - Liu, Jing
AU - Swaminathan, Rupak Vignesh
AU - Parthasarathi, Sree Hari Krishnan
AU - Lyu, Chunchuan
AU - Mouchtaris, Athanasios
AU - Kunzmann, Siegfried
N1 - Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM), with experiments spanning over 3,000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small-footprint setting, showing that a smaller-capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by a 14.3% word error rate reduction (WERR). When the supervised data is increased seven-fold, our gains diminish to 7.1% WERR; to improve SSL efficiency in larger supervised data regimes, we employ step-wise distillation into a smaller model, obtaining a WERR of 14.4%. We then switch to SSL using larger student models in low data regimes; while learning efficiency with unsupervised data is higher, student models may outperform teacher models in such a setting. We develop a theoretical sketch to explain this behavior.
AB - We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM), with experiments spanning over 3,000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small-footprint setting, showing that a smaller-capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by a 14.3% word error rate reduction (WERR). When the supervised data is increased seven-fold, our gains diminish to 7.1% WERR; to improve SSL efficiency in larger supervised data regimes, we employ step-wise distillation into a smaller model, obtaining a WERR of 14.4%. We then switch to SSL using larger student models in low data regimes; while learning efficiency with unsupervised data is higher, student models may outperform teacher models in such a setting. We develop a theoretical sketch to explain this behavior.
KW - Acoustic models
KW - Edge computing
KW - Semi-supervised learning
KW - Speech recognition
KW - Student-teacher learning
UR - http://www.scopus.com/inward/record.url?scp=85115201916&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-83527-9_35
DO - 10.1007/978-3-030-83527-9_35
M3 - Conference Proceeding
AN - SCOPUS:85115201916
SN - 9783030835262
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 413
EP - 424
BT - Text, Speech, and Dialogue - 24th International Conference, TSD 2021, Proceedings
A2 - Ekštein, Kamil
A2 - Pártl, František
A2 - Konopík, Miloslav
PB - Springer Science and Business Media Deutschland GmbH
T2 - 24th International Conference on Text, Speech, and Dialogue, TSD 2021
Y2 - 6 September 2021 through 9 September 2021
ER -