TY - GEN
T1 - Focal Loss and Double-edge-triggered Detector for Robust Small-footprint Keyword Spotting
AU - Liu, Bin
AU - Nie, Shuai
AU - Zhang, Yaping
AU - Liang, Shan
AU - Yang, Zhanlei
AU - Liu, Wenju
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019
Y1 - 2019
AB - A keyword spotting (KWS) system, which detects a specific keyword in a continuous stream of audio, is a critical component of human-computer interfaces. The goal of KWS is to provide high detection accuracy at a low false alarm rate with small memory and computation requirements. A DNN-based KWS system faces a large class imbalance during training because far less data is usually available for the keyword than for background speech; the abundant background samples overwhelm training and lead to a degenerate model. In this paper, we explore focal loss for training a small-footprint KWS system. Focal loss automatically down-weights the contribution of easy samples during training and focuses the model on hard samples, which naturally addresses the class imbalance and allows us to use all available data efficiently. Furthermore, many keywords of Chinese conversational assistants are repeated words owing to idiomatic usage, such as 'XIAO DU XIAO DU'. We propose a double-edge-triggered detecting method for repeated keywords, which significantly reduces the false alarm rate relative to the single-threshold method. Systematic experiments demonstrate significant improvements over the baseline system.
KW - double-edge-triggered detecting method
KW - focal loss
KW - keyword spotting
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85068980861&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2019.8682534
DO - 10.1109/ICASSP.2019.8682534
M3 - Conference Proceeding
AN - SCOPUS:85068980861
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6361
EP - 6365
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -