TY - GEN
T1 - A comparison of attention mechanisms of convolutional neural network in weakly labeled audio tagging
AU - Hou, Yuanbo
AU - Kong, Qiuqiang
AU - Li, Shengchen
N1 - Publisher Copyright:
© Springer Nature Singapore Pte Ltd. 2019.
PY - 2019
Y1 - 2019
N2 - Audio tagging aims to predict the types of sound events occurring in audio clips. Recently, the convolutional recurrent neural network (CRNN) has achieved state-of-the-art performance in audio tagging. In a CRNN, convolutional layers are applied to input audio features to extract high-level representations, followed by recurrent layers. To better learn high-level representations of acoustic features, attention mechanisms were introduced into the convolutional layers of the CRNN. Attention is a learning technique that can steer the model toward information important to the task, yielding better performance. The two attention mechanisms considered in the CRNN, the Squeeze-and-Excitation (SE) block and the gated linear unit (GLU), are both based on a gating mechanism, but their focuses differ. To compare the performance of the SE block and the GLU, we propose a CRNN with an SE block (SE-CRNN) and a CRNN with a GLU (GLU-CRNN) for weakly labeled audio tagging, and compare these results with a CRNN baseline. The experiments show that the GLU-CRNN achieves an area under the curve score of 0.877 in polyphonic audio tagging, outperforming the SE-CRNN (0.865) and the CRNN baseline (0.838). The results show that GLU-based attention outperforms SE-based attention in the CRNN for weakly labeled polyphonic audio tagging.
AB - Audio tagging aims to predict the types of sound events occurring in audio clips. Recently, the convolutional recurrent neural network (CRNN) has achieved state-of-the-art performance in audio tagging. In a CRNN, convolutional layers are applied to input audio features to extract high-level representations, followed by recurrent layers. To better learn high-level representations of acoustic features, attention mechanisms were introduced into the convolutional layers of the CRNN. Attention is a learning technique that can steer the model toward information important to the task, yielding better performance. The two attention mechanisms considered in the CRNN, the Squeeze-and-Excitation (SE) block and the gated linear unit (GLU), are both based on a gating mechanism, but their focuses differ. To compare the performance of the SE block and the GLU, we propose a CRNN with an SE block (SE-CRNN) and a CRNN with a GLU (GLU-CRNN) for weakly labeled audio tagging, and compare these results with a CRNN baseline. The experiments show that the GLU-CRNN achieves an area under the curve score of 0.877 in polyphonic audio tagging, outperforming the SE-CRNN (0.865) and the CRNN baseline (0.838). The results show that GLU-based attention outperforms SE-based attention in the CRNN for weakly labeled polyphonic audio tagging.
KW - Audio tagging
KW - Convolutional neural network (CNN)
KW - Convolutional recurrent neural network (CRNN)
KW - Gated linear unit (GLU)
KW - Squeeze-and-Excitation (SE) block
UR - http://www.scopus.com/inward/record.url?scp=85070778521&partnerID=8YFLogxK
U2 - 10.1007/978-981-13-8707-4_8
DO - 10.1007/978-981-13-8707-4_8
M3 - Conference Proceeding
AN - SCOPUS:85070778521
SN - 9789811387067
T3 - Lecture Notes in Electrical Engineering
SP - 85
EP - 96
BT - Proceedings of the 6th Conference on Sound and Music Technology, CSMT - Revised Selected Papers, 2018
A2 - Li, Wei
A2 - Li, Shengchen
A2 - Shao, Xi
A2 - Li, Zijin
PB - Springer Verlag
T2 - 6th Conference on Sound and Music Technology, CSMT 2018
Y2 - 24 November 2018 through 26 November 2018
ER -