TY - GEN
T1 - A comparison of attention mechanisms of convolutional neural network in weakly labeled audio tagging
AU - Hou, Yuanbo
AU - Kong, Qiuqiang
AU - Li, Shengchen
N1 - Publisher Copyright:
© Springer Nature Singapore Pte Ltd. 2019.
PY - 2019
Y1 - 2019
N2 - Audio tagging aims to predict the types of sound events occurring in audio clips. Recently, the convolutional recurrent neural network (CRNN) has achieved state-of-the-art performance in audio tagging. In a CRNN, convolutional layers are applied to input audio features to extract high-level representations, followed by recurrent layers. To better learn high-level representations of acoustic features, attention mechanisms were introduced into the convolutional layers of the CRNN. Attention is a learning technique that can steer the model toward information important to the task, yielding better performance. The two attention mechanisms considered in the CRNN, the Squeeze-and-Excitation (SE) block and the gated linear unit (GLU), are both based on a gating mechanism, but their focuses differ. To compare the performance of the SE block and the GLU, we propose a CRNN with an SE block (SE-CRNN) and a CRNN with a GLU (GLU-CRNN) for weakly labeled audio tagging, and compare these results with a CRNN baseline. The experiments show that the GLU-CRNN achieves an area under the curve score of 0.877 in polyphonic audio tagging, outperforming the SE-CRNN (0.865) and the CRNN baseline (0.838). The results show that GLU-based attention outperforms SE-based attention in the CRNN for weakly labeled polyphonic audio tagging.
AB - Audio tagging aims to predict the types of sound events occurring in audio clips. Recently, the convolutional recurrent neural network (CRNN) has achieved state-of-the-art performance in audio tagging. In a CRNN, convolutional layers are applied to input audio features to extract high-level representations, followed by recurrent layers. To better learn high-level representations of acoustic features, attention mechanisms were introduced into the convolutional layers of the CRNN. Attention is a learning technique that can steer the model toward information important to the task, yielding better performance. The two attention mechanisms considered in the CRNN, the Squeeze-and-Excitation (SE) block and the gated linear unit (GLU), are both based on a gating mechanism, but their focuses differ. To compare the performance of the SE block and the GLU, we propose a CRNN with an SE block (SE-CRNN) and a CRNN with a GLU (GLU-CRNN) for weakly labeled audio tagging, and compare these results with a CRNN baseline. The experiments show that the GLU-CRNN achieves an area under the curve score of 0.877 in polyphonic audio tagging, outperforming the SE-CRNN (0.865) and the CRNN baseline (0.838). The results show that GLU-based attention outperforms SE-based attention in the CRNN for weakly labeled polyphonic audio tagging.
KW - Audio tagging
KW - Convolutional neural network (CNN)
KW - Convolutional recurrent neural network (CRNN)
KW - Gated linear unit (GLU)
KW - Squeeze-and-Excitation (SE) block
UR - http://www.scopus.com/inward/record.url?scp=85070778521&partnerID=8YFLogxK
U2 - 10.1007/978-981-13-8707-4_8
DO - 10.1007/978-981-13-8707-4_8
M3 - Conference Proceeding
AN - SCOPUS:85070778521
SN - 9789811387067
T3 - Lecture Notes in Electrical Engineering
SP - 85
EP - 96
BT - Proceedings of the 6th Conference on Sound and Music Technology, CSMT - Revised Selected Papers, 2018
A2 - Li, Wei
A2 - Li, Shengchen
A2 - Shao, Xi
A2 - Li, Zijin
PB - Springer Verlag
T2 - 6th Conference on Sound and Music Technology, CSMT 2018
Y2 - 24 November 2018 through 26 November 2018
ER -