Learning Attentive Representations for Environmental Sound Classification

Zhichao Zhang, Shugong Xu*, Shunqing Zhang, Tianhao Qiao, Shan Cao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

56 Citations (Scopus)

Abstract

Environmental sound classification (ESC) is a challenging problem due to the complex temporal structure and diverse energy modulation patterns of environmental sounds. To deal with the former, temporal attention mechanisms have been adopted to focus on informative frames; however, no existing work addresses the latter. In this paper, we consider the role of convolution filters in detecting energy modulation patterns and propose a channel attention mechanism that focuses on the semantically relevant channels generated by the corresponding filters. Furthermore, we combine temporal attention and channel attention to enhance the representational power of the CNN by generating complementary information. In addition, to avoid overfitting caused by limited training data, we explore a data augmentation scheme, which is another contribution of this paper. We evaluate the proposed method on three benchmark ESC datasets: ESC-10, ESC-50, and DCASE2016. Experimental results demonstrate the effectiveness of the proposed method, which achieves state-of-the-art or competitive classification accuracy. Finally, we visualize the attention results and observe that the proposed attention mechanism leads the network to focus on the semantically relevant parts of environmental sounds.
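The abstract describes the two attention mechanisms only at a high level. As a rough illustration of how temporal and channel attention can be combined over CNN features, here is a minimal PyTorch sketch; the module name, tensor shapes, the squeeze-and-excitation-style channel branch, and the multiplicative fusion are our assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TemporalChannelAttention(nn.Module):
    """Hypothetical sketch: temporal attention weights informative frames,
    channel attention weights semantically relevant filter channels.
    Shapes and fusion are illustrative assumptions, not the paper's design."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Temporal attention: score each time frame from its channel profile.
        self.temporal_score = nn.Linear(channels, 1)
        # Channel attention: squeeze over time, then excite per channel
        # (squeeze-and-excitation style, assumed here).
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels), e.g. CNN features pooled over frequency.
        t_weights = torch.softmax(self.temporal_score(x), dim=1)  # (B, T, 1)
        c_weights = torch.sigmoid(
            self.channel_mlp(x.mean(dim=1))  # squeeze over time: (B, C)
        ).unsqueeze(1)                       # (B, 1, C)
        # Fuse the two attentions multiplicatively (an assumption),
        # then pool over time to get a clip-level embedding.
        return (x * t_weights * c_weights).sum(dim=1)  # (B, C)
```

A clip-level embedding like this would then feed a classifier over the ESC-10/ESC-50/DCASE2016 label sets; since the paper's keywords name a convolutional recurrent network, the actual attention presumably sits on recurrent features rather than this simplified stack.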

Original language: English
Article number: 8823934
Pages (from-to): 130327-130339
Number of pages: 13
Journal: IEEE Access
Volume: 7
DOIs
Publication status: Published - 2019
Externally published: Yes

Keywords

  • attention mechanism
  • convolutional recurrent neural network
  • data augmentation
  • environmental sound classification
