Abstract
Environmental sound classification (ESC) is a challenging problem because environmental sounds have complex temporal structure and diverse energy modulation patterns. To deal with the former, temporal attention mechanisms have been adopted to focus on the informative frames. However, no existing work addresses the latter. In this paper, we consider the role of convolution filters in detecting energy modulation patterns and propose a channel attention mechanism that focuses on the semantically relevant channels generated by the corresponding filters. Furthermore, we combine temporal attention and channel attention so that they generate complementary information and enhance the representational power of the CNN. In addition, to avoid possible overfitting caused by limited training data, we explore a data augmentation scheme, which is another contribution of this paper. We evaluate the proposed method on three benchmark ESC datasets: ESC-10, ESC-50, and DCASE2016. Experimental results demonstrate the effectiveness of the proposed method, which achieves state-of-the-art or competitive classification accuracy. Finally, we visualize the attention results and observe that the proposed attention mechanism leads the network to focus on the semantically relevant parts of environmental sounds.
Original language | English |
---|---|
Article number | 8823934 |
Pages (from-to) | 130327-130339 |
Number of pages | 13 |
Journal | IEEE Access |
Volume | 7 |
Publication status | Published - 2019 |
Externally published | Yes |
Keywords
- attention mechanism
- convolutional recurrent neural network
- data augmentation
- environmental sound classification