TY - JOUR
T1 - High accurate environmental sound classification
T2 - Sub-spectrogram segmentation versus temporal-frequency attention mechanism
AU - Qiao, Tianhao
AU - Zhang, Shunqing
AU - Cao, Shan
AU - Xu, Shugong
N1 - Publisher Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2021/8/2
Y1 - 2021/8/2
N2 - In the important and challenging field of environmental sound classification (ESC), a crucial and even decisive factor is the feature representation ability, which can directly affect the accuracy of classification. Therefore, the classification performance often depends to a large extent on whether the effective representative features can be extracted from the environmental sound. In this paper, we firstly propose a sub-spectrogram segmentation with score level fusion based ESC classification framework, and we adopt the proposed convolutional recurrent neural network (CRNN) for improving the classification accuracy. By evaluating numerous truncation schemes, we numerically figure out the optimal number of sub-spectrograms and the corresponding band ranges, and, on this basis, we propose a joint attention mechanism with temporal and frequency attention mechanisms and use the global attention mechanism when generating the attention map. Finally, the numerical results show that the two frameworks we proposed can achieve 82.1% and 86.4% classification accuracy on the public environmental sound dataset ESC-50, respectively, which is equivalent to more than 13.5% improvement over the traditional baseline scheme.
AB - In the important and challenging field of environmental sound classification (ESC), a crucial and even decisive factor is the feature representation ability, which can directly affect the accuracy of classification. Therefore, the classification performance often depends to a large extent on whether the effective representative features can be extracted from the environmental sound. In this paper, we firstly propose a sub-spectrogram segmentation with score level fusion based ESC classification framework, and we adopt the proposed convolutional recurrent neural network (CRNN) for improving the classification accuracy. By evaluating numerous truncation schemes, we numerically figure out the optimal number of sub-spectrograms and the corresponding band ranges, and, on this basis, we propose a joint attention mechanism with temporal and frequency attention mechanisms and use the global attention mechanism when generating the attention map. Finally, the numerical results show that the two frameworks we proposed can achieve 82.1% and 86.4% classification accuracy on the public environmental sound dataset ESC-50, respectively, which is equivalent to more than 13.5% improvement over the traditional baseline scheme.
KW - Convolutional recurrent neural network
KW - Environmental sound classification
KW - Score level fusion
KW - Sub-spectrogram segmentation
KW - Temporal-frequency attention mechanism
UR - http://www.scopus.com/inward/record.url?scp=85112496906&partnerID=8YFLogxK
U2 - 10.3390/s21165500
DO - 10.3390/s21165500
M3 - Article
C2 - 34450942
AN - SCOPUS:85112496906
SN - 1424-8220
VL - 21
JO - Sensors
JF - Sensors
IS - 16
M1 - 5500
ER -