High accurate environmental sound classification: Sub-spectrogram segmentation versus temporal-frequency attention mechanism

Tianhao Qiao; Shunqing Zhang; Shan Cao; Shugong Xu

doi:10.3390/s21165500

Tianhao Qiao, Shunqing Zhang^*, Shan Cao, Shugong Xu

^*Corresponding author for this work

Shanghai University

Research output: Contribution to journal › Article › peer-review

9 Citations (Scopus)

Abstract

In the important and challenging field of environmental sound classification (ESC), a crucial and even decisive factor is the feature representation ability, which can directly affect the accuracy of classification. Therefore, the classification performance often depends to a large extent on whether the effective representative features can be extracted from the environmental sound. In this paper, we firstly propose a sub-spectrogram segmentation with score level fusion based ESC classification framework, and we adopt the proposed convolutional recurrent neural network (CRNN) for improving the classification accuracy. By evaluating numerous truncation schemes, we numerically figure out the optimal number of sub-spectrograms and the corresponding band ranges, and, on this basis, we propose a joint attention mechanism with temporal and frequency attention mechanisms and use the global attention mechanism when generating the attention map. Finally, the numerical results show that the two frameworks we proposed can achieve 82.1% and 86.4% classification accuracy on the public environmental sound dataset ESC-50, respectively, which is equivalent to more than 13.5% improvement over the traditional baseline scheme.
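
The idea of sub-spectrogram segmentation with score level fusion can be illustrated with a minimal sketch. It is not the authors' implementation: the band edges, the spectrogram size, and `toy_band_classifier` (a random placeholder standing in for a trained per-band CRNN) are all assumptions made for illustration. Only the overall flow follows the abstract: split the spectrogram along frequency, score each band separately, then fuse the per-band class scores.

```python
import math
import random


def split_into_sub_spectrograms(spec, band_edges):
    """Slice a spectrogram (list of frequency rows, each a list of frames) into bands."""
    return [spec[lo:hi] for lo, hi in zip(band_edges[:-1], band_edges[1:])]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_band_classifier(sub_spec, n_classes, seed):
    """Hypothetical stand-in for a trained per-band model (NOT the paper's CRNN)."""
    rng = random.Random(seed)
    # Average each frequency bin over time, then apply a random linear projection.
    band_energy = [sum(row) / len(row) for row in sub_spec]
    logits = [sum(rng.gauss(0, 1) * e for e in band_energy) for _ in range(n_classes)]
    return softmax(logits)


def score_level_fusion(per_band_scores, weights=None):
    """Fuse per-band class scores by a weighted average (uniform by default)."""
    n_bands = len(per_band_scores)
    if weights is None:
        weights = [1.0 / n_bands] * n_bands
    n_classes = len(per_band_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, per_band_scores))
        for c in range(n_classes)
    ]


# Assumed sizes: 128 frequency bins, 20 frames; ESC-50 has 50 classes.
N_BINS, N_FRAMES, N_CLASSES = 128, 20, 50
rng = random.Random(0)
spec = [[rng.random() for _ in range(N_FRAMES)] for _ in range(N_BINS)]

band_edges = [0, 32, 64, 96, 128]  # hypothetical split into four equal bands
bands = split_into_sub_spectrograms(spec, band_edges)
band_scores = [toy_band_classifier(b, N_CLASSES, seed=i) for i, b in enumerate(bands)]
fused = score_level_fusion(band_scores)
predicted_class = max(range(N_CLASSES), key=fused.__getitem__)
```

Uniform averaging is only one possible fusion rule; the abstract does not state the weights, the number of bands, or the band ranges the authors found optimal, so those would have to be taken from the paper itself.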

Original language	English
Article number	5500
Journal	Sensors
Volume	21
Issue number	16
DOIs	https://doi.org/10.3390/s21165500
Publication status	Published - 2 Aug 2021
Externally published	Yes

Keywords

Convolutional recurrent neural network
Environmental sound classification
Score level fusion
Sub-spectrogram segmentation
Temporal-frequency attention mechanism

Access to Document

10.3390/s21165500

Cite this

@article{8995ddc7bbd04bbdac96d27d49fc42da,

title = "High accurate environmental sound classification: Sub-spectrogram segmentation versus temporal-frequency attention mechanism",

abstract = "In the important and challenging field of environmental sound classification (ESC), a crucial and even decisive factor is the feature representation ability, which can directly affect the accuracy of classification. Therefore, the classification performance often depends to a large extent on whether the effective representative features can be extracted from the environmental sound. In this paper, we firstly propose a sub-spectrogram segmentation with score level fusion based ESC classification framework, and we adopt the proposed convolutional recurrent neural network (CRNN) for improving the classification accuracy. By evaluating numerous truncation schemes, we numerically figure out the optimal number of sub-spectrograms and the corresponding band ranges, and, on this basis, we propose a joint attention mechanism with temporal and frequency attention mechanisms and use the global attention mechanism when generating the attention map. Finally, the numerical results show that the two frameworks we proposed can achieve 82.1% and 86.4% classification accuracy on the public environmental sound dataset ESC-50, respectively, which is equivalent to more than 13.5% improvement over the traditional baseline scheme.",

keywords = "Convolutional recurrent neural network, Environmental sound classification, Score level fusion, Sub-spectrogram segmentation, Temporal-frequency attention mechanism",

author = "Tianhao Qiao and Shunqing Zhang and Shan Cao and Shugong Xu",

note = "Publisher Copyright: {\textcopyright} 2021 by the authors. Licensee MDPI, Basel, Switzerland.",

year = "2021",

month = aug,

day = "2",

doi = "10.3390/s21165500",

language = "English",

volume = "21",

journal = "Sensors",

issn = "1424-8220",

publisher = "MDPI (Basel, Switzerland)",

number = "16",

}

TY - JOUR

T1 - High accurate environmental sound classification

T2 - Sub-spectrogram segmentation versus temporal-frequency attention mechanism

AU - Qiao, Tianhao

AU - Zhang, Shunqing

AU - Cao, Shan

AU - Xu, Shugong

PY - 2021/8/2

Y1 - 2021/8/2

N2 - In the important and challenging field of environmental sound classification (ESC), a crucial and even decisive factor is the feature representation ability, which can directly affect the accuracy of classification. Therefore, the classification performance often depends to a large extent on whether the effective representative features can be extracted from the environmental sound. In this paper, we firstly propose a sub-spectrogram segmentation with score level fusion based ESC classification framework, and we adopt the proposed convolutional recurrent neural network (CRNN) for improving the classification accuracy. By evaluating numerous truncation schemes, we numerically figure out the optimal number of sub-spectrograms and the corresponding band ranges, and, on this basis, we propose a joint attention mechanism with temporal and frequency attention mechanisms and use the global attention mechanism when generating the attention map. Finally, the numerical results show that the two frameworks we proposed can achieve 82.1% and 86.4% classification accuracy on the public environmental sound dataset ESC-50, respectively, which is equivalent to more than 13.5% improvement over the traditional baseline scheme.

AB - In the important and challenging field of environmental sound classification (ESC), a crucial and even decisive factor is the feature representation ability, which can directly affect the accuracy of classification. Therefore, the classification performance often depends to a large extent on whether the effective representative features can be extracted from the environmental sound. In this paper, we firstly propose a sub-spectrogram segmentation with score level fusion based ESC classification framework, and we adopt the proposed convolutional recurrent neural network (CRNN) for improving the classification accuracy. By evaluating numerous truncation schemes, we numerically figure out the optimal number of sub-spectrograms and the corresponding band ranges, and, on this basis, we propose a joint attention mechanism with temporal and frequency attention mechanisms and use the global attention mechanism when generating the attention map. Finally, the numerical results show that the two frameworks we proposed can achieve 82.1% and 86.4% classification accuracy on the public environmental sound dataset ESC-50, respectively, which is equivalent to more than 13.5% improvement over the traditional baseline scheme.

KW - Convolutional recurrent neural network

KW - Environmental sound classification

KW - Score level fusion

KW - Sub-spectrogram segmentation

KW - Temporal-frequency attention mechanism

UR - http://www.scopus.com/inward/record.url?scp=85112496906&partnerID=8YFLogxK

U2 - 10.3390/s21165500

DO - 10.3390/s21165500

M3 - Article

C2 - 34450942

AN - SCOPUS:85112496906

SN - 1424-8220

VL - 21

JO - Sensors

JF - Sensors

IS - 16

M1 - 5500

ER -
