TY - JOUR
T1 - Frequency-dependent auto-pooling function for weakly supervised sound event detection
AU - Liu, Sichen
AU - Yang, Feiran
AU - Cao, Yin
AU - Yang, Jun
N1 - Funding Information:
This work was supported by the Youth Innovation Promotion Association of Chinese Academy of Sciences under Grant 2018027, National Natural Science Foundation of China under Grants 11804368 and 11674348, IACAS Young Elite Researcher Project QNYC201812, National Key R&D Program of China under Grant 2017YFC0804900, and the Strategic Priority Research Program of Chinese Academy of Sciences under Grant XDC02020400.
Publisher Copyright:
© 2021, The Author(s).
PY - 2021/12
Y1 - 2021/12
N2 - Sound event detection (SED), which is typically treated as a supervised problem, aims at detecting the types of sound events and the corresponding temporal information. It requires estimating onset and offset annotations for sound events in each frame. Many available sound event datasets contain only audio tags without precise temporal information; this type of dataset is therefore classified as weakly labeled. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is experimentally shown to be more effective and computationally lighter than a “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP outperforms the state-of-the-art source separation-based method on the SED task.
AB - Sound event detection (SED), which is typically treated as a supervised problem, aims at detecting the types of sound events and the corresponding temporal information. It requires estimating onset and offset annotations for sound events in each frame. Many available sound event datasets contain only audio tags without precise temporal information; this type of dataset is therefore classified as weakly labeled. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is experimentally shown to be more effective and computationally lighter than a “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP outperforms the state-of-the-art source separation-based method on the SED task.
KW - Auto-pooling function
KW - Depthwise separable convolution
KW - Sound event detection
KW - Weakly supervised
UR - http://www.scopus.com/inward/record.url?scp=85106048012&partnerID=8YFLogxK
U2 - 10.1186/s13636-021-00206-7
DO - 10.1186/s13636-021-00206-7
M3 - Article
AN - SCOPUS:85106048012
SN - 1687-4714
VL - 2021
JO - EURASIP Journal on Audio, Speech, and Music Processing
JF - EURASIP Journal on Audio, Speech, and Music Processing
IS - 1
M1 - 19
ER -