TY - JOUR
T1 - Frequency-dependent auto-pooling function for weakly supervised sound event detection
AU - Liu, Sichen
AU - Yang, Feiran
AU - Cao, Yin
AU - Yang, Jun
N1 - Funding Information:
This work was supported by the Youth Innovation Promotion Association of Chinese Academy of Sciences under Grant 2018027, National Natural Science Foundation of China under Grants 11804368 and 11674348, IACAS Young Elite Researcher Project QNYC201812, National Key R&D Program of China under Grant 2017YFC0804900, and the Strategic Priority Research Program of Chinese Academy of Sciences under Grant XDC02020400.
Publisher Copyright:
© 2021, The Author(s).
PY - 2021/12
Y1 - 2021/12
N2 - Sound event detection (SED), which is typically treated as a supervised problem, aims at detecting the types of sound events and the corresponding temporal information. It requires estimating onset and offset annotations for sound events in each frame. Many available sound event datasets contain only audio tags without precise temporal information; this type of dataset is therefore classified as weakly labeled. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is experimentally shown to be more effective and computationally lighter than a “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP outperforms the state-of-the-art source separation-based method on the SED task.
AB - Sound event detection (SED), which is typically treated as a supervised problem, aims at detecting the types of sound events and the corresponding temporal information. It requires estimating onset and offset annotations for sound events in each frame. Many available sound event datasets contain only audio tags without precise temporal information; this type of dataset is therefore classified as weakly labeled. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is experimentally shown to be more effective and computationally lighter than a “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP outperforms the state-of-the-art source separation-based method on the SED task.
KW - Auto-pooling function
KW - Depthwise separable convolution
KW - Sound event detection
KW - Weakly supervised
UR - http://www.scopus.com/inward/record.url?scp=85106048012&partnerID=8YFLogxK
U2 - 10.1186/s13636-021-00206-7
DO - 10.1186/s13636-021-00206-7
M3 - Article
AN - SCOPUS:85106048012
SN - 1687-4714
VL - 2021
JO - EURASIP Journal on Audio, Speech, and Music Processing
JF - EURASIP Journal on Audio, Speech, and Music Processing
IS - 1
M1 - 19
ER -