Frequency-dependent auto-pooling function for weakly supervised sound event detection

Sichen Liu, Feiran Yang*, Yin Cao, Jun Yang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Sound event detection (SED), which is typically treated as a supervised problem, aims to detect the types of sound events in a recording and their temporal locations. This requires estimating the onset and offset of each sound event at the frame level. Many available sound event datasets, however, contain only audio tags without precise temporal information; such datasets are referred to as weakly labeled. In this paper, we propose a novel source separation-based method, trained on weakly labeled data, to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate the time-frequency (T-F) mask of each sound event from a T-F representation of an audio clip. The DDC-block is shown experimentally to be more effective and computationally lighter than a “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP outperforms the state-of-the-art source separation-based method on the SED task.
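The abstract does not give the FAP formula, so the following is only an illustrative sketch of what a frequency-dependent auto-pooling step might look like, not the authors' implementation. It assumes the commonly used auto-pooling form (a softmax-weighted average over time with a sharpness parameter alpha) and further assumes one alpha per frequency bin followed by a plain average over frequency. The function names `auto_pool` and `frequency_dependent_auto_pool`, the fixed alpha values, and the final averaging step are all hypothetical; in a trained model the alphas would be learned.

```python
import math


def auto_pool(values, alpha):
    """Softmax-weighted average of `values` with sharpness `alpha`.

    alpha = 0 reduces to the mean; a large positive alpha approaches the max.
    """
    # Subtract the largest exponent before exponentiating for numerical stability.
    peak = max(alpha * v for v in values)
    weights = [math.exp(alpha * v - peak) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def frequency_dependent_auto_pool(mask, alphas):
    """Pool a T-F mask for one event class into a single clip-level score.

    mask[t][f] holds mask values in [0, 1]; alphas[f] is one sharpness value
    per frequency bin (assumption: fixed here, learned in a real model).
    Each bin is pooled over time with its own alpha, then the per-bin results
    are averaged over frequency (assumption: the paper may aggregate differently).
    """
    n_freq = len(alphas)
    per_bin = [
        auto_pool([frame[f] for frame in mask], alphas[f]) for f in range(n_freq)
    ]
    return sum(per_bin) / n_freq


if __name__ == "__main__":
    # Toy mask: 3 time frames, 2 frequency bins.
    toy_mask = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]
    print(frequency_dependent_auto_pool(toy_mask, alphas=[0.0, 5.0]))
```

Under these assumptions, a per-bin alpha lets the pooling behave like mean pooling in some frequency bands and closer to max pooling in others, which is one way frequency characteristics of an event could influence the clip-level score.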

Original language: English
Article number: 19
Journal: EURASIP Journal on Audio, Speech, and Music Processing
Volume: 2021
Issue number: 1
Publication status: Published - Dec 2021
Externally published: Yes

Keywords

  • Auto-pooling function
  • Depthwise separable convolution
  • Sound event detection
  • Weakly supervised
