Text-Queried Target Sound Event Localization

Jinzheng Zhao*, Xinyuan Qian, Yong Xu, Haohe Liu*, Yin Cao, Davide Berghi*, Wenwu Wang*

*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

1 Citation (Scopus)

Abstract

Sound event localization and detection (SELD) aims to detect the occurrence of sound classes together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of a fixed set of classes, for example, the 13 classes used in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm in which the user inputs text describing a sound event and the SEL model predicts the location of that sound event. The proposed task offers a more user-friendly form of human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created with simulated and real room impulse responses (RIRs) to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire interest and further research in text-queried sound source localization.
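The core idea in the abstract can be illustrated with a minimal, hypothetical sketch: embed the text query and per-direction audio features into a shared space, then report the DOA whose audio embedding best matches the query. The `localize` function and the toy embeddings below are illustrative stand-ins, not the paper's actual model or encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def localize(query_embedding, doa_candidates):
    """Return the DOA (degrees) whose audio embedding is most similar
    to the text-query embedding.

    doa_candidates: dict mapping DOA angle -> audio embedding at that angle.
    """
    return max(doa_candidates,
               key=lambda doa: cosine(query_embedding, doa_candidates[doa]))

# Toy example with hand-made 3-d embeddings: the 90-degree direction
# is most aligned with the query, so it is returned.
query = [1.0, 0.2, 0.0]
candidates = {0: [0.1, 1.0, 0.3], 90: [0.9, 0.3, 0.1], 180: [0.0, 0.2, 1.0]}
print(localize(query, candidates))  # -> 90
```

In a learned system the embeddings would come from trained text and audio encoders, but the matching step reduces to this kind of similarity search over candidate directions.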

Original language: English
Title of host publication: 32nd European Signal Processing Conference, EUSIPCO 2024 - Proceedings
Publisher: European Signal Processing Conference, EUSIPCO
Pages: 261-265
Number of pages: 5
ISBN (Electronic): 9789464593617
DOIs
Publication status: Published - 2024
Event: 32nd European Signal Processing Conference, EUSIPCO 2024 - Lyon, France
Duration: 26 Aug 2024 - 30 Aug 2024

Publication series

Name: European Signal Processing Conference
ISSN (Print): 2219-5491

Conference

Conference: 32nd European Signal Processing Conference, EUSIPCO 2024
Country/Territory: France
City: Lyon
Period: 26/08/24 - 30/08/24

Keywords

  • multimodal fusion
  • sound event localization and detection
