Text-Queried Target Sound Event Localization

Jinzheng Zhao; Xinyuan Qian; Yong Xu; Haohe Liu; Yin Cao; Davide Berghi; Wenwu Wang

doi:10.23919/eusipco63174.2024.10715199

Jinzheng Zhao^*, Xinyuan Qian, Yong Xu, Haohe Liu^*, Yin Cao, Davide Berghi^*, Wenwu Wang^*

^*Corresponding author for this work

Department of Intelligent Science

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

1 Citation (Scopus)

Abstract

Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the text to describe the sound event, and the SEL model can predict the location of the related sound event. The proposed task presents a more user-friendly way for human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created by simulated room impulse response (RIR) and real RIR to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire the interest and additional research for text-queried sound source localization.

Original language	English
Title of host publication	32nd European Signal Processing Conference, EUSIPCO 2024 - Proceedings
Publisher	European Signal Processing Conference, EUSIPCO
Pages	261-265
Number of pages	5
ISBN (Electronic)	9789464593617
DOIs	https://doi.org/10.23919/eusipco63174.2024.10715199
Publication status	Published - 2024
Event	32nd European Signal Processing Conference, EUSIPCO 2024 - Lyon, France Duration: 26 Aug 2024 → 30 Aug 2024

Publication series

Name	European Signal Processing Conference
ISSN (Print)	2219-5491

Conference

Conference	32nd European Signal Processing Conference, EUSIPCO 2024
Country/Territory	France
City	Lyon
Period	26/08/24 → 30/08/24

Keywords

multimodal fusion
sound event localization and detection

Access to Document

10.23919/eusipco63174.2024.10715199

Cite this

@inproceedings{5c4193669e7f4009ace5ce7f0a8d60b6,
  title = "Text-Queried Target Sound Event Localization",
  abstract = "Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the text to describe the sound event, and the SEL model can predict the location of the related sound event. The proposed task presents a more user-friendly way for human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created by simulated room impulse response (RIR) and real RIR to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire the interest and additional research for text-queried sound source localization.",
  keywords = "multimodal fusion, sound event localization and detection",
  author = "Jinzheng Zhao and Xinyuan Qian and Yong Xu and Haohe Liu and Yin Cao and Davide Berghi and Wenwu Wang",
  note = "Publisher Copyright: {\textcopyright} 2024 European Signal Processing Conference, EUSIPCO. All rights reserved.; 32nd European Signal Processing Conference, EUSIPCO 2024 ; Conference date: 26-08-2024 Through 30-08-2024",
  year = "2024",
  doi = "10.23919/eusipco63174.2024.10715199",
  language = "English",
  series = "European Signal Processing Conference",
  publisher = "European Signal Processing Conference, EUSIPCO",
  pages = "261--265",
  booktitle = "32nd European Signal Processing Conference, EUSIPCO 2024 - Proceedings",
}

Zhao, J, Qian, X, Xu, Y, Liu, H, Cao, Y, Berghi, D & Wang, W 2024, Text-Queried Target Sound Event Localization. in 32nd European Signal Processing Conference, EUSIPCO 2024 - Proceedings. European Signal Processing Conference, European Signal Processing Conference, EUSIPCO, pp. 261-265, 32nd European Signal Processing Conference, EUSIPCO 2024, Lyon, France, 26/08/24. https://doi.org/10.23919/eusipco63174.2024.10715199

Text-Queried Target Sound Event Localization. / Zhao, Jinzheng; Qian, Xinyuan; Xu, Yong et al.
32nd European Signal Processing Conference, EUSIPCO 2024 - Proceedings. European Signal Processing Conference, EUSIPCO, 2024. p. 261-265 (European Signal Processing Conference).

TY - GEN

T1 - Text-Queried Target Sound Event Localization

AU - Zhao, Jinzheng

AU - Qian, Xinyuan

AU - Xu, Yong

AU - Liu, Haohe

AU - Cao, Yin

AU - Berghi, Davide

AU - Wang, Wenwu

PY - 2024

Y1 - 2024

N2 - Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the text to describe the sound event, and the SEL model can predict the location of the related sound event. The proposed task presents a more user-friendly way for human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created by simulated room impulse response (RIR) and real RIR to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire the interest and additional research for text-queried sound source localization.

AB - Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the text to describe the sound event, and the SEL model can predict the location of the related sound event. The proposed task presents a more user-friendly way for human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created by simulated room impulse response (RIR) and real RIR to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire the interest and additional research for text-queried sound source localization.

KW - multimodal fusion

KW - sound event localization and detection

UR - http://www.scopus.com/inward/record.url?scp=85208424553&partnerID=8YFLogxK

U2 - 10.23919/eusipco63174.2024.10715199

DO - 10.23919/eusipco63174.2024.10715199

M3 - Conference Proceeding

AN - SCOPUS:85208424553

T3 - European Signal Processing Conference

SP - 261

EP - 265

BT - 32nd European Signal Processing Conference, EUSIPCO 2024 - Proceedings

PB - European Signal Processing Conference, EUSIPCO

T2 - 32nd European Signal Processing Conference, EUSIPCO 2024

Y2 - 26 August 2024 through 30 August 2024

ER -
