TY - JOUR
T1 - An improved event-independent network for polyphonic sound event localization and detection
AU - Cao, Yin
AU - Iqbal, Turab
AU - Kong, Qiuqiang
AU - An, Fengyan
AU - Wang, Wenwu
AU - Plumbley, Mark D.
N1 - Funding Information:
This work was supported in part by EPSRC Grants EP/P022529/1, EP/N014111/1 “Making Sense of Sounds”, EP/T019751/1 “AI for Sound”, National Natural Science Foundation of China (Grant No. 11804365), and EPSRC grant EP/N509772/1, “DTP 2016-2017 University of Surrey”.
Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
AB - Polyphonic sound event localization and detection (SELD), which jointly performs sound event detection (SED) and direction-of-arrival (DoA) estimation, detects the type and occurrence time of sound events as well as their corresponding DoA angles simultaneously. We study the SELD task from a multi-task learning perspective. Two open problems are addressed in this paper. Firstly, to detect overlapping sound events of the same type but with different DoAs, we propose to use a trackwise output format and solve the accompanying track permutation problem with permutation-invariant training. Multi-head self-attention is further used to separate tracks. Secondly, previous work found that, with hard parameter-sharing, SELD suffers from a performance loss compared with learning the subtasks separately. This is solved by a soft parameter-sharing scheme. We term the proposed method Event Independent Network V2 (EINV2), an improved version of our previously proposed method and an end-to-end network for SELD. We show that the proposed EINV2 for joint SED and DoA estimation outperforms previous methods by a large margin and performs comparably to state-of-the-art ensemble models.
KW - Direction of arrival
KW - Event-independent
KW - Multitask learning
KW - Permutation-invariant training
KW - Sound event localization and detection
UR - http://www.scopus.com/inward/record.url?scp=85109062273&partnerID=8YFLogxK
U2 - 10.1109/ICASSP39728.2021.9413473
DO - 10.1109/ICASSP39728.2021.9413473
M3 - Conference article
AN - SCOPUS:85109062273
SN - 1520-6149
VL - 2021-June
SP - 885
EP - 889
JO - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
JF - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -