Language-based Audio Retrieval with Co-Attention Networks

Haoran Sun; Zimu Wang; Qiuyi Chen; Jianjun Chen; Jia Wang; Haiyang Zhang

doi:10.1109/SWC62898.2024.00251

Haoran Sun, Zimu Wang, Qiuyi Chen, Jianjun Chen, Jia Wang, Haiyang Zhang^*

^*Corresponding author for this work

Xi'an Jiaotong-Liverpool University

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

Abstract

In recent years, user-generated audio content has proliferated across various media platforms, creating a growing need for efficient retrieval methods that allow users to search for audio clips using natural language queries. This task, known as language-based audio retrieval, presents significant challenges due to the complexity of learning semantic representations from heterogeneous data across both text and audio modalities. In this work, we introduce a novel framework for the language-based audio retrieval task that leverages a co-attention mechanism to jointly learn meaningful representations from both modalities. To enhance the model's ability to capture fine-grained cross-modal interactions, we propose a cascaded co-attention architecture, where co-attention modules are stacked or iterated to progressively refine the semantic alignment between text and audio. Experiments conducted on two public datasets show that the proposed method can achieve better performance than the state-of-the-art method. Specifically, our best-performing co-attention model achieves a 16.6% improvement in mean Average Precision on the Clotho dataset, and a 15.1% improvement on AudioCaps.
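The cascaded idea described in the abstract can be illustrated with a minimal, parameter-free sketch. It is not the authors' model: the abstract gives no layer details, so the scaled dot-product scoring, the residual update, the mean pooling, the cosine ranking score, and every function name and toy vector below are assumptions chosen only to show how stacked co-attention lets each modality repeatedly attend to the other.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys):
    """For each query vector, return a softmax-weighted sum of the key vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * k[i] for w, k in zip(weights, keys)) for i in range(d)])
    return out


def co_attention_layer(text, audio):
    """One co-attention step (assumed form): text attends to audio and audio
    attends to text, each added back to its input as a residual."""
    text_ctx = attend(text, audio)
    audio_ctx = attend(audio, text)
    new_text = [[t + c for t, c in zip(tv, cv)] for tv, cv in zip(text, text_ctx)]
    new_audio = [[a + c for a, c in zip(av, cv)] for av, cv in zip(audio, audio_ctx)]
    return new_text, new_audio


def cascaded_co_attention(text, audio, num_layers=2):
    """Stack the layer num_layers times so the alignment is refined step by step."""
    for _ in range(num_layers):
        text, audio = co_attention_layer(text, audio)
    return text, audio


def mean_pool(seq):
    """Average a sequence of vectors into a single vector."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical toy features: 2 text tokens and 3 audio frames, dimension 3.
text = [[1.0, 0.0, 0.5], [0.2, 0.8, 0.1]]
audio = [[0.9, 0.1, 0.4], [0.0, 1.0, 0.0], [0.3, 0.3, 0.3]]

refined_text, refined_audio = cascaded_co_attention(text, audio, num_layers=2)
score = cosine(mean_pool(refined_text), mean_pool(refined_audio))
print(f"query-clip similarity: {score:.3f}")
```

In a retrieval setting, a score like this would be computed between one text query and every candidate clip, and the clips ranked by it; a trained model would add learned projections in place of the raw dot products used here.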

Original language	English
Title of host publication	Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	1633-1638
Number of pages	6
ISBN (Electronic)	9798331520861
DOIs	https://doi.org/10.1109/SWC62898.2024.00251
Publication status	Published - 2024
Event	10th IEEE Smart World Congress, SWC 2024 - Nadi, Fiji Duration: 2 Dec 2024 → 7 Dec 2024

Publication series

Name	Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications

Conference

Conference	10th IEEE Smart World Congress, SWC 2024
Country/Territory	Fiji
City	Nadi
Period	2/12/24 → 7/12/24

Keywords

co-attention mechanism
information retrieval
machine learning
text-audio retrieval

Access to Document

10.1109/SWC62898.2024.00251

Cite this

Sun, H., Wang, Z., Chen, Q., Chen, J., Wang, J., & Zhang, H. (2024). Language-based Audio Retrieval with Co-Attention Networks. In Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications (pp. 1633-1638). (Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SWC62898.2024.00251

Sun, Haoran ; Wang, Zimu ; Chen, Qiuyi et al. / Language-based Audio Retrieval with Co-Attention Networks. Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications. Institute of Electrical and Electronics Engineers Inc., 2024. pp. 1633-1638 (Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications).

@inproceedings{70d236c248d7482f957fd3241f57c7c4,

title = "Language-based Audio Retrieval with Co-Attention Networks",

abstract = "In recent years, user-generated audio content has proliferated across various media platforms, creating a growing need for efficient retrieval methods that allow users to search for audio clips using natural language queries. This task, known as language-based audio retrieval, presents significant challenges due to the complexity of learning semantic representations from heterogeneous data across both text and audio modalities. In this work, we introduce a novel framework for the language-based audio retrieval task that leverages a co-attention mechanism to jointly learn meaningful representations from both modalities. To enhance the model's ability to capture fine-grained cross-modal interactions, we propose a cascaded co-attention architecture, where co-attention modules are stacked or iterated to progressively refine the semantic alignment between text and audio. Experiments conducted on two public datasets show that the proposed method can achieve better performance than the state-of-the-art method. Specifically, our best-performing co-attention model achieves a 16.6% improvement in mean Average Precision on the Clotho dataset, and a 15.1% improvement on AudioCaps.",

keywords = "co-attention mechanism, information retrieval, machine learning, text-audio retrieval",

author = "Haoran Sun and Zimu Wang and Qiuyi Chen and Jianjun Chen and Jia Wang and Haiyang Zhang",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 10th IEEE Smart World Congress, SWC 2024 ; Conference date: 02-12-2024 Through 07-12-2024",

year = "2024",

doi = "10.1109/SWC62898.2024.00251",

language = "English",

series = "Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "1633--1638",

booktitle = "Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications",

}

Sun, H, Wang, Z, Chen, Q, Chen, J, Wang, J & Zhang, H 2024, Language-based Audio Retrieval with Co-Attention Networks. in Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications. Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications, Institute of Electrical and Electronics Engineers Inc., pp. 1633-1638, 10th IEEE Smart World Congress, SWC 2024, Nadi, Fiji, 2/12/24. https://doi.org/10.1109/SWC62898.2024.00251

Language-based Audio Retrieval with Co-Attention Networks. / Sun, Haoran; Wang, Zimu; Chen, Qiuyi et al.
Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications. Institute of Electrical and Electronics Engineers Inc., 2024. p. 1633-1638 (Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications).

TY - GEN

T1 - Language-based Audio Retrieval with Co-Attention Networks

AU - Sun, Haoran

AU - Wang, Zimu

AU - Chen, Qiuyi

AU - Chen, Jianjun

AU - Wang, Jia

AU - Zhang, Haiyang

PY - 2024

Y1 - 2024

N2 - In recent years, user-generated audio content has proliferated across various media platforms, creating a growing need for efficient retrieval methods that allow users to search for audio clips using natural language queries. This task, known as language-based audio retrieval, presents significant challenges due to the complexity of learning semantic representations from heterogeneous data across both text and audio modalities. In this work, we introduce a novel framework for the language-based audio retrieval task that leverages a co-attention mechanism to jointly learn meaningful representations from both modalities. To enhance the model's ability to capture fine-grained cross-modal interactions, we propose a cascaded co-attention architecture, where co-attention modules are stacked or iterated to progressively refine the semantic alignment between text and audio. Experiments conducted on two public datasets show that the proposed method can achieve better performance than the state-of-the-art method. Specifically, our best-performing co-attention model achieves a 16.6% improvement in mean Average Precision on the Clotho dataset, and a 15.1% improvement on AudioCaps.

AB - In recent years, user-generated audio content has proliferated across various media platforms, creating a growing need for efficient retrieval methods that allow users to search for audio clips using natural language queries. This task, known as language-based audio retrieval, presents significant challenges due to the complexity of learning semantic representations from heterogeneous data across both text and audio modalities. In this work, we introduce a novel framework for the language-based audio retrieval task that leverages a co-attention mechanism to jointly learn meaningful representations from both modalities. To enhance the model's ability to capture fine-grained cross-modal interactions, we propose a cascaded co-attention architecture, where co-attention modules are stacked or iterated to progressively refine the semantic alignment between text and audio. Experiments conducted on two public datasets show that the proposed method can achieve better performance than the state-of-the-art method. Specifically, our best-performing co-attention model achieves a 16.6% improvement in mean Average Precision on the Clotho dataset, and a 15.1% improvement on AudioCaps.

KW - co-attention mechanism

KW - information retrieval

KW - machine learning

KW - text-audio retrieval

UR - http://www.scopus.com/inward/record.url?scp=105002248958&partnerID=8YFLogxK

U2 - 10.1109/SWC62898.2024.00251

DO - 10.1109/SWC62898.2024.00251

M3 - Conference Proceeding

AN - SCOPUS:105002248958

T3 - Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications

SP - 1633

EP - 1638

BT - Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 10th IEEE Smart World Congress, SWC 2024

Y2 - 2 December 2024 through 7 December 2024

ER -

Sun H, Wang Z, Chen Q, Chen J, Wang J, Zhang H. Language-based Audio Retrieval with Co-Attention Networks. In Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications. Institute of Electrical and Electronics Engineers Inc. 2024. p. 1633-1638. (Proceedings - 2024 IEEE Smart World Congress, SWC 2024 - 2024 IEEE Ubiquitous Intelligence and Computing, Autonomous and Trusted Computing, Digital Twin, Metaverse, Privacy Computing and Data Security, Scalable Computing and Communications). doi: 10.1109/SWC62898.2024.00251
