Language-based Audio Retrieval with GPT-Augmented Captions and Self-Attended Audio Clips

Fuyu Gu, Yang Gu, Yiyan Xu, Haoran Sun, Yushan Pan, Shengchen Li, Haiyang Zhang*

*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

Abstract

With the explosion of user-generated content in recent years, efficient methods for organizing multimedia databases by content and retrieving relevant items have become essential. Language-based audio retrieval seeks to find relevant audio clips given natural language queries. However, datasets developed specifically for this task are scarce, and their language annotations often carry biases, leading to unsatisfactory retrieval accuracy. In this work, we propose a novel framework for language-based audio retrieval that aims to: 1) use GPT-generated text to augment audio captions, thereby improving language diversity; and 2) employ audio self-attention mechanisms to capture intricate acoustic features and temporal dependencies. Experiments on two public datasets, containing both short- and long-duration audio clips, demonstrate that our framework achieves significant performance improvements over other methods. Specifically, compared with the baseline, the proposed framework achieves a 27% increase in mean average precision (mAP) on the Clotho dataset and a 31% improvement in mAP on the AudioCaps dataset.
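For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean average precision can be computed for text-to-audio retrieval. It is not the authors' evaluation code: the function names, the use of string clip IDs, and the choice to score the full ranking with no cutoff are all assumptions made for illustration. The abstract does not state whether a rank cutoff is applied.

```python
from typing import Dict, Sequence, Set


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Average precision for one query over its full ranked list (no cutoff assumed)."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, clip_id in enumerate(ranked_ids, start=1):
        if clip_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a relevant clip appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(
    rankings: Dict[str, Sequence[str]],
    relevance: Dict[str, Set[str]],
) -> float:
    """Mean of per-query average precision; both dicts are keyed by query text."""
    if not rankings:
        return 0.0
    scores = [
        average_precision(ranked, relevance.get(query, set()))
        for query, ranked in rankings.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two captions, each with a ranked list of clip IDs.
toy_rankings = {
    "a dog barks twice": ["clip_a", "clip_b", "clip_c"],
    "rain on a tin roof": ["clip_b", "clip_a"],
}
toy_relevance = {
    "a dog barks twice": {"clip_a", "clip_c"},
    "rain on a tin roof": {"clip_a"},
}
toy_map = mean_average_precision(toy_rankings, toy_relevance)
```

In the toy data, the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so `toy_map` is their mean. A percentage gain such as those reported in the abstract would be a comparison between two such scores, one for the baseline and one for the proposed framework.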

Original language: English
Title of host publication: Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024
Editors: Weiming Shen, Jean-Paul Barthes, Junzhou Luo, Tie Qiu, Xiaobo Zhou, Jinghui Zhang, Haibin Zhu, Kunkun Peng, Tianyi Xu, Ning Chen
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 858-863
Number of pages: 6
ISBN (Electronic): 9798350349184
Publication status: Published - 2024
Event: 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024 - Tianjin, China
Duration: 8 May 2024 to 10 May 2024

Publication series

Name: International Conference on Computer Supported Cooperative Work in Design (CSCWD)

Conference

Conference: 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024
Country/Territory: China
City: Tianjin
Period: 8/05/24 to 10/05/24

Keywords

  • contrastive learning
  • information retrieval
  • machine learning
  • text-audio retrieval
