TY - GEN
T1 - Language-based Audio Retrieval with GPT-Augmented Captions and Self-Attended Audio Clips
AU - Gu, Fuyu
AU - Gu, Yang
AU - Xu, Yiyan
AU - Sun, Haoran
AU - Pan, Yushan
AU - Li, Shengchen
AU - Zhang, Haiyang
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - With the explosion of user-generated content in recent years, efficient methods for organizing multimedia databases by content and retrieving relevant items have become essential. Language-based audio retrieval seeks to find relevant audio clips given natural language queries. However, datasets developed specifically for this task are scarce. Moreover, the language annotations often carry biases, leading to unsatisfactory retrieval accuracy. In this work, we propose a novel framework for language-based audio retrieval that aims to: 1) utilize GPT-generated text to augment audio captions, thereby improving language diversity; and 2) employ audio self-attention mechanisms to capture intricate acoustic features and temporal dependencies. Experiments conducted on two public datasets, containing both short- and long-duration audio clips, demonstrate that our framework achieves significant performance improvements over other methods. Specifically, the proposed framework achieves a 27% increase in mean average precision (mAP) on the Clotho dataset and a 31% improvement in mAP on the AudioCaps dataset compared with the baseline.
KW - contrastive learning
KW - information retrieval
KW - machine learning
KW - text-audio retrieval
UR - http://www.scopus.com/inward/record.url?scp=85199094230&partnerID=8YFLogxK
U2 - 10.1109/CSCWD61410.2024.10580534
DO - 10.1109/CSCWD61410.2024.10580534
M3 - Conference Proceeding
AN - SCOPUS:85199094230
T3 - International Conference on Computer Supported Cooperative Work in Design (CSCWD)
SP - 858
EP - 863
BT - Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024
A2 - Shen, Weiming
A2 - Barthes, Jean-Paul
A2 - Luo, Junzhou
A2 - Qiu, Tie
A2 - Zhou, Xiaobo
A2 - Zhang, Jinghui
A2 - Zhu, Haibin
A2 - Peng, Kunkun
A2 - Xu, Tianyi
A2 - Chen, Ning
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024
Y2 - 8 May 2024 through 10 May 2024
ER -