TY - CONF
T1 - Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
AU - Liu, Xubo
AU - Huang, Qiushi
AU - Mei, Xinhao
AU - Liu, Haohe
AU - Kong, Qiuqiang
AU - Sun, Jianyuan
AU - Li, Shengchen
AU - Ko, Tom
AU - Zhang, Yu
AU - Tang, Lilian H.
AU - Plumbley, Mark D.
AU - Kılıç, Volkan
AU - Wang, Wenwu
N1 - Funding Information:
This work is partly supported by UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 “AI for Sound”, a Newton Institutional Links Award from the British Council, titled “Automated Captioning of Image and Audio for Visually and Hearing Impaired” (Grant number 623805725), British Broadcasting Corporation Research and Development (BBC R&D), a PhD scholarship from the University of Surrey, and a Research Scholarship from the China Scholarship Council (CSC). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.
PY - 2023/8/31
Y1 - 2023/8/31
AB - Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
KW - Audio captioning
KW - attention mechanism
KW - audio-visual learning
KW - multimodal learning
UR - http://www.scopus.com/inward/record.url?scp=85168726157&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-914
DO - 10.21437/Interspeech.2023-914
M3 - Paper
AN - SCOPUS:85168726157
SP - 2838
EP - 2842
T2 - 24th Annual Conference of the International Speech Communication Association, Interspeech 2023
Y2 - 20 August 2023 through 24 August 2023
ER -