Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

Research output: Contribution to conference › Paper › peer-review

3 Citations (Scopus)


Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. Accurately recognizing such ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to aid the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
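The adaptive audio-visual fusion described in the abstract could be sketched roughly as follows: attend separately over the audio and visual context sequences, then use a learned gate to decide how much visual context to blend in. This is a minimal NumPy sketch under illustrative assumptions; the gating form, dimensions, and all names (`adaptive_av_fusion`, `w_gate`) are not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys, values):
    # Scaled dot-product attention of one query over a context sequence.
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores) @ values

def adaptive_av_fusion(decoder_state, audio_feats, visual_feats, w_gate, b_gate):
    # Attend separately over audio frames and video frames.
    a_ctx = cross_attention(decoder_state, audio_feats, audio_feats)
    v_ctx = cross_attention(decoder_state, visual_feats, visual_feats)
    # A learned sigmoid gate (illustrative assumption) scales the visual
    # context, so redundant visual information can be suppressed.
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([a_ctx, v_ctx]) @ w_gate + b_gate)))
    return a_ctx + gate * v_ctx

rng = np.random.default_rng(0)
d = 8
fused = adaptive_av_fusion(
    rng.normal(size=d),           # current decoder hidden state
    rng.normal(size=(10, d)),     # 10 audio frames
    rng.normal(size=(5, d)),      # 5 video frames
    rng.normal(size=2 * d), 0.0)  # gate parameters
print(fused.shape)
```

In this sketch the gate collapses toward zero when the visual context adds nothing beyond the audio context, which mirrors the stated goal of removing redundant information in the latent space.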
Original language: English
Number of pages: 5
Publication status: Published - 31 Aug 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 – 24 Aug 2023


Conference: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023


  • Audio captioning
  • attention mechanism
  • audio-visual learning
  • multimodal learning


