Abstract
Aiming at the problem that scene classification accuracy can be degraded by interfering information when multiple events occur simultaneously in real-world scenes, a multimodal scene classification method based on the self-attention mechanism is proposed. First, audio features are extracted and the self-attention mechanism is used to identify the information that deserves attention; then, frames are sampled from the videos and image features are extracted with ResNet-50; finally, the features of the two modalities are concatenated and self-attention is applied again to capture the fused feature information for classification. Experimental results on the DCASE2021 Challenge Task 1B dataset show that, compared with the baseline system, simple concatenation of the bimodal information, and unimodal decision-level classification with video-assisted audio or audio-assisted video, the multimodal scene classification system based on the self-attention mechanism achieves better accuracy than the systems based on mutually assisted unimodal decisions.
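The following is a minimal sketch of the fusion pipeline described in the abstract, assuming PyTorch, frame-level audio features (e.g. log-mel), and a torchvision ResNet-50 visual backbone; the layer widths, attention heads, pooling, and classifier head are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class AudioVisualSceneClassifier(nn.Module):
    """Sketch: self-attention on audio, ResNet-50 on frames, concat, self-attention, classify."""

    def __init__(self, audio_dim=128, embed_dim=512, num_classes=10):
        super().__init__()
        # Project frame-level audio features to a common embedding width (assumed sizes).
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        # Self-attention over the audio sequence to emphasise informative frames.
        self.audio_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        # ResNet-50 backbone for video frames, classifier head removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.visual_backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.visual_proj = nn.Linear(2048, embed_dim)
        # Self-attention over the concatenated audio-visual sequence.
        self.fusion_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, audio_feats, video_frames):
        # audio_feats: (B, T_a, audio_dim); video_frames: (B, T_v, 3, H, W)
        a = self.audio_proj(audio_feats)
        a, _ = self.audio_attn(a, a, a)                        # attend within the audio sequence

        B, T_v = video_frames.shape[:2]
        v = self.visual_backbone(video_frames.flatten(0, 1))   # (B*T_v, 2048, 1, 1)
        v = self.visual_proj(v.flatten(1)).view(B, T_v, -1)    # (B, T_v, embed_dim)

        fused = torch.cat([a, v], dim=1)                       # concatenate the two modalities
        fused, _ = self.fusion_attn(fused, fused, fused)       # attend across modalities
        return self.classifier(fused.mean(dim=1))              # pool and classify the scene
```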
| Translated title of the contribution | Multimodal Scene Classification Based on Self-Attention Mechanism |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 46-52 |
| Number of pages | 7 |
| Journal | Journal of Fudan University (Natural Science) |
| Volume | 62 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - Feb 2023 |
| Externally published | Yes |
Keywords
- audio-visual scene classification
- auxiliary learning
- multimodal fusion
- self-attention