基于自注意力机制的多模态场景分类

Translated title of the contribution: Multimodal Scene Classification Based on Self-Attention Mechanism

Yue Chang, Yuanbo Hou, Yizhou Tan, Shengchen Li, Xi Shao*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Aiming at the problem that scene classification accuracy is degraded by interference from multiple events occurring simultaneously in real environment scenes, a multimodal scene classification method based on the self-attention mechanism is proposed. First, audio features are extracted and a self-attention mechanism selects the information that deserves attention; then, videos are sampled frame by frame and image features are extracted through ResNet-50; finally, the features of the two modalities are concatenated and self-attention is applied again to capture the fused feature information for classification. Experimental results on the DCASE2021 Challenge Task 1B dataset show that, compared with the baseline system, the simple concatenation of bimodal information, and unimodal decision classification with video-assisted audio and audio-assisted video, the proposed self-attention-based multimodal scene classification system achieves higher accuracy than the systems based on mutually assisted unimodal decisions.
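The pipeline described in the abstract (per-modality features, self-attention within the audio stream, concatenation with ResNet-50 image features, then a second self-attention stage before classification) can be sketched with plain NumPy. This is a minimal illustration under assumed shapes and randomly initialized projection matrices, not the authors' implementation; the feature dimensions, frame counts, and weight names here are hypothetical.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of feature vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over the key dimension, numerically stabilized.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8                               # hypothetical per-modality feature dimension
audio = rng.normal(size=(4, d))     # 4 audio feature frames (assumed)
video = rng.normal(size=(4, d))     # 4 video-frame embeddings, e.g. from ResNet-50 (assumed)

# Stage 1: self-attention over the audio features to pick out salient information.
w_a = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
audio_att = self_attention(audio, *w_a)

# Stage 2: concatenate the two modalities' features along the feature axis.
fused = np.concatenate([audio_att, video], axis=-1)   # shape (4, 2*d)

# Stage 3: self-attention over the fused features; a classifier head would follow.
w_f = [rng.normal(size=(2 * d, 2 * d)) * 0.1 for _ in range(3)]
out = self_attention(fused, *w_f)
print(out.shape)  # (4, 16)
```

In practice the projections would be learned and a pooled softmax classifier would map `out` to scene labels; the sketch only shows the attend-concatenate-attend fusion order the abstract describes.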

Original language: Chinese (Traditional)
Pages (from-to): 46-52
Number of pages: 7
Journal: Journal of Fudan University (Natural Science)
Volume: 62
Issue number: 1
DOIs
Publication status: Published - Feb 2023
Externally published: Yes

Keywords

  • audio-visual scene classification
  • auxiliary learning
  • multimodal fusion
  • self-attention

