Abstract
Aiming at the problem that scene classification accuracy can be degraded by interfering information when multiple events occur simultaneously in real-world scenes, a multimodal scene classification method based on the self-attention mechanism is proposed. First, audio features are extracted and the self-attention mechanism is used to identify the information that deserves attention; then, frames are sampled from the videos and image features are extracted with ResNet-50; finally, the features of the two modalities are concatenated and self-attention is applied again to capture the fused feature information for classification. Experimental results on the DCASE2021 Challenge Task 1B dataset show that, compared with the baseline system, simple concatenation of the bimodal information, and unimodal decision-level classification with video-assisted audio or audio-assisted video, the multimodal scene classification system based on the self-attention mechanism achieves better accuracy than the systems based on mutually assisted unimodal decisions.
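The following is a minimal sketch of the fusion pipeline described in the abstract, assuming PyTorch, frame-level audio features (e.g. log-mel), and a torchvision ResNet-50 visual backbone; the layer widths, attention heads, pooling, and classifier head are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class AudioVisualSceneClassifier(nn.Module):
    """Sketch: self-attention on audio, ResNet-50 on frames, concat, self-attention, classify."""

    def __init__(self, audio_dim=128, embed_dim=512, num_classes=10):
        super().__init__()
        # Project frame-level audio features to a common embedding width (assumed sizes).
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        # Self-attention over the audio sequence to emphasise informative frames.
        self.audio_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        # ResNet-50 backbone for video frames, classifier head removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.visual_backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.visual_proj = nn.Linear(2048, embed_dim)
        # Self-attention over the concatenated audio-visual sequence.
        self.fusion_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, audio_feats, video_frames):
        # audio_feats: (B, T_a, audio_dim); video_frames: (B, T_v, 3, H, W)
        a = self.audio_proj(audio_feats)
        a, _ = self.audio_attn(a, a, a)                        # attend within the audio sequence

        B, T_v = video_frames.shape[:2]
        v = self.visual_backbone(video_frames.flatten(0, 1))   # (B*T_v, 2048, 1, 1)
        v = self.visual_proj(v.flatten(1)).view(B, T_v, -1)    # (B, T_v, embed_dim)

        fused = torch.cat([a, v], dim=1)                       # concatenate the two modalities
        fused, _ = self.fusion_attn(fused, fused, fused)       # attend across modalities
        return self.classifier(fused.mean(dim=1))              # pool and classify the scene
```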
| Translated title of the contribution | Multimodal Scene Classification Based on Self-Attention Mechanism |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 46-52 |
| Number of pages | 7 |
| Journal | Journal of Fudan University (Natural Science) |
| Volume | 62 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - Feb 2023 |
| Externally published | Yes |
Keywords
- audio-visual scene classification
- auxiliary learning
- multimodal fusion
- self-attention