Abstract
To address the low accuracy of multimodal scene classification, this paper proposes a multimodal scene classification method assisted by a mutual encoder. First, the audio branch extracts features from the input audio and applies a self-attention mechanism to obtain attention information, while the image branch extracts frame images from the video and then extracts features with ResNet50. Second, the extracted dual-modal information is fed into the mutual encoder, which performs feature fusion by extracting the hidden-layer features of each modality. The fused features are then combined with the attention mechanism to assist the video features; in this model, the mutual encoder serves as an auxiliary system for feature fusion. Experiments on the DCASE2021 Challenge Task 1B dataset show that the mutual encoder improves classification accuracy.
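The pipeline described in the abstract can be sketched as follows. This is an illustrative NumPy reconstruction only, not the paper's implementation: the layer sizes, the random stand-in features (in place of real audio features and ResNet50 outputs), the tanh encoders, and the elementwise fusion are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d_k=64):
    """Scaled dot-product self-attention over audio frames x of shape (T, d)."""
    d = x.shape[1]
    Wq, Wk, Wv = (rng.normal(size=(d, d_k)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k))   # (T, T) attention weights
    return attn @ v                          # (T, d_k) attended features

# Stand-in features (hypothetical sizes; real ResNet50 pooled features are 2048-d)
T, d_audio, d_img, d_hidden, n_classes = 10, 128, 2048, 64, 3
audio_feat = rng.normal(size=(T, d_audio))   # per-frame audio features
image_feat = rng.normal(size=(d_img,))       # ResNet50 global frame feature

# Audio branch: self-attention, then mean-pool over time
audio_vec = self_attention(audio_feat).mean(axis=0)           # (64,)

# "Mutual encoder": project each modality into a shared hidden space
W_enc_a = rng.normal(size=(audio_vec.size, d_hidden)) / np.sqrt(audio_vec.size)
W_enc_i = rng.normal(size=(d_img, d_hidden)) / np.sqrt(d_img)
h_audio = np.tanh(audio_vec @ W_enc_a)
h_image = np.tanh(image_feat @ W_enc_i)

# Fuse hidden-layer features; the fused vector assists the video features
fused = np.concatenate([h_audio * h_image, h_image])          # (128,)
W_cls = rng.normal(size=(fused.size, n_classes)) / np.sqrt(fused.size)
scores = softmax(fused @ W_cls)              # per-class probabilities
```

In the actual system the encoders would be trained (the keywords suggest a variational autoencoder) rather than randomly initialized; this sketch only shows where each modality's hidden features enter the fusion.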
| Translated title of the contribution | Multimodal scene classification for encoder-assisted videos |
| --- | --- |
| Original language | Chinese (Traditional) |
| Pages (from-to) | 104-110 |
| Number of pages | 7 |
| Journal | Nanjing Youdian Daxue Xuebao (Ziran Kexue Ban)/Journal of Nanjing University of Posts and Telecommunications (Natural Science) |
| Volume | 43 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - Feb 2023 |
Keywords
- audio-visual scene classification
- encoder
- multimodal learning
- self-attention mechanism
- variational autoencoder