互编码器辅助视频的多模态场景分类

Translated title of the contribution: Multimodal scene classification for encoder-assisted videos

Tianyang Huang, Yuanbo Hou, Shengchen Li, Xi Shao*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

To address the low accuracy of multimodal scene classification, this paper proposes a multimodal scene classification method assisted by a mutual encoder. First, the audio branch extracts features from the input audio and applies a self-attention mechanism to obtain attention information, while the image branch extracts frames from the video and computes their features with ResNet50. Second, the extracted dual-modality information is fed into the mutual encoder, which performs feature fusion by extracting the hidden-layer features of each modality. The fused features are then combined with the attention mechanism to assist the video features. In this model, the mutual encoder serves as an auxiliary system for feature fusion. Experiments on the DCASE2021 Challenge Task 1B dataset show that the mutual encoder improves classification accuracy.
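The abstract does not give layer sizes or the exact fusion rule, so the following is only a minimal sketch of the two-branch pipeline it describes. All dimensions, the additive hidden-layer fusion, the one-frame-per-clip sampling, and the class name `MutualEncoderFusion` are illustrative assumptions, not the paper's implementation; only the 10 output classes match the DCASE2021 Task 1B scene labels.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MutualEncoderFusion(nn.Module):
    """Hypothetical sketch of the mutual-encoder fusion from the abstract.

    Audio features are refined with self-attention; visual features come
    from ResNet50 applied to an extracted video frame. The mutual encoder
    projects each modality into a shared hidden space and fuses the
    hidden-layer features. Every dimension below is an assumption.
    """

    def __init__(self, audio_dim=128, hidden_dim=256, num_classes=10):
        super().__init__()
        # Audio branch: self-attention over a sequence of audio feature frames.
        self.audio_attn = nn.MultiheadAttention(audio_dim, num_heads=4,
                                                batch_first=True)
        # Visual branch: ResNet50 backbone with the final fc layer removed,
        # yielding 2048-d pooled features per frame.
        backbone = resnet50(weights=None)
        self.visual = nn.Sequential(*list(backbone.children())[:-1])
        # Mutual encoder: one projection per modality into a shared space.
        self.audio_enc = nn.Linear(audio_dim, hidden_dim)
        self.visual_enc = nn.Linear(2048, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feats, frames):
        # audio_feats: (B, T, audio_dim), e.g. log-mel frame features.
        attn_out, _ = self.audio_attn(audio_feats, audio_feats, audio_feats)
        a = attn_out.mean(dim=1)                    # (B, audio_dim)
        # frames: (B, 3, H, W), one representative frame per clip.
        v = self.visual(frames).flatten(1)          # (B, 2048)
        # Fuse the two hidden-layer representations; a simple sum stands in
        # for the paper's unspecified fusion rule.
        h = self.audio_enc(a) + self.visual_enc(v)  # (B, hidden_dim)
        return self.classifier(h)

model = MutualEncoderFusion()
logits = model(torch.randn(2, 100, 128), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```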

Original language: Chinese (Traditional)
Pages (from-to): 104-110
Number of pages: 7
Journal: Nanjing Youdian Daxue Xuebao (Ziran Kexue Ban)/Journal of Nanjing University of Posts and Telecommunications (Natural Science)
Volume: 43
Issue number: 1
DOIs
Publication status: Published - Feb 2023

Keywords

  • audio-visual scene classification
  • encoder
  • multimodal learning
  • self-attention mechanism
  • variational autoencoder
