Abstract
To address the low accuracy of multimodal scene classification, this paper proposes a multimodal scene classification method assisted by a mutual encoder. First, the audio branch extracts features from the input audio and applies a self-attention mechanism to obtain attention information, while the image branch extracts frame images from the video and then extracts features with ResNet50. Second, the extracted dual-modal information is fed into the mutual encoder, which performs feature fusion by extracting the hidden-layer features of each modality. The fused features are then combined with the attention mechanism to assist the video features; in this model, the mutual encoder serves as an auxiliary system for feature fusion. Experiments on the DCASE2021 Challenge Task 1B dataset show that the mutual encoder improves classification accuracy.
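The pipeline described in the abstract can be sketched as follows. This is an illustrative NumPy reconstruction only, not the paper's implementation: the layer sizes, the random stand-in features (in place of real audio features and ResNet50 outputs), the tanh encoders, and the elementwise fusion are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d_k=64):
    """Scaled dot-product self-attention over audio frames x of shape (T, d)."""
    d = x.shape[1]
    Wq, Wk, Wv = (rng.normal(size=(d, d_k)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k))   # (T, T) attention weights
    return attn @ v                          # (T, d_k) attended features

# Stand-in features (hypothetical sizes; real ResNet50 pooled features are 2048-d)
T, d_audio, d_img, d_hidden, n_classes = 10, 128, 2048, 64, 3
audio_feat = rng.normal(size=(T, d_audio))   # per-frame audio features
image_feat = rng.normal(size=(d_img,))       # ResNet50 global frame feature

# Audio branch: self-attention, then mean-pool over time
audio_vec = self_attention(audio_feat).mean(axis=0)           # (64,)

# "Mutual encoder": project each modality into a shared hidden space
W_enc_a = rng.normal(size=(audio_vec.size, d_hidden)) / np.sqrt(audio_vec.size)
W_enc_i = rng.normal(size=(d_img, d_hidden)) / np.sqrt(d_img)
h_audio = np.tanh(audio_vec @ W_enc_a)
h_image = np.tanh(image_feat @ W_enc_i)

# Fuse hidden-layer features; the fused vector assists the video features
fused = np.concatenate([h_audio * h_image, h_image])          # (128,)
W_cls = rng.normal(size=(fused.size, n_classes)) / np.sqrt(fused.size)
scores = softmax(fused @ W_cls)              # per-class probabilities
```

In the actual system the encoders would be trained (the keywords suggest a variational autoencoder) rather than randomly initialized; this sketch only shows where each modality's hidden features enter the fusion.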
| Translated title of the contribution | Multimodal scene classification for encoder-assisted videos |
| --- | --- |
| Original language | Chinese (Traditional) |
| Pages (from-to) | 104-110 |
| Number of pages | 7 |
| Journal | Nanjing Youdian Daxue Xuebao (Ziran Kexue Ban)/Journal of Nanjing University of Posts and Telecommunications (Natural Science) |
| Volume | 43 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - Feb 2023 |
Keywords
- audio-visual scene classification
- encoder
- multimodal learning
- self-attention mechanism
- variational autoencoder