TY - JOUR
T1 - VLP2MSA
T2 - Expanding vision-language pre-training to multimodal sentiment analysis
AU - Yi, Guofeng
AU - Fan, Cunhang
AU - Zhu, Kang
AU - Lv, Zhao
AU - Liang, Shan
AU - Wen, Zhengqi
AU - Pei, Guanxiong
AU - Li, Taihao
AU - Tao, Jianhua
N1 - Publisher Copyright:
© 2023 Elsevier B.V.
PY - 2024/1/11
Y1 - 2024/1/11
N2 - Large-scale vision-and-language representation learning has improved performance on various joint vision-language downstream tasks. In this work, our objective is to extend it effectively to multimodal sentiment analysis and to address two urgent challenges in this field: (1) the low contribution of the visual modality and (2) the design of an effective multimodal fusion architecture. To overcome the imbalance between the visual and textual modalities, we propose an inter-frame hybrid transformer that extends the recent CLIP and TimeSformer architectures. This module extracts spatio-temporal features from sparsely sampled video frames, capturing not only facial expressions but also body movement information, and thus provides a more comprehensive visual representation than the traditional direct use of pre-extracted facial features. Additionally, we tackle modality heterogeneity in the fusion architecture by introducing a new scheme that prompts and aligns the video and text information before fusing them. Specifically, we generate discriminative text prompts based on the video content to enhance the text representation, and we align the unimodal video-text features with a video-text contrastive loss. Our proposed end-to-end trainable model achieves state-of-the-art performance using only two modalities on three widely used datasets: MOSI, MOSEI, and CH-SIMS. These experimental results validate the effectiveness of our approach for multimodal sentiment analysis.
AB - Large-scale vision-and-language representation learning has improved performance on various joint vision-language downstream tasks. In this work, our objective is to extend it effectively to multimodal sentiment analysis and to address two urgent challenges in this field: (1) the low contribution of the visual modality and (2) the design of an effective multimodal fusion architecture. To overcome the imbalance between the visual and textual modalities, we propose an inter-frame hybrid transformer that extends the recent CLIP and TimeSformer architectures. This module extracts spatio-temporal features from sparsely sampled video frames, capturing not only facial expressions but also body movement information, and thus provides a more comprehensive visual representation than the traditional direct use of pre-extracted facial features. Additionally, we tackle modality heterogeneity in the fusion architecture by introducing a new scheme that prompts and aligns the video and text information before fusing them. Specifically, we generate discriminative text prompts based on the video content to enhance the text representation, and we align the unimodal video-text features with a video-text contrastive loss. Our proposed end-to-end trainable model achieves state-of-the-art performance using only two modalities on three widely used datasets: MOSI, MOSEI, and CH-SIMS. These experimental results validate the effectiveness of our approach for multimodal sentiment analysis.
KW - Multimodal fusion
KW - Multimodal sentiment analysis
KW - Vision-language
UR - http://www.scopus.com/inward/record.url?scp=85175660074&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2023.111136
DO - 10.1016/j.knosys.2023.111136
M3 - Article
AN - SCOPUS:85175660074
SN - 0950-7051
VL - 283
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 111136
ER -