TY - JOUR
T1 - VLP2MSA
T2 - Expanding vision-language pre-training to multimodal sentiment analysis
AU - Yi, Guofeng
AU - Fan, Cunhang
AU - Zhu, Kang
AU - Lv, Zhao
AU - Liang, Shan
AU - Wen, Zhengqi
AU - Pei, Guanxiong
AU - Li, Taihao
AU - Tao, Jianhua
N1 - Publisher Copyright:
© 2023 Elsevier B.V.
PY - 2024/1/11
Y1 - 2024/1/11
N2 - Large-scale vision-and-language representation learning has improved performance on various joint vision-language downstream tasks. In this work, our objective is to extend it effectively to multimodal sentiment analysis and to address two urgent challenges in this field: (1) the low contribution of the visual modality and (2) the design of an effective multimodal fusion architecture. To overcome the imbalance between the visual and textual modalities, we propose an inter-frame hybrid transformer that extends the recent CLIP and TimeSformer architectures. This module extracts spatio-temporal features from sparsely sampled video frames, capturing not only facial expressions but also body movement information, and thus provides a more comprehensive visual representation than the traditional direct use of pre-extracted facial features. Additionally, we tackle modality heterogeneity in the fusion architecture by introducing a new scheme that prompts and aligns the video and text information before fusing them. Specifically, we generate discriminative text prompts based on the video content to enhance the text representation, and we align the unimodal video-text features with a video-text contrastive loss. Our proposed end-to-end trainable model achieves state-of-the-art performance using only two modalities on three widely used datasets: MOSI, MOSEI, and CH-SIMS. These experimental results validate the effectiveness of our approach for multimodal sentiment analysis.
AB - Large-scale vision-and-language representation learning has improved performance on various joint vision-language downstream tasks. In this work, our objective is to extend it effectively to multimodal sentiment analysis and to address two urgent challenges in this field: (1) the low contribution of the visual modality and (2) the design of an effective multimodal fusion architecture. To overcome the imbalance between the visual and textual modalities, we propose an inter-frame hybrid transformer that extends the recent CLIP and TimeSformer architectures. This module extracts spatio-temporal features from sparsely sampled video frames, capturing not only facial expressions but also body movement information, and thus provides a more comprehensive visual representation than the traditional direct use of pre-extracted facial features. Additionally, we tackle modality heterogeneity in the fusion architecture by introducing a new scheme that prompts and aligns the video and text information before fusing them. Specifically, we generate discriminative text prompts based on the video content to enhance the text representation, and we align the unimodal video-text features with a video-text contrastive loss. Our proposed end-to-end trainable model achieves state-of-the-art performance using only two modalities on three widely used datasets: MOSI, MOSEI, and CH-SIMS. These experimental results validate the effectiveness of our approach for multimodal sentiment analysis.
KW - Multimodal fusion
KW - Multimodal sentiment analysis
KW - Vision-language
UR - http://www.scopus.com/inward/record.url?scp=85175660074&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2023.111136
DO - 10.1016/j.knosys.2023.111136
M3 - Article
AN - SCOPUS:85175660074
SN - 0950-7051
VL - 283
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 111136
ER -