TY - JOUR
T1 - Multimodal sentiment analysis based on disentangled representation learning and cross-modal-context association mining
AU - Li, Zuhe
AU - Liu, Panbo
AU - Pan, Yushan
AU - Ding, Weiping
AU - Yu, Jun
AU - Chen, Haoran
AU - Liu, Weihua
AU - Luo, Yiming
AU - Wang, Hao
N1 - Publisher Copyright:
© 2024 Elsevier B.V.
PY - 2025/2/7
Y1 - 2025/2/7
N2 - Multimodal sentiment analysis aims to extract the sentiment information expressed by users from multimodal data, including linguistic, acoustic, and visual cues. However, the heterogeneity of multimodal data leads to disparities in modal distributions, hindering the model's ability to effectively integrate complementary and redundant information across modalities. Additionally, existing approaches often merge modalities directly after obtaining their representations, overlooking potential emotional correlations between them. To tackle these challenges, we propose a Multiview Collaborative Perception (MVCP) framework for multimodal sentiment analysis. The framework consists of two main modules: Multimodal Disentangled Representation Learning (MDRL) and Cross-Modal Context Association Mining (CMCAM). The MDRL module employs a joint learning layer comprising a common encoder and an exclusive encoder. This layer maps multimodal data onto a hypersphere and learns common and exclusive representations for each modality, thus mitigating the semantic gap arising from modal heterogeneity. To further bridge semantic gaps and capture complex inter-modal correlations, the CMCAM module uses multiple attention mechanisms to mine cross-modal and contextual sentiment associations, yielding joint representations with rich multimodal semantic interactions. At this stage, the CMCAM module mines correlations only among the common representations, so that the exclusive representations of the different modalities are preserved. Finally, a multitask learning framework is adopted to share parameters across single-modal tasks and improve sentiment prediction performance. Experimental results on the MOSI and MOSEI datasets demonstrate the effectiveness of the proposed method.
AB - Multimodal sentiment analysis aims to extract the sentiment information expressed by users from multimodal data, including linguistic, acoustic, and visual cues. However, the heterogeneity of multimodal data leads to disparities in modal distributions, hindering the model's ability to effectively integrate complementary and redundant information across modalities. Additionally, existing approaches often merge modalities directly after obtaining their representations, overlooking potential emotional correlations between them. To tackle these challenges, we propose a Multiview Collaborative Perception (MVCP) framework for multimodal sentiment analysis. The framework consists of two main modules: Multimodal Disentangled Representation Learning (MDRL) and Cross-Modal Context Association Mining (CMCAM). The MDRL module employs a joint learning layer comprising a common encoder and an exclusive encoder. This layer maps multimodal data onto a hypersphere and learns common and exclusive representations for each modality, thus mitigating the semantic gap arising from modal heterogeneity. To further bridge semantic gaps and capture complex inter-modal correlations, the CMCAM module uses multiple attention mechanisms to mine cross-modal and contextual sentiment associations, yielding joint representations with rich multimodal semantic interactions. At this stage, the CMCAM module mines correlations only among the common representations, so that the exclusive representations of the different modalities are preserved. Finally, a multitask learning framework is adopted to share parameters across single-modal tasks and improve sentiment prediction performance. Experimental results on the MOSI and MOSEI datasets demonstrate the effectiveness of the proposed method.
KW - Linguistic-guided multihead attention
KW - Multimodal association mining
KW - Multimodal fusion
KW - Multimodal representation learning
KW - Multimodal sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85210365078&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2024.128940
DO - 10.1016/j.neucom.2024.128940
M3 - Article
AN - SCOPUS:85210365078
SN - 0925-2312
VL - 617
JO - Neurocomputing
JF - Neurocomputing
M1 - 128940
ER -