Multimodal sentiment analysis based on disentangled representation learning and cross-modal-context association mining

Zuhe Li, Panbo Liu, Yushan Pan, Weiping Ding*, Jun Yu, Haoran Chen, Weihua Liu, Yiming Luo, Hao Wang

*Corresponding author for this work

Research output: Contribution to journal > Article > peer-review

Abstract

Multimodal sentiment analysis aims to extract the sentiment that users express in multimodal data, including linguistic, acoustic, and visual cues. However, the heterogeneity of multimodal data leads to disparities in the distributions of the modalities, which limits a model's ability to integrate complementary and redundant information across modalities. Additionally, existing approaches often merge modalities directly after obtaining their representations, overlooking potential emotional correlations between them. To tackle these challenges, we propose a Multiview Collaborative Perception (MVCP) framework for multimodal sentiment analysis. The framework consists primarily of two modules: Multimodal Disentangled Representation Learning (MDRL) and Cross-Modal Context Association Mining (CMCAM). The MDRL module employs a joint learning layer comprising a common encoder and an exclusive encoder. This layer maps multimodal data onto a hypersphere and learns common and exclusive representations for each modality, thereby mitigating the semantic gap arising from modal heterogeneity. To further bridge semantic gaps and capture complex inter-modal correlations, the CMCAM module uses multiple attention mechanisms to mine cross-modal and contextual sentiment associations, yielding joint representations with rich multimodal semantic interactions. At this stage, the CMCAM module mines correlations only among the common representations, so that the exclusive representations of the different modalities are preserved. Finally, a multitask learning framework is adopted to share parameters across the single-modal tasks and improve sentiment prediction performance. Experimental results on the MOSI and MOSEI datasets demonstrate the effectiveness of the proposed method.
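The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the encoders are untrained random linear maps, the feature dimensions are invented, "mapping to a hypersphere" is assumed to mean L2 normalisation, and the use of text as the attention query is an assumption suggested only by the keyword on linguistic-guided attention. All names (`W_common`, `W_exclusive`, `cross_attention`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each vector to (approximately) unit length, i.e. onto a hypersphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def linear(in_dim, out_dim):
    """Random weights standing in for a trained encoder (hypothetical)."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

def cross_attention(query, key, value):
    """Scaled dot-product attention: each query step attends over key/value steps."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights

SEQ_LEN, DIM = 5, 16  # assumed toy sizes
modalities = ("text", "audio", "vision")

# Assumption: features are already projected to a shared dimension.
features = {m: rng.standard_normal((SEQ_LEN, DIM)) for m in modalities}

# One common encoder shared by all modalities, one exclusive encoder per modality.
W_common = linear(DIM, DIM)
W_exclusive = {m: linear(DIM, DIM) for m in modalities}

common = {m: l2_normalize(features[m] @ W_common) for m in modalities}
exclusive = {m: l2_normalize(features[m] @ W_exclusive[m]) for m in modalities}

# Attention is applied to the common representations only;
# the exclusive representations are passed through untouched.
attended = []
attn_weights = {}
for m in ("audio", "vision"):
    out, attn_weights[m] = cross_attention(common["text"], common[m], common[m])
    attended.append(out)

# Joint representation: text common + attended audio/vision + all exclusive parts.
joint = np.concatenate(
    [common["text"], *attended] + [exclusive[m] for m in modalities], axis=-1
)
print(joint.shape)
```

With the toy sizes above, `joint` has shape `(5, 96)`: six blocks of 16 features per time step. In a multitask setup, per-modality prediction heads would additionally be trained alongside the head on `joint`; that part is omitted here because the abstract does not specify the losses.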

Original language: English
Article number: 128940
Journal: Neurocomputing
Volume: 617
Publication status: Published - 7 Feb 2025

Keywords

  • Linguistic-guided multihead attention
  • Multimodal association mining
  • Multimodal fusion
  • Multimodal representation learning
  • Multimodal sentiment analysis
