Multi-level correlation mining framework with self-supervised label generation for multimodal sentiment analysis

Zuhe Li, Qingbing Guo, Yushan Pan*, Weiping Ding, Jun Yu, Yazhou Zhang, Weihua Liu, Haoran Chen, Hao Wang, Ying Xie

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

Fusion and co-learning are major challenges in multimodal sentiment analysis. Most existing methods either ignore the basic relationships among modalities or fail to exploit their potential correlations, and they do not transfer knowledge from resource-rich modalities to the analysis of resource-poor ones. To address these challenges, we propose a multimodal sentiment analysis method based on multi-level correlation mining and self-supervised multi-task learning. First, to overcome the difficulty of multimodal information fusion, we propose a Transformer-based framework guided by unimodal feature fusion and linguistics: the multi-level correlation mining framework. This module exploits correlation information between modalities from low to high levels. Second, we divide the multimodal sentiment analysis task into one multimodal task and three unimodal tasks (linguistic, acoustic, and visual) and design a self-supervised label generation module (SLGM) to generate sentiment labels for the unimodal tasks. SLGM-based multi-task learning overcomes the lack of unimodal labels in co-learning. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that the proposed multi-level correlation mining framework outperforms state-of-the-art methods.
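The abstract does not detail how the SLGM derives unimodal labels; a common self-supervised strategy in this line of work (e.g., Self-MM-style label generation) is to offset the known multimodal label for each modality according to how close that modality's representation lies to positive versus negative class centers. The sketch below illustrates that idea only; the function name, `scale` parameter, and center-based offset are illustrative assumptions, not the paper's exact SLGM.

```python
import numpy as np

def generate_unimodal_labels(feats, y_m, pos_center, neg_center, scale=0.5):
    """Hypothetical sketch of self-supervised unimodal label generation.

    feats      : (N, D) unimodal representations for one modality
    y_m        : (N,)   ground-truth multimodal sentiment labels
    pos_center : (D,)   running center of positive-sentiment features
    neg_center : (D,)   running center of negative-sentiment features
    scale      : strength of the self-supervised offset (assumed hyperparameter)
    """
    # Distance of each sample to the positive and negative centers
    d_pos = np.linalg.norm(feats - pos_center, axis=1)
    d_neg = np.linalg.norm(feats - neg_center, axis=1)
    # Relative offset in [-1, 1]: negative when the sample sits closer
    # to the positive center, positive when closer to the negative one
    offset = (d_pos - d_neg) / (d_pos + d_neg + 1e-8)
    # Shift the multimodal label toward the sentiment this modality suggests
    return y_m - scale * offset
```

In a multi-task setup, labels produced this way would supervise the three unimodal branches while the original annotations supervise the multimodal task, so no manual unimodal labeling is required.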

Original language: English
Article number: 101891
Journal: Information Fusion
Volume: 99
DOIs
Publication status: Published - Nov 2023

Keywords

  • Linguistic-guided transformer
  • Multimodal sentiment analysis
  • Self-supervised label generation
  • Unimodal feature fusion
