Text-dominant multimodal perception network for sentiment analysis based on cross-modal semantic enhancements

Zuhe Li, Panbo Liu, Yushan Pan*, Jun Yu, Weihua Liu, Haoran Chen, Yiming Luo, Hao Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Multimodal sentiment analysis (MSA) aims to discern the emotional information that users express in the multimodal data they upload to social media platforms. Most previous studies treated the three modalities (audio A, visual V, and text T) equally, overlooking the lower representation quality inherent in the audio and visual modalities. This oversight often produces inaccurate interaction information when audio or visual modalities serve as the primary input, which in turn degrades the model’s sentiment predictions. In this paper, we propose a text-dominant multimodal perception network with cross-modal transformer-based semantic enhancement. The network consists mainly of a text-dominant multimodal perception (TDMP) module and a cross-modal transformer-based semantic enhancement (TSE) module. TDMP lets the text modality dominate intermodal interactions and extracts highly correlated and differentiated features from each modality, thereby obtaining more accurate representations for each modality. The TSE module uses a transformer architecture to translate the audio and visual modality features into text features. By applying KL divergence constraints, it ensures that the translated modality representations capture as much emotional information as possible while remaining highly similar to the original text modality representations. This enhances the semantics of the original text modality while mitigating the negative impact of the input. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate the effectiveness of the proposed model.
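
The abstract does not give the exact form of the KL divergence constraint, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each representation is turned into a probability distribution with a softmax and that the constraint is KL(text || translated). The feature vectors and the names `softmax`, `kl_divergence`, `text_features`, `translated_audio`, and `unrelated` are hypothetical.

```python
import math


def softmax(values):
    """Turn a feature vector into a probability distribution."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical feature vectors, not taken from the paper.
text_features = [0.9, 0.1, -0.4, 0.3]
translated_audio = [0.7, 0.2, -0.5, 0.4]  # audio features after translation toward text
unrelated = [-1.0, 2.0, 0.5, -2.0]        # a representation far from the text one

p_text = softmax(text_features)
loss_close = kl_divergence(p_text, softmax(translated_audio))
loss_far = kl_divergence(p_text, softmax(unrelated))
```

Under these assumptions, `loss_close` is smaller than `loss_far`, so minimizing such a term would pull a translated representation toward the text representation. A real model would compute this on learned tensors and combine it with the sentiment prediction loss.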

Original language: English
Article number: 188
Journal: Applied Intelligence
Volume: 55
Issue number: 2
Publication status: Published - Jan 2025

Keywords

  • Modality translation
  • Semantic enhancement
  • Text dominance
