TY - JOUR
T1 - Text-guided deep correlation mining and self-learning feature fusion framework for multimodal sentiment analysis
AU - Zhu, Minghui
AU - He, Xiaojiang
AU - Qiao, Baojie
AU - Luo, Yiming
AU - Li, Zuhe
AU - Pan, Yushan
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2025/4/22
Y1 - 2025/4/22
N2 - Multimodal sentiment analysis has garnered widespread attention due to its applications in fields such as human–robot interaction, offering a 10% to 20% improvement in binary accuracy over unimodal sentiment analysis. However, existing methods still face significant challenges: (1) insufficient utilization of textual information, which impacts the effectiveness of modality fusion and correlation mining; (2) excessive focus on modality fusion, with a lack of in-depth exploration of the correlations between individual modalities; and (3) the absence of unimodal labels in most multimodal sentiment analysis datasets, leading to challenges in co-learning scenarios. To address these issues, we propose a text-guided deep correlation mining and self-learning feature fusion framework based on a multi-task learning strategy. The framework divides sentiment analysis into a multimodal task and three unimodal tasks (linguistic, acoustic, and visual). For the unimodal tasks, we design the Text-Guided Deep Information Correlation Mining Module (TUDCM), which fully explores the correlations between modalities under the guidance of textual information. For the multimodal task, we introduce a Self-Learning Text-Guided Multimodal Fusion Attention (SLTG-Attention) mechanism to enhance the role of textual information and adaptively learn the relationships between modalities for efficient fusion. Additionally, we design a Multi-Distance Label Generation Module (MDLGM) to generate more accurate unimodal labels for co-learning scenarios. Extensive experiments on the MOSI, MOSEI, and SIMS datasets demonstrate that our framework significantly outperforms existing methods, achieving an improvement of approximately 1% in accuracy. On the MOSI dataset, our method achieves an MAE of 0.672, a correlation of 0.816, a binary accuracy of 86.46%, and an F1 score of 86.52%, with similarly strong results on the other datasets.
AB - Multimodal sentiment analysis has garnered widespread attention due to its applications in fields such as human–robot interaction, offering a 10% to 20% improvement in binary accuracy over unimodal sentiment analysis. However, existing methods still face significant challenges: (1) insufficient utilization of textual information, which impacts the effectiveness of modality fusion and correlation mining; (2) excessive focus on modality fusion, with a lack of in-depth exploration of the correlations between individual modalities; and (3) the absence of unimodal labels in most multimodal sentiment analysis datasets, leading to challenges in co-learning scenarios. To address these issues, we propose a text-guided deep correlation mining and self-learning feature fusion framework based on a multi-task learning strategy. The framework divides sentiment analysis into a multimodal task and three unimodal tasks (linguistic, acoustic, and visual). For the unimodal tasks, we design the Text-Guided Deep Information Correlation Mining Module (TUDCM), which fully explores the correlations between modalities under the guidance of textual information. For the multimodal task, we introduce a Self-Learning Text-Guided Multimodal Fusion Attention (SLTG-Attention) mechanism to enhance the role of textual information and adaptively learn the relationships between modalities for efficient fusion. Additionally, we design a Multi-Distance Label Generation Module (MDLGM) to generate more accurate unimodal labels for co-learning scenarios. Extensive experiments on the MOSI, MOSEI, and SIMS datasets demonstrate that our framework significantly outperforms existing methods, achieving an improvement of approximately 1% in accuracy. On the MOSI dataset, our method achieves an MAE of 0.672, a correlation of 0.816, a binary accuracy of 86.46%, and an F1 score of 86.52%, with similarly strong results on the other datasets.
KW - Multi-Distance Label Generation
KW - Multimodal sentiment analysis
KW - Self-supervised multimodal fusion
KW - Text-guided correlation mining
UR - https://www.scopus.com/pages/publications/85219494864
U2 - 10.1016/j.knosys.2025.113249
DO - 10.1016/j.knosys.2025.113249
M3 - Article
AN - SCOPUS:85219494864
SN - 0950-7051
VL - 315
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 113249
ER -