Text-guided deep correlation mining and self-learning feature fusion framework for multimodal sentiment analysis
Abstract
Multimodal sentiment analysis has garnered widespread attention due to its applications in fields such as human–robot interaction, offering a 10% to 20% improvement in binary accuracy over unimodal sentiment analysis. However, existing methods still face significant challenges: (1) insufficient utilization of textual information, which impacts the effectiveness of modality fusion and correlation mining; (2) excessive focus on modality fusion, with a lack of in-depth exploration of the correlations between individual modalities; and (3) the absence of unimodal labels in most multimodal sentiment analysis datasets, leading to challenges in co-learning scenarios. To address these issues, we propose a text-guided deep correlation mining and self-learning feature fusion framework using a multi-task learning strategy. This framework divides sentiment analysis into a multimodal task and three unimodal tasks (linguistic, acoustic, and visual). For the unimodal tasks, we design the Text-Guided Deep Information Correlation Mining Module (TUDCM), which fully explores the correlations between modalities under the guidance of textual information. For the multimodal task, we introduce a Self-Learning Text-Guided Multimodal Fusion Attention (SLTG-Attention) mechanism to enhance the role of textual information and adaptively learn relationships between modalities for efficient fusion. Additionally, we design a Multi-Distance Label Generation Module (MDLGM) to generate more accurate unimodal labels for co-learning scenarios. Extensive experiments on the MOSI, MOSEI, and SIMS datasets demonstrate that our framework significantly outperforms existing methods, achieving an approximate 1% improvement in accuracy. On the MOSI dataset, our method achieves 0.672 MAE, 0.816 correlation, 86.46% binary accuracy, and an 86.52% F1 score, with similarly strong results on the other datasets.
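The abstract does not spell out how SLTG-Attention is implemented. As a rough, illustrative sketch of the general idea it names — text features guiding the fusion of acoustic and visual features via cross-attention — the snippet below uses plain scaled dot-product attention with text tokens as queries. All names, shapes, and the uniform fusion weights are assumptions for illustration, not the paper's actual module (which learns its fusion weights).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Scaled dot-product attention: queries from one modality,
    keys/values from another (text-guided when q is the text)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (Lq, Lk) similarity of each text token to each frame
    return softmax(scores, axis=-1) @ v    # (Lq, d) text-aligned summary of the other modality

rng = np.random.default_rng(0)
d = 8
text   = rng.standard_normal((5, d))   # 5 text tokens
audio  = rng.standard_normal((12, d))  # 12 acoustic frames
visual = rng.standard_normal((7, d))   # 7 visual frames

# Text-guided step: text queries attend over each non-text modality,
# so acoustic/visual information is re-expressed per text token.
t_a = cross_attention(text, audio, audio)
t_v = cross_attention(text, visual, visual)

# Fusion step: the paper learns adaptive modality weights; here they
# are simply uniform, followed by mean pooling to an utterance vector.
fused = (text + t_a + t_v).mean(axis=0) / 3.0
print(fused.shape)  # (8,)
```

In the real framework this utterance-level vector would feed a sentiment regression head; the sketch only shows how text-as-query attention lets the textual modality steer which acoustic and visual frames contribute to the fused representation.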
| Original language | English |
|---|---|
| Article number | 113249 |
| Journal | Knowledge-Based Systems |
| Volume | 315 |
| DOIs | |
| Publication status | Published - 22 Apr 2025 |
Keywords
- Multi-Distance Label Generation
- Multimodal sentiment analysis
- Self-supervised multimodal fusion
- Text-guided correlation mining