TY - JOUR
T1 - Scale-Selectable Global Information and Discrepancy Learning Network for Multimodal Sentiment Analysis
AU - He, Xiaojiang
AU - Pan, Yushan
AU - Guo, Xinfei
AU - Xu, Zhijie
AU - Yang, Chenguang
N1 - Publisher Copyright:
© 2010-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Multimodal sentiment analysis and depression detection are pivotal for advancing human-computer interaction, yet significant challenges remain. First, the limited extraction of global contextual information within individual modalities risks the loss of modality-specific features. Second, existing methods often prioritize unaligned textual interactions, neglecting critical inter-modal discrepancies. To address these issues, we propose the Scale-Selectable Global Information and Discrepancy Learning Network (SSGDL), an innovative framework that integrates two core modules: the Cross-Shaped Dynamic Scale Attention Module (CS-DSA) and the Primary-Secondary Modal Discrepancy Learning Module (PS-MDL). The CS-DSA dynamically selects scales and employs cross-shaped attention to capture comprehensive global context and intricate internal correlations, effectively producing a fused modal representation. Meanwhile, the PS-MDL designates the fused modality as primary and utilizes cross-attention mechanisms to learn discrepancy representations between it and the other modalities (textual, acoustic, and visual). By leveraging inter-modal discrepancies, SSGDL achieves a more nuanced and holistic understanding of emotional content. Extensive experiments on three benchmark multimodal sentiment analysis datasets (MOSI, MOSEI, SIMS) and a depression detection dataset (AVEC2019) demonstrate that SSGDL consistently outperforms state-of-the-art approaches, setting a new benchmark for multimodal affective computing.
AB - Multimodal sentiment analysis and depression detection are pivotal for advancing human-computer interaction, yet significant challenges remain. First, the limited extraction of global contextual information within individual modalities risks the loss of modality-specific features. Second, existing methods often prioritize unaligned textual interactions, neglecting critical inter-modal discrepancies. To address these issues, we propose the Scale-Selectable Global Information and Discrepancy Learning Network (SSGDL), an innovative framework that integrates two core modules: the Cross-Shaped Dynamic Scale Attention Module (CS-DSA) and the Primary-Secondary Modal Discrepancy Learning Module (PS-MDL). The CS-DSA dynamically selects scales and employs cross-shaped attention to capture comprehensive global context and intricate internal correlations, effectively producing a fused modal representation. Meanwhile, the PS-MDL designates the fused modality as primary and utilizes cross-attention mechanisms to learn discrepancy representations between it and the other modalities (textual, acoustic, and visual). By leveraging inter-modal discrepancies, SSGDL achieves a more nuanced and holistic understanding of emotional content. Extensive experiments on three benchmark multimodal sentiment analysis datasets (MOSI, MOSEI, SIMS) and a depression detection dataset (AVEC2019) demonstrate that SSGDL consistently outperforms state-of-the-art approaches, setting a new benchmark for multimodal affective computing.
KW - Depression Detection
KW - Inter-modal Discrepancy Learning
KW - Multimodal Sentiment Analysis
KW - Neuro-scientific Theories
KW - Scale-Selectable Global Information
UR - http://www.scopus.com/inward/record.url?scp=105008785242&partnerID=8YFLogxK
U2 - 10.1109/TAFFC.2025.3580779
DO - 10.1109/TAFFC.2025.3580779
M3 - Article
AN - SCOPUS:105008785242
SN - 1949-3045
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
ER -