TY - JOUR
T1 - Scale-Selectable Global Information and Discrepancy Learning Network for Multimodal Sentiment Analysis
AU - He, Xiaojiang
AU - Pan, Yushan
AU - Guo, Xinfei
AU - Xu, Zhijie
AU - Yang, Chenguang
N1 - Received 5 January 2025; revised 1 June 2025; accepted 15 June 2025. Date of publication 18 June 2025; date of current version 3 December 2025.
Publisher Copyright:
© 2010-2012 IEEE.
PY - 2025/12
Y1 - 2025/12
N2 - Multimodal sentiment analysis and depression detection are pivotal for advancing human-computer interaction, yet significant challenges remain. First, the limited extraction of global contextual information within individual modalities risks the loss of modal-specific features. Second, existing methods often prioritize unaligned textual interactions, neglecting critical inter-modal discrepancies. To address these issues, we propose the Scale-Selectable Global and Discrepancy Learning Network (SSGDL), an innovative framework that integrates two core modules: the Cross-Shaped Dynamic Scale Attention Module (CS-DSA) and the Primary-Secondary modal Discrepancy Learning Module (PS-MDL). The CS-DSA dynamically selects scales and employs cross-shaped attention to capture comprehensive global context and intricate internal correlations, effectively producing a fused modal representation. Meanwhile, the PS-MDL designates the fused modal as primary and utilizes cross-attention mechanisms to learn discrepancy representations between it and other modalities (textual, acoustic, and visual). By leveraging inter-modal discrepancies, SSGDL achieves a more nuanced and holistic understanding of emotional content. Extensive experiments on three benchmark multimodal sentiment analysis datasets (MOSI, MOSEI, SIMS) and a depression detection dataset (AVEC2019) demonstrate that SSGDL consistently outperforms state-of-the-art approaches, setting a new benchmark for multimodal affective computing.
AB - Multimodal sentiment analysis and depression detection are pivotal for advancing human-computer interaction, yet significant challenges remain. First, the limited extraction of global contextual information within individual modalities risks the loss of modal-specific features. Second, existing methods often prioritize unaligned textual interactions, neglecting critical inter-modal discrepancies. To address these issues, we propose the Scale-Selectable Global and Discrepancy Learning Network (SSGDL), an innovative framework that integrates two core modules: the Cross-Shaped Dynamic Scale Attention Module (CS-DSA) and the Primary-Secondary modal Discrepancy Learning Module (PS-MDL). The CS-DSA dynamically selects scales and employs cross-shaped attention to capture comprehensive global context and intricate internal correlations, effectively producing a fused modal representation. Meanwhile, the PS-MDL designates the fused modal as primary and utilizes cross-attention mechanisms to learn discrepancy representations between it and other modalities (textual, acoustic, and visual). By leveraging inter-modal discrepancies, SSGDL achieves a more nuanced and holistic understanding of emotional content. Extensive experiments on three benchmark multimodal sentiment analysis datasets (MOSI, MOSEI, SIMS) and a depression detection dataset (AVEC2019) demonstrate that SSGDL consistently outperforms state-of-the-art approaches, setting a new benchmark for multimodal affective computing.
KW - Multimodal sentiment analysis
KW - depression detection
KW - inter-modal discrepancy learning
KW - neuro-scientific theories
KW - scale-selectable global information
UR - https://www.scopus.com/pages/publications/105008785242
U2 - 10.1109/TAFFC.2025.3580779
DO - 10.1109/TAFFC.2025.3580779
M3 - Article
AN - SCOPUS:105008785242
SN - 1949-3045
VL - 16
SP - 3169
EP - 3182
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
IS - 4
ER -