Establishing Nuanced Multi-Modal Attention for Weakly Supervised Semantic Segmentation of Remote Sensing Scenes

Qiming Zhang, Junjie Zhang*, Huaxi Huang, Fangyu Wu, Hongwen Yu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels reduces reliance on pixel-level annotations for remote sensing (RS) imagery. However, even in natural scenes, WSSS frequently suffers from imprecise localization, extraneous activations, and class ambiguity. These challenges are particularly pronounced in RS images, which are characterized by complex backgrounds, substantial scale variations, and dense distributions of small objects, making it harder to disentangle intra-class variation from inter-class similarity. To tackle these challenges, we introduce a class-constrained multi-modal attention framework aimed at improving the localization accuracy of class activation maps (CAMs). Specifically, we design class-specific tokens to capture the visual characteristics of each target class. Because these tokens initially lack explicit constraints, we integrate the textual branch of the RemoteCLIP model to leverage class-related linguistic priors, which collaborate with visual features to encode the specific semantics of diverse objects. Furthermore, a multi-modal collaborative optimization module dynamically establishes tailored attention mechanisms for both global and regional features, improving class discriminability among targets and thereby mitigating inter-class similarity and dense small-object distributions. By refining class-specific attention, textual semantic attention, and patch-level pairwise affinity weights, the quality of the generated pseudo-masks is markedly enhanced. Concurrently, to ensure domain-invariant feature learning, we align the backbone features with the CLIP visual embedding by minimizing the distribution disparity between the two in the latent space, thereby preserving semantic consistency. Experimental results validate the effectiveness and robustness of the proposed method, which achieves significant performance improvements on two representative RS WSSS datasets.
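
The abstract outlines several mechanisms: class-specific tokens, linguistic priors from RemoteCLIP's text branch, and an alignment of backbone features with the frozen CLIP visual embedding. As a rough, non-authoritative illustration of the first and last of these (not the authors' code, which is not given here), the following PyTorch sketch pairs learnable class tokens with frozen text embeddings via cross-attention, and uses a simple cosine-distance penalty as one plausible reading of "minimizing the distribution disparity in the latent space"; all module and function names (ClassTokenTextAttention, clip_alignment_loss) are hypothetical.

# Minimal sketch (assumed names, not the paper's implementation) of:
# (1) class-specific tokens refined by text priors via cross-attention, and
# (2) aligning backbone features with a frozen CLIP visual embedding.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassTokenTextAttention(nn.Module):
    """Learnable class tokens attend to patch features; frozen text
    embeddings (e.g., from RemoteCLIP's text branch) are injected as priors."""

    def __init__(self, num_classes: int, dim: int, num_heads: int = 8):
        super().__init__()
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_proj = nn.Linear(dim, dim)  # maps text priors into visual space

    def forward(self, patch_feats, text_embeds):
        # patch_feats: (B, N, D) patch tokens from the visual backbone
        # text_embeds: (C, D) frozen class-name embeddings (linguistic priors)
        B = patch_feats.size(0)
        q = self.class_tokens.unsqueeze(0).expand(B, -1, -1)  # (B, C, D)
        q = q + self.text_proj(text_embeds).unsqueeze(0)      # inject text priors
        class_feats, attn_w = self.attn(q, patch_feats, patch_feats)
        # attn_w: (B, C, N) class-to-patch attention, reshapeable into a coarse CAM
        return class_feats, attn_w

def clip_alignment_loss(backbone_feats, clip_feats):
    """Pull backbone features toward frozen CLIP visual embeddings; cosine
    distance on L2-normalized features is one simple choice of disparity."""
    b = F.normalize(backbone_feats, dim=-1)
    c = F.normalize(clip_feats.detach(), dim=-1)  # CLIP branch stays frozen
    return (1.0 - (b * c).sum(dim=-1)).mean()

# Toy usage: 6 classes, 196 patches, 512-dim features.
if __name__ == "__main__":
    module = ClassTokenTextAttention(num_classes=6, dim=512)
    patches = torch.randn(2, 196, 512)
    text = torch.randn(6, 512)
    feats, cam = module(patches, text)
    loss = clip_alignment_loss(patches.mean(1), torch.randn(2, 512))
    print(feats.shape, cam.shape, loss.item())

The class-to-patch attention weights returned here are one natural place to read off a coarse per-class activation map; the paper's further refinement of class-specific attention, textual semantic attention, and patch-level pairwise affinity is not reproduced in this sketch.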

Original language: English
Article number: 0b00006493e2ae07
Journal: IEEE Geoscience and Remote Sensing Letters
Publication status: Accepted/In press - 2025

Keywords

  • Image-level
  • Multi-Modal Attention
  • Textual Semantic Attention
