CLIMS++: Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation

Jinheng Xie; Songhe Deng; Xianxu Hou; Zhaochuan Luo; Linlin Shen; Yawen Huang; Yefeng Zheng; Mike Zheng Shou

doi:10.1007/s11263-025-02442-2

Corresponding author: Linlin Shen

Research output: Contribution to journal › Article › peer-review

Abstract

While weakly-supervised semantic segmentation (WSSS) has achieved promising results, the limited supervision provided by image-level tags inevitably induces reliance on discriminative regions and spurious relations between target classes and background regions. As a result, a Class Activation Map (CAM) tends to activate only the most discriminative object regions and to falsely include many class-related backgrounds. Without pixel-level supervision, it is very difficult to enlarge the foreground activation and suppress these false activations of background regions. In this paper, we propose a novel framework, Cross Language Image Matching with Automatic Context Discovery (CLIMS++), based on the recently introduced Contrastive Language-Image Pre-training (CLIP) model, for WSSS. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress class-related background regions in the CAM. In particular, we design object, background region, and text label matching losses to guide the model to excite more reasonable object regions for each category. In addition, we propose to automatically find spurious relations between foreground categories and backgrounds, from which a background suppression loss is designed to suppress the activation of class-related backgrounds. These designs enable the proposed CLIMS++ to generate more complete and compact activation maps for the target objects. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets show that CLIMS++ significantly outperforms previous state-of-the-art methods.
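
The abstract does not give the loss formulas, so the following is only a rough sketch of how an object region matching loss and a background region matching loss could be computed from a CAM and a text embedding. Everything in it is an assumption made for illustration: the function names, the log-based loss form, the rescaling of cosine similarity to [0, 1], and the random-projection `toy_encoder` that stands in for a CLIP image encoder. It is not the authors' implementation, and it does not cover the text label matching or background suppression losses.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def matching_losses(image, cam, text_emb, encode_image, eps=1e-6):
    """Hypothetical object/background region matching losses for one class.

    image: (H, W, 3) array; cam: (H, W) activation map in [0, 1];
    text_emb: embedding of the class label prompt;
    encode_image: any callable mapping an image to a vector in the same
    space as text_emb (a stand-in for a CLIP image encoder).
    """
    fg = image * cam[..., None]          # keep the activated (object) region
    bg = image * (1.0 - cam[..., None])  # keep the complementary region
    # Rescale cosine similarity from [-1, 1] to [0, 1] so the logs are defined.
    s_fg = np.clip(0.5 * (cosine(encode_image(fg), text_emb) + 1.0), eps, 1 - eps)
    s_bg = np.clip(0.5 * (cosine(encode_image(bg), text_emb) + 1.0), eps, 1 - eps)
    loss_object = -np.log(s_fg)            # pull the object region toward the label
    loss_background = -np.log(1.0 - s_bg)  # push the remaining region away from it
    return loss_object, loss_background

# Toy usage with a fixed random projection instead of a real encoder.
rng = np.random.default_rng(0)
proj = rng.standard_normal((8 * 8 * 3, 16))

def toy_encoder(img):
    return img.reshape(-1) @ proj

image = rng.random((8, 8, 3))
cam = rng.random((8, 8))
text_emb = rng.standard_normal(16)
loss_o, loss_b = matching_losses(image, cam, text_emb, toy_encoder)
```

In this sketch both losses are non-negative, and minimizing them with respect to the CAM would favor maps whose masked-in region matches the label text while the masked-out region does not.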

Original language	English
Journal	International Journal of Computer Vision
DOIs	https://doi.org/10.1007/s11263-025-02442-2
Publication status	Accepted/In press - 2025

Keywords

Multi-modal learning
Semantic segmentation
Weakly-supervised learning

Cite this

Xie, J., Deng, S., Hou, X., Luo, Z., Shen, L., Huang, Y., Zheng, Y., & Shou, M. Z. (Accepted/In press). CLIMS++: Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation. International Journal of Computer Vision. https://doi.org/10.1007/s11263-025-02442-2

@article{8efa43ee82354454a5a335008e305803,

title = "CLIMS++: Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation",

abstract = "While promising results have been achieved in weakly-supervised semantic segmentation (WSSS), limited supervision from image-level tags inevitably induces discriminative reliance and spurious relations between target classes and background regions. Thus, Class Activation Map (CAM) usually tends to activate discriminative object regions and falsely includes lots of class-related backgrounds. Without pixel-level supervisions, it could be very difficult to enlarge the foreground activation and suppress those false activation of background regions. In this paper, we propose a novel framework of Cross Language Image Matching with Automatic Context Discovery (CLIMS++), based on the recently introduced Contrastive Language-Image Pre-training (CLIP) model, for WSSS. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress class-related background regions in CAM. In particular, we design object, background region, and text label matching losses to guide the model to excite more reasonable object regions of each category. In addition, we propose to automatically find spurious relations between foreground categories and backgrounds, through which a background suppression loss is designed to suppress the activation of class-related backgrounds. The above designs enable the proposed CLIMS++ to generate a more complete and compact activation map for the target objects. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 datasets show that our CLIMS++ significantly outperforms the previous state-of-the-art methods.",

keywords = "Multi-modal learning, Semantic segmentation, Weakly-supervised learning",

author = "Jinheng Xie and Songhe Deng and Xianxu Hou and Zhaochuan Luo and Linlin Shen and Yawen Huang and Yefeng Zheng and Shou, {Mike Zheng}",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.",

year = "2025",

doi = "10.1007/s11263-025-02442-2",

language = "English",

journal = "International Journal of Computer Vision",

issn = "0920-5691",

}

Xie, J, Deng, S, Hou, X, Luo, Z, Shen, L, Huang, Y, Zheng, Y & Shou, MZ 2025, 'CLIMS++: Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation', International Journal of Computer Vision. https://doi.org/10.1007/s11263-025-02442-2

CLIMS++: Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation. / Xie, Jinheng; Deng, Songhe; Hou, Xianxu et al.
In: International Journal of Computer Vision, 2025.

TY - JOUR

T1 - CLIMS++: Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation

AU - Xie, Jinheng

AU - Deng, Songhe

AU - Hou, Xianxu

AU - Luo, Zhaochuan

AU - Shen, Linlin

AU - Huang, Yawen

AU - Zheng, Yefeng

AU - Shou, Mike Zheng

N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.

PY - 2025

Y1 - 2025

AB - While promising results have been achieved in weakly-supervised semantic segmentation (WSSS), limited supervision from image-level tags inevitably induces discriminative reliance and spurious relations between target classes and background regions. Thus, Class Activation Map (CAM) usually tends to activate discriminative object regions and falsely includes lots of class-related backgrounds. Without pixel-level supervisions, it could be very difficult to enlarge the foreground activation and suppress those false activation of background regions. In this paper, we propose a novel framework of Cross Language Image Matching with Automatic Context Discovery (CLIMS++), based on the recently introduced Contrastive Language-Image Pre-training (CLIP) model, for WSSS. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress class-related background regions in CAM. In particular, we design object, background region, and text label matching losses to guide the model to excite more reasonable object regions of each category. In addition, we propose to automatically find spurious relations between foreground categories and backgrounds, through which a background suppression loss is designed to suppress the activation of class-related backgrounds. The above designs enable the proposed CLIMS++ to generate a more complete and compact activation map for the target objects. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 datasets show that our CLIMS++ significantly outperforms the previous state-of-the-art methods.

KW - Multi-modal learning

KW - Semantic segmentation

KW - Weakly-supervised learning

UR - http://www.scopus.com/inward/record.url?scp=105004450811&partnerID=8YFLogxK

U2 - 10.1007/s11263-025-02442-2

DO - 10.1007/s11263-025-02442-2

M3 - Article

AN - SCOPUS:105004450811

SN - 0920-5691

JO - International Journal of Computer Vision

JF - International Journal of Computer Vision

ER -
