Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation

Wangyu Wu, Tianhong Dai, Zhenhong Chen, Xiaowei Huang, Jimin Xiao, Fei Ma*, Renrong Ouyang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Weakly Supervised Semantic Segmentation (WSSS), using only image-level labels, has garnered significant attention due to its cost-effectiveness. Typically, the framework uses image-level labels as training data to generate pixel-level pseudo-labels with refinements. Recently, methods based on Vision Transformers (ViT) have demonstrated superior capabilities in generating reliable pseudo-labels, particularly in recognizing complete object regions. However, current ViT-based approaches have limitations in their use of patch embeddings: they are prone to being dominated by certain abnormal patches, and many multi-stage methods require lengthy, time-consuming training, lacking efficiency. Therefore, in this paper, we introduce a novel ViT-based WSSS method named Adaptive Patch Contrast (APC) that significantly enhances patch embedding learning for improved segmentation effectiveness. APC utilizes an Adaptive-K Pooling (AKP) layer to address the limitations of previous max pooling selection methods. Additionally, we propose Patch Contrastive Learning (PCL) to enhance patch embeddings, thereby further improving the final results. We developed an end-to-end single-stage framework without CAM, which improves training efficiency. Experimental results demonstrate that our method performs exceptionally well on public datasets, outperforming other state-of-the-art WSSS methods with a shorter training time.
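The abstract names two components: an Adaptive-K Pooling (AKP) layer that replaces max pooling over patch scores, and Patch Contrastive Learning (PCL) over patch embeddings. The paper's exact formulations are not given here, so the following is only a minimal sketch of the general ideas: top-k averaging so no single abnormal patch dominates the image-level score, and an InfoNCE-style contrast that pulls same-class patch embeddings together. The `ratio` parameter and the pseudo-label-based positive selection are assumptions for illustration, not the authors' definitions.

```python
import numpy as np

def adaptive_k_pooling(patch_scores, ratio=0.25):
    """Average the top-k patch scores per class instead of taking only the max.

    patch_scores: (num_patches, num_classes) class scores for each ViT patch.
    ratio: hypothetical fraction of patches to pool; the paper's adaptive rule
           for choosing k is not specified in the abstract.
    Returns: (num_classes,) image-level class scores.
    """
    num_patches = patch_scores.shape[0]
    k = max(1, int(num_patches * ratio))
    # Sort each class column descending and average the k strongest responses,
    # so a single abnormal patch cannot dominate the image-level score.
    top_k = np.sort(patch_scores, axis=0)[::-1][:k]
    return top_k.mean(axis=0)

def patch_contrastive_loss(embeddings, pseudo_labels, temperature=0.1):
    """Generic InfoNCE-style contrast over patch embeddings: patches sharing a
    pseudo-label are treated as positives, all other patches as negatives.
    (Illustrative stand-in for PCL, not the paper's exact loss.)"""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature  # cosine similarities scaled by temperature
    n = len(pseudo_labels)
    loss = 0.0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and pseudo_labels[j] == pseudo_labels[i]]
        if not positives:
            continue
        denom = sum(np.exp(sim[i, j]) for j in range(n) if j != i)
        loss -= np.mean([np.log(np.exp(sim[i, j]) / denom) for j in positives])
    return loss / n
```

With `ratio=0.5` and four patches, AKP averages the two highest scores per class, which damps an outlier patch compared with plain max pooling.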

Original language: English
Article number: 109626
Journal: Engineering Applications of Artificial Intelligence
Volume: 139
DOIs
Publication status: Published - Jan 2025

Keywords

  • Contrastive learning
  • Semantic segmentation
  • Vision Transformer
  • Weakly supervised learning
