Generative Prompt Controlled Diffusion for weakly supervised semantic segmentation

Wangyu Wu, Tianhong Dai, Zhenhong Chen, Xiaowei Huang, Fei Ma*, Jimin Xiao

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

Abstract

Weakly supervised semantic segmentation (WSSS), aiming to train segmentation models solely using image-level labels, has received significant attention. Existing approaches mainly concentrate on creating high-quality pseudo labels by utilizing existing images and their corresponding image-level labels. However, a major challenge arises when the available dataset is limited, as the quality of pseudo labels degrades significantly. In this paper, we tackle this challenge from a different perspective by introducing a novel approach called Generative Prompt Controlled Diffusion (GPCD) for data augmentation. This approach enhances the current labeled datasets by augmenting them with a variety of images, achieved through controlled diffusion guided by Generative Pre-trained Transformer (GPT) prompts. In this process, the existing images and image-level labels provide the necessary control information, while GPT enriches the prompts to generate diverse backgrounds. Moreover, we make an original contribution by integrating data source information as tokens into the Vision Transformer (ViT) framework, which improves the ability of downstream WSSS models to recognize the origins of augmented images. Our proposed GPCD approach clearly surpasses existing state-of-the-art methods, with its advantages being more pronounced when the available data is scarce, thereby demonstrating the effectiveness of our method. Our source code will be released.

Original languageEnglish
Article number130103
JournalNeurocomputing
Volume638
DOIs
Publication statusPublished - 14 Jul 2025

Keywords

  • Diffusion model
  • Prompt generation
  • Semantic segmentation
  • Weakly-supervised learning

Fingerprint

Dive into the research topics of 'Generative Prompt Controlled Diffusion for weakly supervised semantic segmentation'. Together they form a unique fingerprint.

Cite this