M-SEE: A multi-scale encoder enhancement framework for end-to-end Weakly Supervised Semantic Segmentation

Ziqian Yang, Xinqiao Zhao, Chao Yao, Quan Zhang, Jimin Xiao*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

End-to-end image-level Weakly Supervised Semantic Segmentation (WSSS) has received increasing attention due to its simple yet effective implementation, as it alleviates the laborious annotation costs required by fully supervised semantic segmentation. In this work, we find that a transformer encoder cannot extract all discriminative features under image-level supervision, so the decoder in end-to-end WSSS fails to predict satisfactory segmentation results. To solve this issue, we propose a Multi-Scale Encoder Enhancement (M-SEE) framework that enables the encoder to extract comprehensive discriminative features and improves WSSS performance. Specifically, we first resize the original training image to various scales and compute a Class Activation Map (CAM) for each scaled image. Then, reliable discriminative regions are mined based on the CAMs and the decoder's segmentation results. Finally, a knowledge distillation loss is computed between the original-scale features and the scaled features within the selected reliable discriminative regions. Experimental results show that our M-SEE framework achieves new state-of-the-art performance, with 74.8% mIoU on the PASCAL VOC 2012 test split and 45.8% mIoU on the MS COCO 2014 validation split. Code will be released.
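To make the three-step pipeline concrete, below is a minimal PyTorch sketch of the multi-scale distillation idea, written under stated assumptions rather than as the authors' released implementation: the function name multi_scale_see_loss, the scale set, the threshold tau, the CAM/decoder agreement rule for mining reliable regions, the L2 form of the distillation loss, and the stop-gradient placement are all illustrative choices.

import torch
import torch.nn.functional as F

def multi_scale_see_loss(encoder, classifier, decoder, image, label,
                         scales=(0.5, 1.5), tau=0.7):
    """Distill original-scale encoder features toward scaled-input features
    inside reliable discriminative regions (illustrative sketch only)."""
    feat = encoder(image)              # (B, C, h, w) original-scale features
    seg = decoder(feat).argmax(dim=1)  # (B, h, w) decoder pseudo-segmentation

    loss = image.new_zeros(())
    for s in scales:
        # Rescale the input and extract features at the new scale.
        img_s = F.interpolate(image, scale_factor=s, mode='bilinear',
                              align_corners=False)
        feat_s = encoder(img_s)
        # Align the scaled features back to the original feature resolution.
        feat_s = F.interpolate(feat_s, size=feat.shape[-2:],
                               mode='bilinear', align_corners=False)
        # CAM for this scale: per-location class scores, max-normalized and
        # restricted to classes present in the multi-hot image-level label.
        cam_s = torch.einsum('bchw,kc->bkhw', feat_s, classifier.weight)
        cam_s = F.relu(cam_s) / (cam_s.flatten(2).amax(-1)[..., None, None] + 1e-5)
        cam_s = cam_s * label[:, :, None, None]
        # Reliable regions: the CAM fires strongly AND the decoder assigns
        # the same class (assumed agreement rule, not the paper's exact one).
        reliable = (cam_s.amax(dim=1) > tau) & (cam_s.argmax(dim=1) == seg)
        mask = reliable.unsqueeze(1).float()  # (B, 1, h, w)
        # L2 knowledge-distillation term inside the reliable regions; which
        # side is detached is an assumption of this sketch.
        loss = loss + (mask * (feat - feat_s.detach()) ** 2).sum() / (mask.sum() + 1e-5)
    return loss / len(scales)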

Original language: English
Article number: 111348
Journal: Pattern Recognition
Volume: 162
DOIs
Publication status: Published - Jun 2025

Keywords

  • End-to-end framework
  • Image-level labels
  • Knowledge distillation
  • Weakly Supervised Semantic Segmentation

