Abstract
End-to-end image-level Weakly Supervised Semantic Segmentation (WSSS) has received increasing attention because it is simple to implement yet effective, and it alleviates the laborious annotation cost of semantic segmentation. In this work, we find that a transformer encoder trained under image-level supervision cannot extract all discriminative features. As a result, the decoder in end-to-end WSSS fails to predict a satisfactory segmentation result. To solve this issue, we propose a Multi-Scale Encoder Enhancement (M-SEE) framework that enables the encoder to extract comprehensive discriminative features and improves WSSS performance. Specifically, we first resize the original training image to several scales and compute a Class Activation Map (CAM) for each scaled image. Then, reliable discriminative regions are mined based on the CAMs and the decoder's segmentation result. Finally, a knowledge distillation loss is calculated between the original-scale features and the scaled features of the selected reliable discriminative regions. Experimental results show that our M-SEE framework achieves new state-of-the-art performance, with 74.8% on the PASCAL VOC 2012 test split and 45.8% on the MS COCO 2014 validation split. Code will be released.
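The three steps named in the abstract (multi-scale inputs, reliable-region mining, feature distillation) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not specify the selection rule, the loss form, or the resizing method, so the threshold rule, the squared-error loss, the nearest-neighbour resize, and all function names are hypothetical stand-ins rather than the M-SEE implementation.

```python
def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of a 2-D grid (list of rows) to out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def reliable_mask(cam, seg, threshold=0.7):
    """Assumed rule: a pixel is reliable when its CAM score is high AND the
    decoder also predicts the class there (seg == 1)."""
    return [[score >= threshold and label == 1
             for score, label in zip(cam_row, seg_row)]
            for cam_row, seg_row in zip(cam, seg)]


def distillation_loss(feat_orig, feat_scaled, mask):
    """Assumed loss: mean squared error between original-scale features and
    scaled-image features (resized back), averaged over reliable pixels only."""
    h, w = len(feat_orig), len(feat_orig[0])
    aligned = resize_nearest(feat_scaled, h, w)
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                total += sum((a - b) ** 2
                             for a, b in zip(feat_orig[y][x], aligned[y][x]))
                count += 1
    return total / count if count else 0.0


# Toy 2x2 example: per-pixel CAM scores, decoder labels, and 2-D feature vectors.
cam = [[0.9, 0.2], [0.8, 0.75]]
seg = [[1, 1], [0, 1]]
mask = reliable_mask(cam, seg)

feat_orig = [[(1.0, 0.0), (0.0, 1.0)], [(0.5, 0.5), (1.0, 1.0)]]
feat_scaled = resize_nearest(feat_orig, 4, 4)  # stand-in for features of a 2x upscaled image
loss = distillation_loss(feat_orig, feat_scaled, mask)
```

In a real pipeline the feature maps and CAMs would come from the encoder and the labels from the decoder; the sketch only shows how a reliability mask can restrict a cross-scale feature loss to selected regions.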
| Original language | English |
|---|---|
| Article number | 111348 |
| Journal | Pattern Recognition |
| Volume | 162 |
| DOIs | |
| Publication status | Published - Jun 2025 |
Keywords
- End-to-end framework
- Image-level labels
- Knowledge distillation
- Weakly Supervised Semantic Segmentation
Fingerprint
Dive into the research topics of 'M-SEE: A multi-scale encoder enhancement framework for end-to-end Weakly Supervised Semantic Segmentation'. Together they form a unique fingerprint.