TY - JOUR
T1 - M-SEE
T2 - A multi-scale encoder enhancement framework for end-to-end Weakly Supervised Semantic Segmentation
AU - Yang, Ziqian
AU - Zhao, Xinqiao
AU - Yao, Chao
AU - Zhang, Quan
AU - Xiao, Jimin
N1 - Publisher Copyright:
© 2025 Elsevier Ltd
PY - 2025/6
Y1 - 2025/6
N2 - End-to-end image-level Weakly Supervised Semantic Segmentation (WSSS) has received increasing attention due to its simple yet effective implementation. It helps to alleviate the laborious annotation cost required for semantic segmentation. In this work, we find that not all discriminative features can be extracted by a transformer encoder under image-level supervision. Thus, the decoder in end-to-end WSSS fails to predict a satisfactory segmentation result. To solve this issue, we propose a Multi-Scale Encoder Enhancement (M-SEE) framework that enables the encoder to extract comprehensive discriminative features and improves WSSS performance. Specifically, we first resize the original training image to various scales and calculate a Class Activation Map (CAM) for each scaled image. Then, reliable discriminative regions are mined based on the CAM and the decoder segmentation result. Finally, a knowledge distillation loss is calculated between the original-scale features and the scaled features of the selected reliable discriminative regions. Experimental results show that our M-SEE framework achieves new state-of-the-art performance with 74.8% on the PASCAL VOC 2012 test split and 45.8% on the MS COCO 2014 validation split. Code will be released.
AB - End-to-end image-level Weakly Supervised Semantic Segmentation (WSSS) has received increasing attention due to its simple yet effective implementation. It helps to alleviate the laborious annotation cost required for semantic segmentation. In this work, we find that not all discriminative features can be extracted by a transformer encoder under image-level supervision. Thus, the decoder in end-to-end WSSS fails to predict a satisfactory segmentation result. To solve this issue, we propose a Multi-Scale Encoder Enhancement (M-SEE) framework that enables the encoder to extract comprehensive discriminative features and improves WSSS performance. Specifically, we first resize the original training image to various scales and calculate a Class Activation Map (CAM) for each scaled image. Then, reliable discriminative regions are mined based on the CAM and the decoder segmentation result. Finally, a knowledge distillation loss is calculated between the original-scale features and the scaled features of the selected reliable discriminative regions. Experimental results show that our M-SEE framework achieves new state-of-the-art performance with 74.8% on the PASCAL VOC 2012 test split and 45.8% on the MS COCO 2014 validation split. Code will be released.
KW - End-to-end framework
KW - Image-level labels
KW - Knowledge distillation
KW - Weakly Supervised Semantic Segmentation
UR - http://www.scopus.com/inward/record.url?scp=85216231570&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2025.111348
DO - 10.1016/j.patcog.2025.111348
M3 - Article
AN - SCOPUS:85216231570
SN - 0031-3203
VL - 162
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 111348
ER -