TY - GEN
T1 - Adversarial Erasing Transformer for Weakly Supervised Semantic Segmentation
AU - Zhang, Bingfeng
AU - Yu, Siyue
AU - Gao, Xuru
AU - Sun, Mingjie
AU - Lim, Eng Gee
AU - Xiao, Jimin
N1 - Publisher Copyright:
© 2024 The Authors.
PY - 2024/10/16
Y1 - 2024/10/16
N2 - Weakly supervised semantic segmentation has attracted a lot of attention recently. Previous methods can be divided into two types: single-stage training and multi-stage training. In this paper, we focus on multi-stage training for image-level weakly supervised semantic segmentation. Many recent methods have adopted a transformer architecture as the backbone for CAM generation, since it can capture global relationships to refine CAM accurately. However, we observe that such a backbone still fails to generate complete and smooth CAM. We argue that this is because the attention mechanism in the transformer can only attend to the most discriminative relationships, making it difficult to capture semantic-level long-range pair-wise relationships under image-level supervision. Thus, we propose an adversarial erasing transformer network called AETN, where an erasing attention mechanism is designed to establish more extensive pair-wise relationships. To cope with erasing, more target features are forced to activate, so better feature representation can be obtained for more accurate CAM generation. Besides, to further help our network learn better feature representation, we propose a self-consistent learning mechanism based on different augmentations. In this way, our AETN outperforms recent methods, achieving 73.0 mIoU on the PASCAL VOC 2012 val set and 73.9 mIoU on the PASCAL VOC 2012 test set. Code is available at https://github.com/siyueyu/AETN.
AB - Weakly supervised semantic segmentation has attracted a lot of attention recently. Previous methods can be divided into two types: single-stage training and multi-stage training. In this paper, we focus on multi-stage training for image-level weakly supervised semantic segmentation. Many recent methods have adopted a transformer architecture as the backbone for CAM generation, since it can capture global relationships to refine CAM accurately. However, we observe that such a backbone still fails to generate complete and smooth CAM. We argue that this is because the attention mechanism in the transformer can only attend to the most discriminative relationships, making it difficult to capture semantic-level long-range pair-wise relationships under image-level supervision. Thus, we propose an adversarial erasing transformer network called AETN, where an erasing attention mechanism is designed to establish more extensive pair-wise relationships. To cope with erasing, more target features are forced to activate, so better feature representation can be obtained for more accurate CAM generation. Besides, to further help our network learn better feature representation, we propose a self-consistent learning mechanism based on different augmentations. In this way, our AETN outperforms recent methods, achieving 73.0 mIoU on the PASCAL VOC 2012 val set and 73.9 mIoU on the PASCAL VOC 2012 test set. Code is available at https://github.com/siyueyu/AETN.
UR - http://www.scopus.com/inward/record.url?scp=85213322735&partnerID=8YFLogxK
U2 - 10.3233/FAIA240510
DO - 10.3233/FAIA240510
M3 - Conference Proceeding
AN - SCOPUS:85213322735
T3 - Frontiers in Artificial Intelligence and Applications
SP - 370
EP - 377
BT - ECAI 2024 - 27th European Conference on Artificial Intelligence, Including 13th Conference on Prestigious Applications of Intelligent Systems, PAIS 2024, Proceedings
A2 - Endriss, Ulle
A2 - Melo, Francisco S.
A2 - Bach, Kerstin
A2 - Bugarin-Diz, Alberto
A2 - Alonso-Moral, Jose M.
A2 - Barro, Senen
A2 - Heintz, Fredrik
PB - IOS Press BV
T2 - 27th European Conference on Artificial Intelligence, ECAI 2024
Y2 - 19 October 2024 through 24 October 2024
ER -