TY - GEN
T1 - DisFaceRep
T2 - 33rd ACM International Conference on Multimedia, MM 2025
AU - Wang, Xiaoqin
AU - Hou, Xianxu
AU - Ding, Meidan
AU - Chen, Junliang
AU - Deng, Kaijun
AU - Xie, Jinheng
AU - Shen, Linlin
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
N2 - Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to guide component separation using language supervision implicitly. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.
AB - Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to guide component separation using language supervision implicitly. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.
KW - face parsing
KW - visual language models
KW - weakly supervised learning
UR - https://www.scopus.com/pages/publications/105024071970
U2 - 10.1145/3746027.3755285
DO - 10.1145/3746027.3755285
M3 - Conference Proceeding
AN - SCOPUS:105024071970
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
SP - 4020
EP - 4029
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
PB - Association for Computing Machinery, Inc
Y2 - 27 October 2025 through 31 October 2025
ER -