TY - JOUR
T1 - Boosting remote semantic segmentation using vision-and-language foundation model
AU - Zhang, Qiuyue
AU - Zhang, Zhiwang
AU - Wen, Shiting
AU - Pang, Chaoyi
AU - Wu, Fangyu
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2025.
PY - 2025/7
Y1 - 2025/7
N2 - In recent years, visual analysis and processing of remote sensing images have become increasingly popular. Vision-language foundation models (such as RemoteCLIP) embed rich prior knowledge from a large number of remote sensing images through extensive pre-training. Although these models perform well in image-level tasks, their prior knowledge has not been fully utilized in pixel-level segmentation tasks. To address this issue, we propose a lightweight fusion framework named Remote Foundation Model for Segmentation (RFM-Seg). This framework trains V-branch connectors, VL-branch connectors, and the VL-map module while freezing both the foundation model and the remote sensing segmentation model. These modules effectively integrate multi-scale and multi-modal prior knowledge from remote sensing images into mainstream remote sensing segmentation models, thereby enhancing the model’s performance in pixel-level segmentation tasks. We validated the effectiveness of this framework on four challenging aerial image segmentation benchmark datasets: ISPRS Vaihingen, ISPRS Potsdam, Aerial, and LoveDA Urban. Experimental results demonstrate that RFM-Seg achieves state-of-the-art performance while maintaining highly efficient training and inference. The source code will be released at https://github.com/NBTAILAB/RFM-Seg.
AB - In recent years, visual analysis and processing of remote sensing images have become increasingly popular. Vision-language foundation models (such as RemoteCLIP) embed rich prior knowledge from a large number of remote sensing images through extensive pre-training. Although these models perform well in image-level tasks, their prior knowledge has not been fully utilized in pixel-level segmentation tasks. To address this issue, we propose a lightweight fusion framework named Remote Foundation Model for Segmentation (RFM-Seg). This framework trains V-branch connectors, VL-branch connectors, and the VL-map module while freezing both the foundation model and the remote sensing segmentation model. These modules effectively integrate multi-scale and multi-modal prior knowledge from remote sensing images into mainstream remote sensing segmentation models, thereby enhancing the model’s performance in pixel-level segmentation tasks. We validated the effectiveness of this framework on four challenging aerial image segmentation benchmark datasets: ISPRS Vaihingen, ISPRS Potsdam, Aerial, and LoveDA Urban. Experimental results demonstrate that RFM-Seg achieves state-of-the-art performance while maintaining highly efficient training and inference. The source code will be released at https://github.com/NBTAILAB/RFM-Seg.
KW - Deep learning
KW - Foundation model
KW - Image processing and analysis
KW - Prior knowledge
KW - Semantic segmentation
UR - https://www.scopus.com/pages/publications/105009055682
U2 - 10.1007/s00371-025-03968-9
DO - 10.1007/s00371-025-03968-9
M3 - Review article
AN - SCOPUS:105009055682
SN - 0178-2789
VL - 41
SP - 6687
EP - 6700
JO - Visual Computer
JF - Visual Computer
IS - 9
ER -