TY - JOUR
T1 - Frequency-Aware Multi-Modal Fine-Tuning for Few-Shot Open-Set Remote Sensing Scene Classification
AU - Zhang, Junjie
AU - Rao, Yutao
AU - Huang, Xiaoshui
AU - Li, Guanyi
AU - Zhou, Xin
AU - Zeng, Dan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Few-shot open-set recognition, as a new paradigm, leverages a limited amount of supervised data to identify specific Remote Sensing (RS) scene categories and to generalize to novel ones. However, the data bias induced by the small sample size not only causes severe overfitting within base classes but also impairs the ability to identify RS scenes from hitherto unobserved categories at inference. Furthermore, owing to environmental influences, RS images frequently exhibit notable intra-class disparities and comparatively low inter-class distinctions, intensifying the challenge of obtaining suitable classifiers. To address the above issues, we investigate the use of a Multi-modal Foundational Model (MFM) infused with essential domain knowledge to mitigate the generalization limitations encountered in few-shot scenarios. Recognizing that existing MFMs with a visual-text dual-branch structure are primarily tailored to natural scenes, we propose a custom Frequency Distribution-based Multi-modal Fine-Tuning strategy (FreqDiMFT) in a parameter-efficient manner. More specifically, within the vision branch, we address the high inter-class similarity and intra-class diversity of RS images by embedding local-global frequency distribution information to facilitate the recognition of RS scenes. To further amplify the model's generalization ability after transfer, we introduce an adaptive feature refinement module designed for Transformers that is proficient at filtering out redundant features arising from domain disparities. To mitigate domain drift in the textual branch, we adopt an input format that combines basic templates with domain expertise from the RS field to generate more discriminative class prototypes. To fully verify the effectiveness of our FreqDiMFT in a more practical setting, we collect a Large-Scale hybrid dataset (LSRS). Extensive experiments demonstrate that, even with a scant number of training samples, our strategy achieves advanced performance compared with state-of-the-art models.
AB - Few-shot open-set recognition, as a new paradigm, leverages a limited amount of supervised data to identify specific Remote Sensing (RS) scene categories and to generalize to novel ones. However, the data bias induced by the small sample size not only causes severe overfitting within base classes but also impairs the ability to identify RS scenes from hitherto unobserved categories at inference. Furthermore, owing to environmental influences, RS images frequently exhibit notable intra-class disparities and comparatively low inter-class distinctions, intensifying the challenge of obtaining suitable classifiers. To address the above issues, we investigate the use of a Multi-modal Foundational Model (MFM) infused with essential domain knowledge to mitigate the generalization limitations encountered in few-shot scenarios. Recognizing that existing MFMs with a visual-text dual-branch structure are primarily tailored to natural scenes, we propose a custom Frequency Distribution-based Multi-modal Fine-Tuning strategy (FreqDiMFT) in a parameter-efficient manner. More specifically, within the vision branch, we address the high inter-class similarity and intra-class diversity of RS images by embedding local-global frequency distribution information to facilitate the recognition of RS scenes. To further amplify the model's generalization ability after transfer, we introduce an adaptive feature refinement module designed for Transformers that is proficient at filtering out redundant features arising from domain disparities. To mitigate domain drift in the textual branch, we adopt an input format that combines basic templates with domain expertise from the RS field to generate more discriminative class prototypes. To fully verify the effectiveness of our FreqDiMFT in a more practical setting, we collect a Large-Scale hybrid dataset (LSRS). Extensive experiments demonstrate that, even with a scant number of training samples, our strategy achieves advanced performance compared with state-of-the-art models.
KW - few-shot open-set recognition
KW - multi-modal foundation model
KW - parameter-efficient transfer learning
KW - RS scene classification
UR - http://www.scopus.com/inward/record.url?scp=85187395341&partnerID=8YFLogxK
U2 - 10.1109/TMM.2024.3372416
DO - 10.1109/TMM.2024.3372416
M3 - Article
AN - SCOPUS:85187395341
SN - 1520-9210
VL - 26
SP - 7823
EP - 7837
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -