TY - GEN
T1 - CRA: Text to Image Retrieval for Architecture Images by Chinese CLIP
AU - Wang, Siyuan
AU - Yan, Yuyao
AU - Yang, Xi
AU - Huang, Kaizhu
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023/3/24
Y1 - 2023/3/24
N2 - Text-to-image retrieval has been revolutionized since the Contrastive Language-Image Pre-Training (CLIP) model was proposed. Most existing methods learn a latent representation of text and then align its embedding with the corresponding image's embedding from an image encoder. Recently, several Chinese CLIP models have provided a good representation of paired image-text sets. However, adapting the pre-trained retrieval model to a professional domain remains a challenge, mainly due to the large domain gap between professional and general text-image sets. In this paper, we introduce a novel contrastive tuning model, named CRA, which uses Chinese texts to retrieve architecture-related images by fine-tuning the pre-trained Chinese CLIP. Instead of fine-tuning the whole CLIP model, we adopt the Locked-image Text tuning (LiT) strategy to adapt to the architecture-terminology sets, tuning the text encoder while freezing the pre-trained large-scale image encoder. We further propose a text-image dataset of architectural design. On the text-to-image retrieval task, we improve the R@20 metric on the test set from 44.92% with the original Chinese CLIP model to 74.61% with our CRA model.
AB - Text-to-image retrieval has been revolutionized since the Contrastive Language-Image Pre-Training (CLIP) model was proposed. Most existing methods learn a latent representation of text and then align its embedding with the corresponding image's embedding from an image encoder. Recently, several Chinese CLIP models have provided a good representation of paired image-text sets. However, adapting the pre-trained retrieval model to a professional domain remains a challenge, mainly due to the large domain gap between professional and general text-image sets. In this paper, we introduce a novel contrastive tuning model, named CRA, which uses Chinese texts to retrieve architecture-related images by fine-tuning the pre-trained Chinese CLIP. Instead of fine-tuning the whole CLIP model, we adopt the Locked-image Text tuning (LiT) strategy to adapt to the architecture-terminology sets, tuning the text encoder while freezing the pre-trained large-scale image encoder. We further propose a text-image dataset of architectural design. On the text-to-image retrieval task, we improve the R@20 metric on the test set from 44.92% with the original Chinese CLIP model to 74.61% with our CRA model.
KW - Text-to-image retrieval
KW - Chinese CLIP
KW - Contrastive learning
UR - https://www.scopus.com/pages/publications/85164831371
U2 - 10.1109/CMVIT57620.2023.00015
DO - 10.1109/CMVIT57620.2023.00015
M3 - Conference Proceeding
AN - SCOPUS:85164831371
T3 - International Conference on Machine Vision and Information Technology, CMVIT
SP - 29
EP - 34
BT - Proceedings - 2023 7th International Conference on Machine Vision and Information Technology, CMVIT 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 7th International Conference on Machine Vision and Information Technology (CMVIT)
Y2 - 24 March 2023 through 26 March 2023
ER -