CRA: Text to Image Retrieval for Architecture Images by Chinese CLIP

Siyuan Wang, Yuyao Yan, Xi Yang*, Kaizhu Huang

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review

4 Citations (Scopus)

Abstract

Text-to-image retrieval has been revolutionized since the Contrastive Language-Image Pre-training (CLIP) model was proposed. Most existing methods learn a latent representation of text and then align its embedding with the corresponding image’s embedding from an image encoder. Recently, several Chinese CLIP models have provided good representations of paired image-text sets. However, adapting a pre-trained retrieval model to a professional domain remains a challenge, mainly due to the large domain gap between professional and general text-image sets. In this paper, we introduce a novel contrastive tuning model, named CRA, which uses Chinese texts to retrieve architecture-related images by fine-tuning the pre-trained Chinese CLIP. Instead of fine-tuning the whole CLIP model, we employ the Locked-image Text tuning (LiT) strategy to adapt to the architecture-terminology sets by tuning the text encoder and
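
The text-image alignment objective mentioned in the abstract can be illustrated with a minimal sketch of a symmetric contrastive loss. This is not the authors' implementation: the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions made for illustration. The image embeddings are passed in as plain precomputed vectors, which loosely mirrors the locked-image idea of leaving the image tower unchanged during tuning.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Hypothetical helper: CLIP-style loss over a batch of paired embeddings.

    text_embs[i] and image_embs[i] are assumed to describe the same item.
    image_embs are treated as fixed inputs (assumption: locked image tower).
    """
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    # logits[i][j] = scaled cosine similarity between text i and image j
    logits = [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in images]
        for t in texts
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    text_to_image = diagonal_cross_entropy(logits)
    image_to_text = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (text_to_image + image_to_text)


# Toy batch: two text embeddings and two image embeddings (made-up values).
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each text embedding points in the same direction as its paired image embedding, and large when the pairs are swapped. In a real tuning setup the text embeddings would come from a trainable encoder and the loss would be minimized with a gradient-based optimizer.
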
Original language: English
Pages: 29-34
Publication status: Published - 24 Mar 2023
Event: 7th International Conference on Machine Vision and Information Technology (CMVIT) - Xiamen, China
Duration: 24 Mar 2023 to 26 Mar 2023

Conference

Conference: 7th International Conference on Machine Vision and Information Technology (CMVIT)
Country/Territory: China
City: Xiamen
Period: 24/03/23 to 26/03/23

Keywords

  • Text-to-image retrieval
  • Chinese CLIP
  • Contrastive learning
