CRA: Text to Image Retrieval for Architecture Images by Chinese CLIP

  • Siyuan Wang*
  • Yuyao Yan
  • Xi Yang
  • Kaizhu Huang
  • *Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

6 Citations (Scopus)

Abstract

Text-to-image retrieval has been revolutionized since the Contrastive Language-Image Pre-training (CLIP) model was proposed. Most existing methods learn a latent representation of text and then align its embedding with the corresponding image's embedding from an image encoder. Recently, several Chinese CLIP models have provided good representations of paired image-text sets. However, adapting a pre-trained retrieval model to a professional domain remains a challenge, mainly due to the large domain gap between professional and general text-image sets. In this paper, we introduce a novel contrastive tuning model, named CRA, which uses Chinese texts to retrieve architecture-related images by fine-tuning the pre-trained Chinese CLIP. Instead of fine-tuning the whole CLIP model, we adopt the Locked-image Text tuning (LiT) strategy to adapt to architecture-terminology sets by tuning the text encoder and freezing the pre-trained large-scale image encoder. We further propose a text-image dataset of architectural design. On the text-to-image retrieval task, we improve R@20 on the test set from 44.92% with the original Chinese CLIP model to 74.61% with our CRA model.
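The R@20 figure above is a Recall@K metric. As a rough illustration only, the sketch below shows one common way such a score can be computed for text-to-image retrieval. It is not the authors' evaluation code: the function name `recall_at_k`, the toy embeddings, and the assumption of exactly one ground-truth image per text query are all hypothetical choices made for this example. Similarity is a plain dot product, which equals cosine similarity when the embeddings are L2-normalized.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recall_at_k(
    text_embs: Sequence[Sequence[float]],
    image_embs: Sequence[Sequence[float]],
    gt_image_idx: Sequence[int],
    k: int,
) -> float:
    """Fraction of text queries whose ground-truth image ranks in the top k.

    Assumption (not from the paper): each query has exactly one correct
    image, given by its index in `image_embs`.
    """
    hits = 0
    for query, target in zip(text_embs, gt_image_idx):
        # Score every candidate image against this text query.
        scores = [dot(query, img) for img in image_embs]
        # Rank image indices from most to least similar.
        ranked = sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy data: 4 candidate images and 3 text queries in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
texts = [[0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
targets = [0, 1, 2]  # index of the correct image for each query

print("R@1:", recall_at_k(texts, images, targets, k=1))
print("R@2:", recall_at_k(texts, images, targets, k=2))
```

In this toy setup the third query ranks its correct image second, so it counts as a miss at K=1 and a hit at K=2. Under the LiT setup described in the abstract, the image embeddings would come from the frozen image encoder and could be computed once and reused, while only the text embeddings change during tuning.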

Original language: English
Title of host publication: Proceedings - 2023 7th International Conference on Machine Vision and Information Technology, CMVIT 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 29-34
Number of pages: 6
ISBN (Electronic): 9781665464857
Publication status: Published - 24 Mar 2023
Event: 7th International Conference on Machine Vision and Information Technology (CMVIT) - Xiamen, China
Duration: 24 Mar 2023 to 26 Mar 2023

Publication series

Name: International Conference on Machine Vision and Information Technology, CMVIT

Conference

Conference: 7th International Conference on Machine Vision and Information Technology (CMVIT)
Country/Territory: China
City: Xiamen
Period: 24/03/23 to 26/03/23

Keywords

  • Text-to-image retrieval
  • Chinese CLIP
  • Contrastive learning
