CRA: Text to Image Retrieval for Architecture Images by Chinese CLIP

Siyuan Wang, Yuyao Yan, Xi Yang*, Kaizhu Huang

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review

1 Citation (Scopus)

Abstract

Text-to-image retrieval has been revolutionized since the Contrastive Language-Image Pre-training (CLIP) model was proposed. Most existing methods learn a latent representation of text and then align its embedding with the corresponding image's embedding from an image encoder. Recently, several Chinese CLIP models have provided good representations of paired image-text sets. However, adapting a pre-trained retrieval model to a professional domain remains a challenge, mainly due to the large domain gap between professional and general text-image sets. In this paper, we introduce a novel contrastive tuning model, named CRA, which uses Chinese texts to retrieve architecture-related images by fine-tuning a pre-trained Chinese CLIP. Instead of fine-tuning the whole CLIP model, we adopt the Locked-image Text tuning (LiT) strategy to adapt to architecture-terminology sets by tuning the text encoder and locking the image encoder.
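The symmetric contrastive objective underlying CLIP-style tuning (including the LiT setup the abstract describes, where only the text encoder receives gradients) can be illustrated with a small stdlib-only sketch. This is not the authors' implementation; the function names and the toy embeddings are illustrative assumptions.

```python
import math

def normalise(v):
    # L2-normalise a vector so that dot products equal cosine similarity
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss used by CLIP-style models.

    img_embs, txt_embs: lists of L2-normalised vectors, matched by index
    (pair i is the correct image/text match). In a LiT setup, gradients
    from this loss would update only the text encoder.
    """
    n = len(img_embs)
    # cosine-similarity logits, scaled by the temperature
    logits = [[sum(a * b for a, b in zip(img_embs[i], txt_embs[j])) / temperature
               for j in range(n)] for i in range(n)]

    def xent(rows):
        # cross-entropy where the diagonal entry is the correct class
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / n

    cols = [list(c) for c in zip(*logits)]  # text-to-image direction
    return 0.5 * (xent(logits) + xent(cols))

# Toy check: correctly aligned pairs should score a lower loss than shuffled pairs.
imgs = [normalise([1.0, 0.0]), normalise([0.0, 1.0])]
txts_aligned = [normalise([0.9, 0.1]), normalise([0.1, 0.9])]
txts_shuffled = [txts_aligned[1], txts_aligned[0]]
print(clip_contrastive_loss(imgs, txts_aligned) < clip_contrastive_loss(imgs, txts_shuffled))
```

At retrieval time, the same similarity matrix is computed between a query text embedding and all candidate image embeddings, and images are ranked by cosine similarity.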
Original language: English
Pages: 29-34
Publication status: Published - 24 Mar 2023
Event: 7th International Conference on Machine Vision and Information Technology (CMVIT) - Xiamen, China
Duration: 24 Mar 2023 – 26 Mar 2023

Conference

Conference: 7th International Conference on Machine Vision and Information Technology (CMVIT)
Country/Territory: China
City: Xiamen
Period: 24/03/23 – 26/03/23

Keywords

  • Text-to-image retrieval
  • Chinese CLIP
  • Contrastive learning
