Abstract
Text-to-image retrieval has been revolutionized since the Contrastive Language-Image Pre-training (CLIP) model was proposed. Most existing methods learn a latent representation of text and align its embedding with the corresponding image's embedding from an image encoder. Recently, several Chinese CLIP models have provided good representations of paired image-text data. However, adapting a pre-trained retrieval model to a professional domain remains a challenge, mainly due to the large domain gap between professional and general text-image sets. In this paper, we introduce a novel contrastive tuning model, named CRA, which uses Chinese texts to retrieve architecture-related images by fine-tuning a pre-trained Chinese CLIP. Instead of fine-tuning the whole CLIP model, we adopt the Locked-image Text tuning (LiT) strategy to adapt to architecture-terminology sets by tuning the text encoder and
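The LiT strategy described above keeps the image encoder frozen while only the text encoder receives gradient updates from a symmetric contrastive (InfoNCE) loss. The sketch below illustrates this idea with minimal stand-in encoders; the `Encoder` class, dimensions, and temperature value are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical minimal encoder standing in for a CLIP tower
# (the real model uses a ViT image tower and a transformer text tower).
class Encoder(nn.Module):
    def __init__(self, in_dim, embed_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        # L2-normalize so that dot products are cosine similarities
        return F.normalize(self.proj(x), dim=-1)

def lit_contrastive_loss(image_enc, text_enc, images, texts, temperature=0.07):
    """Symmetric InfoNCE loss with the image tower locked (LiT-style)."""
    with torch.no_grad():          # locked image tower: no gradients flow here
        img_emb = image_enc(images)
    txt_emb = text_enc(texts)      # only the text encoder is tuned
    logits = txt_emb @ img_emb.t() / temperature
    labels = torch.arange(logits.size(0))
    # contrast in both text-to-image and image-to-text directions
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Toy batch of 4 paired image/text feature vectors
image_enc = Encoder(512, 128)
text_enc = Encoder(256, 128)
for p in image_enc.parameters():
    p.requires_grad_(False)        # freeze the image tower entirely

loss = lit_contrastive_loss(image_enc, text_enc,
                            torch.randn(4, 512), torch.randn(4, 256))
loss.backward()                    # gradients reach only the text encoder
```

After `backward()`, only the text encoder's parameters carry gradients, which is what makes LiT cheaper than full fine-tuning while still adapting the text side to domain terminology.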
| Original language | English |
| --- | --- |
| Pages | 29-34 |
| Publication status | Published - 24 Mar 2023 |
| Event | 7th International Conference on Machine Vision and Information Technology (CMVIT), Xiamen, China, 24 Mar 2023 → 26 Mar 2023 |
Conference

| Conference | 7th International Conference on Machine Vision and Information Technology (CMVIT) |
| --- | --- |
| Country/Territory | China |
| City | Xiamen |
| Period | 24/03/23 → 26/03/23 |
Keywords
- Text-to-image retrieval
- Chinese CLIP
- Contrastive learning