ITContrast: contrastive learning with hard negative synthesis for image-text matching

Fangyu Wu*, Qiufeng Wang, Zhao Wang, Siyue Yu, Yushi Li, Bailing Zhang, Eng Gee Lim

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Image-text matching aims to bridge vision and language by matching an instance of one modality with the corresponding instance of the other. Recent years have seen considerable progress in this area through the exploration of local alignment between image regions and sentence words. However, open questions remain about how to learn modality-invariant feature embeddings and how to use hard negatives in the training set effectively to infer more accurate matching scores. In this paper, we introduce a new approach called Image-Text Modality Contrastive Learning (abbreviated as ITContrast) for image-text matching. Our method addresses these challenges by leveraging a pre-trained vision-language model, OSCAR, which is first fine-tuned to obtain visual and textual features. We also introduce a hard negative synthesis module, which capitalizes on the difficulty of negative samples. This module profiles the negative samples within a mini-batch and generates representative embeddings that reflect their hardness in relation to the anchor sample. A novel cost function is designed to integrate the information from positives, negatives and synthesized hard negatives. Extensive experiments on the MS COCO and Flickr30K datasets demonstrate that our approach is effective for image-text matching.
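The abstract does not give the formulation of the synthesis module or the cost function, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes (hypothetically) that hardness is measured by cosine similarity to the anchor, that one synthetic negative per image anchor is a softmax-weighted mixture of the in-batch negative text embeddings, and that the loss is a hinge (triplet-style) term applied to both the hardest real negative and the synthetic one. The names `hardness_weights`, `contrastive_loss_with_synthetic`, `margin` and `temperature` are assumptions made for this example; only the image-to-text direction is shown.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def hardness_weights(sim, temperature=0.1):
    """Softmax over each anchor's negatives (off-diagonal entries of sim).

    Assumption: a negative is 'harder' the more similar it is to the anchor.
    Requires a batch of at least two pairs.
    """
    batch = sim.shape[0]
    negatives = ~np.eye(batch, dtype=bool)
    logits = np.where(negatives, sim / temperature, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)  # positives get exp(-inf) = 0
    return weights / weights.sum(axis=1, keepdims=True)


def contrastive_loss_with_synthetic(img, txt, margin=0.2, temperature=0.1):
    """Hinge loss over positives, hardest real negatives and synthetic negatives.

    img, txt: arrays of shape (batch, dim); row i of each forms a matching pair.
    """
    img = l2_normalize(img)
    txt = l2_normalize(txt)
    sim = img @ txt.T                      # (batch, batch) cosine similarities
    pos = np.diag(sim)                     # similarity of each matching pair

    # Synthesize one negative text embedding per image anchor as a
    # hardness-weighted mixture of the in-batch negative texts.
    weights = hardness_weights(sim, temperature)
    synthetic = l2_normalize(weights @ txt)
    synthetic_sim = np.sum(img * synthetic, axis=1)

    # Hardest real negative for each anchor.
    negatives = ~np.eye(sim.shape[0], dtype=bool)
    hardest = np.where(negatives, sim, -np.inf).max(axis=1)

    loss_real = np.maximum(0.0, margin + hardest - pos)
    loss_synthetic = np.maximum(0.0, margin + synthetic_sim - pos)
    return float((loss_real + loss_synthetic).mean())
```

In this sketch, a lower `temperature` concentrates the mixture on the few most similar negatives, while a higher one approaches a plain average of all negatives in the mini-batch. A symmetric text-to-image term could be added by calling the same function with the arguments swapped.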

Original language: English
Journal: Visual Computer
Publication status: Accepted/In press - 2024


  • Contrastive learning
  • Data retrieval
  • Multimodal deep learning

