Abstract
Image-text matching aims to bridge vision and language by matching instances of one modality with corresponding instances of the other. Recent years have seen considerable progress in this area through local alignment between image regions and sentence words. However, open questions remain about how to learn modality-invariant feature embeddings and how to effectively exploit hard negatives in the training set to infer more accurate matching scores. In this paper, we introduce a new approach called Image-Text Modality Contrastive Learning (abbreviated as ITContrast) for image-text matching. Our method addresses these challenges by leveraging a pre-trained vision-language model, OSCAR, which is first fine-tuned to obtain visual and textual features. We also introduce a hard negative synthesis module that capitalizes on the difficulty of negative samples: it profiles the negatives within a mini-batch and generates representative embeddings reflecting their hardness with respect to the anchor sample. A novel cost function is designed to comprehensively integrate the information from positives, negatives, and synthesized hard negatives. Extensive experiments on the MS COCO and Flickr30K datasets demonstrate that our approach is effective for image-text matching.
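To make the idea of in-batch hard negative synthesis concrete, the sketch below shows one plausible InfoNCE-style formulation: negatives within a mini-batch are weighted by their similarity to the anchor image and mixed into a single synthesized hard-negative embedding, which is then appended to the contrastive logits. The function name, the softmax-weighted synthesis rule, and the `tau`/`alpha` hyperparameters are illustrative assumptions, not the paper's exact formulation or the fine-tuned OSCAR pipeline.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_synth_negatives(img_emb, txt_emb, tau=0.07, alpha=1.0):
    """Sketch of an image-text contrastive loss with a synthesized hard negative.

    img_emb, txt_emb: (B, D) embeddings for matched pairs (row i of img_emb is
    the positive for row i of txt_emb). tau: temperature; alpha: weight on the
    synthesized hard-negative term. Details here are assumptions for illustration.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)

    # Cosine similarities between every image and every caption in the batch.
    sim = img @ txt.t()                                 # (B, B); diagonal = positives
    B = sim.size(0)
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)

    # Profile in-batch negatives: weight each negative caption by its closeness
    # to the anchor image, then mix them into one hard-negative embedding.
    neg_logits = sim.masked_fill(mask, float('-inf'))
    neg_weights = F.softmax(neg_logits / tau, dim=1)    # zero weight on the positive
    synth_txt = F.normalize(neg_weights @ txt, dim=-1)  # (B, D) synthesized negatives
    synth_sim = (img * synth_txt).sum(dim=-1)           # (B,)

    # InfoNCE over [positive | in-batch negatives | synthesized hard negative].
    logits = torch.cat([sim, alpha * synth_sim.unsqueeze(1)], dim=1) / tau
    targets = torch.arange(B, device=sim.device)        # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Example usage with random features standing in for fine-tuned visual/textual embeddings.
if __name__ == "__main__":
    img_feat, txt_feat = torch.randn(8, 256), torch.randn(8, 256)
    print(contrastive_loss_with_synth_negatives(img_feat, txt_feat).item())
```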
| Original language | English |
| --- | --- |
| Journal | Visual Computer |
| DOIs | |
| Publication status | Accepted/In press - 2024 |
Keywords
- Contrastive learning
- Data retrieval
- Multimodal deep learning