Abstract
Image-text matching aims to bridge vision and language by matching instances of one modality with corresponding instances of the other. Recent years have seen considerable progress in this area through local alignment between image regions and sentence words. However, open questions remain about how to learn modality-invariant feature embeddings and how to effectively exploit hard negatives in the training set to infer more accurate matching scores. In this paper, we introduce a new approach called Image-Text Modality Contrastive Learning (abbreviated as ITContrast) for image-text matching. Our method addresses these challenges by leveraging a pre-trained vision-language model, OSCAR, which is first fine-tuned to obtain visual and textual features. We also introduce a hard negative synthesis module that capitalizes on the difficulty of negative samples: it profiles the negatives within a mini-batch and generates representative embeddings reflecting their hardness with respect to the anchor sample. A novel cost function is designed to comprehensively integrate the information from positives, negatives, and synthesized hard negatives. Extensive experiments on the MS COCO and Flickr30K datasets demonstrate that our approach is effective for image-text matching.
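To make the idea of in-batch hard negative synthesis concrete, the sketch below shows one plausible InfoNCE-style formulation: negatives within a mini-batch are weighted by their similarity to the anchor image and mixed into a single synthesized hard-negative embedding, which is then appended to the contrastive logits. The function name, the softmax-weighted synthesis rule, and the `tau`/`alpha` hyperparameters are illustrative assumptions, not the paper's exact formulation or the fine-tuned OSCAR pipeline.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_synth_negatives(img_emb, txt_emb, tau=0.07, alpha=1.0):
    """Sketch of an image-text contrastive loss with a synthesized hard negative.

    img_emb, txt_emb: (B, D) embeddings for matched pairs (row i of img_emb is
    the positive for row i of txt_emb). tau: temperature; alpha: weight on the
    synthesized hard-negative term. Details here are assumptions for illustration.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)

    # Cosine similarities between every image and every caption in the batch.
    sim = img @ txt.t()                                 # (B, B); diagonal = positives
    B = sim.size(0)
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)

    # Profile in-batch negatives: weight each negative caption by its closeness
    # to the anchor image, then mix them into one hard-negative embedding.
    neg_logits = sim.masked_fill(mask, float('-inf'))
    neg_weights = F.softmax(neg_logits / tau, dim=1)    # zero weight on the positive
    synth_txt = F.normalize(neg_weights @ txt, dim=-1)  # (B, D) synthesized negatives
    synth_sim = (img * synth_txt).sum(dim=-1)           # (B,)

    # InfoNCE over [positive | in-batch negatives | synthesized hard negative].
    logits = torch.cat([sim, alpha * synth_sim.unsqueeze(1)], dim=1) / tau
    targets = torch.arange(B, device=sim.device)        # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Example usage with random features standing in for fine-tuned visual/textual embeddings.
if __name__ == "__main__":
    img_feat, txt_feat = torch.randn(8, 256), torch.randn(8, 256)
    print(contrastive_loss_with_synth_negatives(img_feat, txt_feat).item())
```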
| Original language | English |
| --- | --- |
| Journal | Visual Computer |
| DOIs | |
| Publication status | Accepted/In press - 2024 |
Keywords
- Contrastive learning
- Data retrieval
- Multimodal deep learning