Auxiliary captioning: Bridging image–text matching and image captioning

Hui Li, Jimin Xiao*, Mingjie Sun, Eng Gee Lim, Yao Zhao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The image–text matching task, where a query image (text) is provided to retrieve its corresponding text (image) from a gallery, has drawn increasing attention recently. Conventional methods try to map the image and text directly into one latent aligned feature space for matching. Achieving an ideal feature alignment is difficult because the significant content of the image is not highlighted. To overcome this limitation, we propose an auxiliary captioning step to enhance the image feature, where the image feature is fused with the text feature of the captioning output. In this way, the captioning output feature, which shares a similar space distribution with the candidate texts, provides high-level semantic information that helps locate the significant content in an image. To optimize the auxiliary captioning output, we introduce a new metric, Caption-to-Text (C2T), which measures the retrieval performance between the auxiliary captioning output and the ground-truth matching texts. By integrating the C2T score as a reward in our image captioning reinforcement learning framework, the captioning model can generate sentences that are better suited to the auxiliary image–text matching. Extensive experiments on MSCOCO and Flickr30k demonstrate our method's superiority: it achieves absolute improvements of 5.7% (R@1) on Flickr30k and 3.2% (R@1) on MSCOCO over baseline approaches, outperforming state-of-the-art models without complex architectural modifications.
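The abstract outlines two ideas: fusing the image feature with the text feature of the auxiliary caption, and using a Caption-to-Text (C2T) retrieval score as a reinforcement-learning reward for the captioner. The Python sketch below is only an illustration of these two ideas under assumptions not stated in the abstract: placeholder random features stand in for the image, caption, and text encoders, a simple gated sum is assumed for the fusion, cosine similarity between the caption and the ground-truth text stands in for the C2T retrieval score, and a baseline-subtracted REINFORCE-style update stands in for the paper's reinforcement learning framework. It is not the authors' implementation.

# Minimal sketch of auxiliary-caption fusion and a C2T-style RL reward.
# All encoders, dimensions, and the exact fusion/reward forms are assumptions.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

dim, batch = 256, 4

# Placeholder features; in practice these would come from an image encoder,
# a text encoder, and the captioning model's generated sentence.
image_feat = torch.randn(batch, dim)      # encoded query images
caption_feat = torch.randn(batch, dim)    # encoded auxiliary captions
gt_text_feat = torch.randn(batch, dim)    # encoded ground-truth matching texts

# (1) Fuse the image feature with the caption text feature (gated sum assumed).
fuse_gate = torch.nn.Linear(2 * dim, dim)
gate = torch.sigmoid(fuse_gate(torch.cat([image_feat, caption_feat], dim=-1)))
fused_feat = gate * image_feat + (1 - gate) * caption_feat

# Image-text matching: cosine similarity between fused image features and the
# candidate text gallery (here the ground-truth texts act as the gallery).
fused_feat = F.normalize(fused_feat, dim=-1)
gallery = F.normalize(gt_text_feat, dim=-1)
match_scores = fused_feat @ gallery.t()   # (batch, batch) similarity matrix

# (2) C2T-style reward: how well the auxiliary caption retrieves its
# ground-truth matching text; plain cosine similarity is used as a stand-in.
caption_norm = F.normalize(caption_feat, dim=-1)
c2t_reward = (caption_norm * gallery).sum(dim=-1)   # one reward per sample

# REINFORCE-style captioning loss: sampled-caption log-probabilities
# (placeholder values) weighted by the baseline-subtracted reward.
sampled_logprobs = torch.randn(batch, requires_grad=True)  # sum of token log-probs
baseline = c2t_reward.mean()
rl_loss = -((c2t_reward - baseline).detach() * sampled_logprobs).mean()
rl_loss.backward()

print("matching scores:\n", match_scores)
print("C2T rewards:", c2t_reward)

In this sketch the reward gradient only flows through the caption log-probabilities, mirroring the standard policy-gradient recipe for captioning; the fusion branch would be trained separately with a matching loss over match_scores.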

Original language: English
Article number: 117337
Journal: Signal Processing: Image Communication
Volume: 138
DOIs
Publication status: Published - Oct 2025

Keywords

  • Image captioning
  • Image-text matching
  • Reinforcement learning
