Auxiliary captioning: Bridging image–text matching and image captioning

Hui Li, Jimin Xiao*, Mingjie Sun, Eng Gee Lim, Yao Zhao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The image–text matching task, where a query image (text) is provided to retrieve its corresponding text (image) from a gallery, has drawn increasing attention recently. Conventional methods try to map the image and text directly into one latent aligned feature space for matching. Achieving an ideal feature alignment is difficult because the significant content of the image is not highlighted. To overcome this limitation, we propose an auxiliary captioning step to enhance the image feature, where the image feature is fused with the text feature of the captioning output. In this way, the captioning output feature, which shares a similar space distribution with the candidate texts, provides high-level semantic information that helps locate the significant content in an image. To optimize the auxiliary captioning output, we introduce a new metric, Caption-to-Text (C2T), which measures the retrieval performance between the auxiliary captioning output and the ground-truth matching texts. By integrating the C2T score as a reward in our image captioning reinforcement learning framework, the captioning model can generate sentences that are better suited to the auxiliary image–text matching. Extensive experiments on MSCOCO and Flickr30k demonstrate our method's superiority: it achieves absolute improvements of 5.7% (R@1) on Flickr30k and 3.2% (R@1) on MSCOCO over baseline approaches, outperforming state-of-the-art models without complex architectural modifications.
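The abstract outlines two ideas: fusing the image feature with the text feature of the auxiliary caption, and using a Caption-to-Text (C2T) retrieval score as a reinforcement-learning reward for the captioner. The Python sketch below is only an illustration of these two ideas under assumptions not stated in the abstract: placeholder random features stand in for the image, caption, and text encoders, a simple gated sum is assumed for the fusion, cosine similarity between the caption and the ground-truth text stands in for the C2T retrieval score, and a baseline-subtracted REINFORCE-style update stands in for the paper's reinforcement learning framework. It is not the authors' implementation.

# Minimal sketch of auxiliary-caption fusion and a C2T-style RL reward.
# All encoders, dimensions, and the exact fusion/reward forms are assumptions.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

dim, batch = 256, 4

# Placeholder features; in practice these would come from an image encoder,
# a text encoder, and the captioning model's generated sentence.
image_feat = torch.randn(batch, dim)      # encoded query images
caption_feat = torch.randn(batch, dim)    # encoded auxiliary captions
gt_text_feat = torch.randn(batch, dim)    # encoded ground-truth matching texts

# (1) Fuse the image feature with the caption text feature (gated sum assumed).
fuse_gate = torch.nn.Linear(2 * dim, dim)
gate = torch.sigmoid(fuse_gate(torch.cat([image_feat, caption_feat], dim=-1)))
fused_feat = gate * image_feat + (1 - gate) * caption_feat

# Image-text matching: cosine similarity between fused image features and the
# candidate text gallery (here the ground-truth texts act as the gallery).
fused_feat = F.normalize(fused_feat, dim=-1)
gallery = F.normalize(gt_text_feat, dim=-1)
match_scores = fused_feat @ gallery.t()   # (batch, batch) similarity matrix

# (2) C2T-style reward: how well the auxiliary caption retrieves its
# ground-truth matching text; plain cosine similarity is used as a stand-in.
caption_norm = F.normalize(caption_feat, dim=-1)
c2t_reward = (caption_norm * gallery).sum(dim=-1)   # one reward per sample

# REINFORCE-style captioning loss: sampled-caption log-probabilities
# (placeholder values) weighted by the baseline-subtracted reward.
sampled_logprobs = torch.randn(batch, requires_grad=True)  # sum of token log-probs
baseline = c2t_reward.mean()
rl_loss = -((c2t_reward - baseline).detach() * sampled_logprobs).mean()
rl_loss.backward()

print("matching scores:\n", match_scores)
print("C2T rewards:", c2t_reward)

In this sketch the reward gradient only flows through the caption log-probabilities, mirroring the standard policy-gradient recipe for captioning; the fusion branch would be trained separately with a matching loss over match_scores.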

Original language: English
Article number: 117337
Journal: Signal Processing: Image Communication
Volume: 138
DOIs
Publication status: Published - Oct 2025

Keywords

  • Image captioning
  • Image-text matching
  • Reinforcement learning
