TY - JOUR
T1 - Auxiliary captioning
T2 - Bridging image–text matching and image captioning
AU - Li, Hui
AU - Xiao, Jimin
AU - Sun, Mingjie
AU - Lim, Eng Gee
AU - Zhao, Yao
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2025/10
Y1 - 2025/10
N2 - The image–text matching task, where a query image (text) is provided to retrieve its corresponding text (image) from a gallery, has drawn increasing attention in recent years. Conventional methods attempt to map the image and text directly into a single latent aligned feature space for matching. Achieving ideal feature alignment is difficult because the salient content of the image is not highlighted. To overcome this limitation, we propose an auxiliary captioning step that enhances the image feature by fusing it with the text feature of the captioning output. In this way, the captioning output feature, which shares a similar feature-space distribution with the candidate texts, provides high-level semantic information that helps locate the salient content of an image. To optimize the auxiliary captioning output, we introduce a new metric, Caption-to-Text (C2T), which measures the retrieval performance between the auxiliary captioning output and the ground-truth matching texts. By using the C2T score as a reward in our reinforcement learning framework for image captioning, the captioning model generates sentences that are better suited to the auxiliary image–text matching task. Extensive experiments on MSCOCO and Flickr30k demonstrate the superiority of our method, which achieves absolute improvements of 5.7% (R@1) on Flickr30k and 3.2% (R@1) on MSCOCO over baseline approaches, outperforming state-of-the-art models without complex architectural modifications.
AB - The image–text matching task, where a query image (text) is provided to retrieve its corresponding text (image) from a gallery, has drawn increasing attention in recent years. Conventional methods attempt to map the image and text directly into a single latent aligned feature space for matching. Achieving ideal feature alignment is difficult because the salient content of the image is not highlighted. To overcome this limitation, we propose an auxiliary captioning step that enhances the image feature by fusing it with the text feature of the captioning output. In this way, the captioning output feature, which shares a similar feature-space distribution with the candidate texts, provides high-level semantic information that helps locate the salient content of an image. To optimize the auxiliary captioning output, we introduce a new metric, Caption-to-Text (C2T), which measures the retrieval performance between the auxiliary captioning output and the ground-truth matching texts. By using the C2T score as a reward in our reinforcement learning framework for image captioning, the captioning model generates sentences that are better suited to the auxiliary image–text matching task. Extensive experiments on MSCOCO and Flickr30k demonstrate the superiority of our method, which achieves absolute improvements of 5.7% (R@1) on Flickr30k and 3.2% (R@1) on MSCOCO over baseline approaches, outperforming state-of-the-art models without complex architectural modifications.
KW - Image captioning
KW - Image–text matching
KW - Reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=105004379378&partnerID=8YFLogxK
U2 - 10.1016/j.image.2025.117337
DO - 10.1016/j.image.2025.117337
M3 - Article
AN - SCOPUS:105004379378
SN - 0923-5965
VL - 138
JO - Signal Processing: Image Communication
JF - Signal Processing: Image Communication
M1 - 117337
ER -