TY - JOUR
T1 - Multi-Keys Attention Network for Image Captioning
AU - Yang, Ziqian
AU - Li, Hui
AU - Ouyang, Renrong
AU - Zhang, Quan
AU - Xiao, Jimin
N1 - Publisher Copyright:
© 2024, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2024
Y1 - 2024
AB - The image captioning task aims to generate descriptions of the main content of images. Recently, the Transformer with its self-attention mechanism has been widely used for image captioning, where attention helps the encoder generate image region features and guides caption generation in the decoder. However, the vanilla decoder uses a simple conventional self-attention mechanism, resulting in captions with poor semantic information and incomplete sentence logic. In this paper, we propose a novel attention block, the Multi-Keys attention block, which fully enhances the relevance between explicit and implicit semantic information. Technically, the Multi-Keys attention block first concatenates the key vector and the value vector and spreads the result into both an explicit channel and an implicit channel. A "related value" carrying richer semantic information is then generated by applying element-wise multiplication to the two channels. Moreover, to improve sentence logic, a reverse key vector carrying another information flow is residually connected to the final attention result. We apply the Multi-Keys attention block to the sentence decoder of the Transformer, yielding the Multi-Keys Transformer (MKTrans). Experiments demonstrate that MKTrans achieves a 138.6% CIDEr score on the MS COCO "Karpathy" offline test split. The proposed Multi-Keys attention block and MKTrans model are shown to be more effective than state-of-the-art methods. (A minimal code sketch of the Multi-Keys attention block, reconstructed from this description, follows the record.)
KW - Attention mechanism
KW - Computer vision
KW - Image captioning
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85183023665&partnerID=8YFLogxK
DO - 10.1007/s12559-023-10231-7
M3 - Article
AN - SCOPUS:85183023665
SN - 1866-9956
VL - 16
SP - 1061
EP - 1072
JO - Cognitive Computation
JF - Cognitive Computation
IS - 3
ER -
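The abstract describes the Multi-Keys attention block only at a high level: concatenate key and value, feed the result through an explicit channel and an implicit channel, take their element-wise product as a "related value", attend over it, and residually add a reverse-key information flow. The sketch below is a minimal PyTorch reconstruction of that description, not the authors' code: the class name MultiKeysAttention, the use of linear layers for the two channels and the reverse-key flow, the single-head formulation, and the omission of the decoder's causal mask are all assumptions made for illustration.

```python
# Minimal, illustrative sketch of a Multi-Keys-style attention block,
# reconstructed from the paper's abstract. Layer choices and names are
# assumptions; only the concat -> two channels -> element-wise product
# -> attention -> reverse-key residual structure comes from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiKeysAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Two parallel maps over the concatenated [key; value] pair:
        # an "explicit" channel and an "implicit" channel (assumed linear).
        self.explicit = nn.Linear(2 * d_model, d_model)
        self.implicit = nn.Linear(2 * d_model, d_model)
        # Assumed projection producing the "reverse key" information flow.
        self.reverse_k = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); causal masking omitted for brevity.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # 1) Concatenate key and value, spread into the two channels.
        kv = torch.cat([k, v], dim=-1)
        # 2) Element-wise product of the channels gives the "related value".
        related_v = self.explicit(kv) * self.implicit(kv)
        # 3) Scaled dot-product attention over the related values.
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ related_v
        # 4) Residually connect the reverse-key flow to the attention result.
        return out + self.reverse_k(k)

# Quick shape check on random token embeddings.
block = MultiKeysAttention(d_model=512)
tokens = torch.randn(2, 10, 512)
print(block(tokens).shape)  # torch.Size([2, 10, 512])
```

In the paper this block replaces the conventional self-attention in the Transformer's sentence decoder (yielding MKTrans); a faithful implementation would additionally need multi-head splitting and the decoder's causal mask, both omitted here to keep the structure of the block visible.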