Multi-Keys Attention Network for Image Captioning

Ziqian Yang, Hui Li, Renrong Ouyang*, Quan Zhang*, Jimin Xiao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The image captioning task aims to generate descriptions of the main content of images. Recently, the Transformer with a self-attention mechanism has been widely used for image captioning, where the attention mechanism helps the encoder generate image region features and guides the caption output in the decoder. However, the vanilla decoder uses a simple conventional self-attention mechanism, resulting in captions with poor semantic information and incomplete sentence logic. In this paper, we propose a novel attention block, the Multi-Keys attention block, which fully enhances the relevance between explicit and implicit semantic information. Technically, the Multi-Keys attention block first concatenates the key vector and the value vector and spreads the result into an explicit channel and an implicit channel. Then, a “related value” carrying richer semantic information is generated by element-wise multiplication of the two channels. Moreover, to improve sentence logic, a reverse key vector carrying another information flow is residually connected to the final attention result. We further apply the Multi-Keys attention block in the sentence decoder of the Transformer, yielding the Multi-Keys Transformer (MKTrans). Experiments demonstrate that our MKTrans achieves a 138.6% CIDEr score on the MS COCO “Karpathy” offline test split. The proposed Multi-Keys attention block and MKTrans model are shown to be more effective than state-of-the-art methods.
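To make the described computation concrete, the following is a minimal PyTorch sketch of how such a Multi-Keys attention block could be wired up, based solely on the abstract's description. The layer shapes, the construction of the "reverse key" (here, the key sequence flipped along its length), and all names (MultiKeysAttention, explicit, implicit, rev_proj) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiKeysAttention(nn.Module):
    """Sketch of a Multi-Keys attention block, assembled from the
    abstract's description; shapes and the reverse-key construction
    are assumptions, not the published code."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Two projections that "spread" the concatenated [key; value]
        # into an explicit channel and an implicit channel (assumed linear).
        self.explicit = nn.Linear(2 * d_model, d_model)
        self.implicit = nn.Linear(2 * d_model, d_model)
        # Assumed: a separate projection for the reverse-key flow.
        self.rev_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, query, key, value):
        q = self.q_proj(query)  # (B, Lq, d)
        k = self.k_proj(key)    # (B, Lk, d)
        v = self.v_proj(value)  # (B, Lk, d)

        # 1) Concatenate key and value, spread into two channels,
        #    and take their element-wise product as the "related value".
        kv = torch.cat([k, v], dim=-1)                     # (B, Lk, 2d)
        related_v = self.explicit(kv) * self.implicit(kv)  # (B, Lk, d)

        # 2) Standard scaled dot-product attention over the related value.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ related_v                             # (B, Lq, d)

        # 3) Assumed reverse-key flow: the key sequence flipped along
        #    its length, projected, and residually added to the result.
        rev_k = self.rev_proj(torch.flip(k, dims=[1]))     # (B, Lk, d)
        return out + attn @ rev_k                          # (B, Lq, d)

# Hypothetical usage: decoder word features attending to region features.
block = MultiKeysAttention(d_model=512)
words = torch.randn(2, 10, 512)    # (batch, caption length, d)
regions = torch.randn(2, 49, 512)  # (batch, image regions, d)
out = block(words, regions, regions)  # (2, 10, 512)
```

Under these assumptions, the block drops into a Transformer decoder layer wherever conventional cross-attention would sit, with the element-wise product and reverse-key residual as the only departures from vanilla attention.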

Original language: English
Journal: Cognitive Computation
Publication status: Published - 2024

Keywords

  • Attention mechanism
  • Computer vision
  • Image captioning
  • Transformer
