Multi-Keys Attention Network for Image Captioning

Ziqian Yang, Hui Li, Renrong Ouyang*, Quan Zhang*, Jimin Xiao

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

The image captioning task aims to generate descriptions of the main content of images. Recently, the Transformer with a self-attention mechanism has been widely used for image captioning, where the attention mechanism helps the encoder generate image region features and guides caption output in the decoder. However, the vanilla decoder uses a simple conventional self-attention mechanism, resulting in captions with poor semantic information and incomplete sentence logic. In this paper, we propose a novel attention block, the Multi-Keys attention block, that strengthens the relevance between explicit and implicit semantic information. Technically, the Multi-Keys attention block first concatenates the key vector and the value vector and feeds the result into both an explicit channel and an implicit channel. Then, a “related value” carrying more semantic information is generated by element-wise multiplication of the two channels. Moreover, to improve sentence logic, a reverse key vector carrying another information flow is residually connected to the final attention result. We also apply the Multi-Keys attention block to the sentence decoder of the Transformer, yielding the Multi-Keys Transformer (MKTrans). The experiments demonstrate that MKTrans achieves a CIDEr score of 138.6% on the MS COCO “Karpathy” offline test split, and that the proposed Multi-Keys attention block and MKTrans model are more effective than state-of-the-art methods.
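The steps described in the abstract can be loosely illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no equations, so the projection matrices (`w_explicit`, `w_implicit`, `w_reverse`), the sigmoid gate on the implicit channel, the scaled dot-product scoring, and above all the reading of the "reverse key" as keys in reversed sequence order are assumptions made only for illustration. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def init_params(d, rng):
    """Random projection matrices (hypothetical names and shapes)."""
    scale = 1.0 / np.sqrt(d)
    return {
        "w_explicit": rng.normal(0.0, scale, size=(2 * d, d)),
        "w_implicit": rng.normal(0.0, scale, size=(2 * d, d)),
        "w_reverse": rng.normal(0.0, scale, size=(d, d)),
    }


def multi_keys_attention_sketch(q, k, v, params):
    """Illustrative attention step; q is (n_q, d), k and v are (n_k, d)."""
    d = q.shape[-1]

    # Concatenate key and value, then send the result down two channels.
    kv = np.concatenate([k, v], axis=-1)                   # (n_k, 2d)
    explicit = kv @ params["w_explicit"]                   # (n_k, d)
    # Assumption: the implicit channel acts as a sigmoid gate.
    implicit = 1.0 / (1.0 + np.exp(-(kv @ params["w_implicit"])))

    # "Related value": element-wise product of the two channels.
    related_value = explicit * implicit                    # (n_k, d)

    # Assumption: standard scaled dot-product attention weights.
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)       # (n_q, n_k)
    attended = weights @ related_value                     # (n_q, d)

    # Assumption: "reverse key" = keys in reversed sequence order,
    # passed through a separate projection and added residually.
    reverse_key = k[::-1] @ params["w_reverse"]            # (n_k, d)
    return attended + weights @ reverse_key                # (n_q, d)


rng = np.random.default_rng(0)
params = init_params(8, rng)
queries = rng.normal(size=(3, 8))
keys = rng.normal(size=(5, 8))
values = rng.normal(size=(5, 8))
output = multi_keys_attention_sketch(queries, keys, values, params)
print(output.shape)
```

The sketch keeps the output the same shape as the queries, so such a block could in principle stand in for a conventional attention layer in a decoder; how the published model actually wires the channels and the reverse key is defined in the paper, not here.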

Original language: English
Pages (from-to): 1061-1072
Number of pages: 12
Journal: Cognitive Computation
Volume: 16
Issue number: 3
Publication status: Published - 2024

Keywords

  • Attention mechanism
  • Computer vision
  • Image captioning
  • Transformer
