TY - JOUR
T1 - Image captioning via hierarchical attention mechanism and policy gradient optimization
AU - Yan, Shiyang
AU - Xie, Yuan
AU - Wu, Fangyu
AU - Smith, Jeremy S.
AU - Lu, Wenjin
AU - Zhang, Bailing
N1 - Publisher Copyright: © 2019
PY - 2020/2
Y1 - 2020/2
AB - Automatically generating a description of an image, i.e., image captioning, is an important and fundamental topic in artificial intelligence that bridges the gap between computer vision and natural language processing. Building on successful deep learning models, especially CNNs and Long Short-Term Memory networks (LSTMs) with attention mechanisms, we propose a hierarchical attention model that uses both global CNN features and local object features for more effective feature representation and reasoning in image captioning. A generative adversarial network (GAN), together with a reinforcement learning (RL) algorithm, is applied to address the exposure bias problem in RNN-based supervised training for language problems. In addition, by using the discriminator in the GAN framework to automatically measure the consistency between the generated caption and the image content, combined with RL optimization, we make the final generated sentences more accurate and natural. Comprehensive experiments show the improved performance of the hierarchical attention mechanism and the effectiveness of our RL-based optimization method. Our model achieves state-of-the-art results on several important metrics on the MSCOCO dataset, using only greedy inference.
KW - Generative adversarial network
KW - Hierarchical attention mechanism
KW - Image captioning
KW - Policy gradient
KW - Reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85073026284&partnerID=8YFLogxK
U2 - 10.1016/j.sigpro.2019.107329
DO - 10.1016/j.sigpro.2019.107329
M3 - Article
AN - SCOPUS:85073026284
SN - 0165-1684
VL - 167
JO - Signal Processing
JF - Signal Processing
M1 - 107329
ER -