BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling

Jing Su; Qingyun Dai; Frank Guerin; Mian Zhou

doi:10.1016/j.csl.2020.101169

Corresponding author: Mian Zhou

Research output: Contribution to journal › Article › peer-review

31 Citations (Scopus)

Abstract

Visual storytelling is a creative and challenging task, aiming to automatically generate a story-like description for a sequence of images. The descriptions generated by previous visual storytelling approaches lack coherence because they use word-level sequence generation methods and do not adequately consider sentence-level dependencies. To tackle this problem, we propose a novel hierarchical visual storytelling framework which separately models sentence-level and word-level semantics. We use the transformer-based BERT to obtain embeddings for sentences and words. We then employ a hierarchical LSTM network: the bottom LSTM receives as input the sentence vector representation from BERT, to learn the dependencies between the sentences corresponding to images, and the top LSTM is responsible for generating the corresponding word vector representations, taking input from the bottom LSTM. Experimental results demonstrate that our model outperforms most closely related baselines under automatic evaluation metrics BLEU and CIDEr, and also show the effectiveness of our method with human evaluation.
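The two-level arrangement described in the abstract can be illustrated with a small, untrained sketch. It is not the authors' implementation: the abstract does not specify how the top LSTM is conditioned, how word vectors are decoded into words, or how image features enter the model. The sketch assumes random vectors as stand-ins for BERT sentence embeddings, feeds the bottom LSTM's hidden state to the top LSTM at every word step, and generates a fixed number of word vectors per sentence. All names (`LSTMCell`, `generate_story_vectors`, the dimensions) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Minimal LSTM cell on plain Python lists (illustrative, untrained)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifoc"}
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def _affine(self, gate, z):
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(self.W[gate], self.b[gate])]

    def step(self, x, h, c):
        z = x + h  # concatenate input and previous hidden state
        i = [sigmoid(v) for v in self._affine("i", z)]
        f = [sigmoid(v) for v in self._affine("f", z)]
        o = [sigmoid(v) for v in self._affine("o", z)]
        cand = [math.tanh(v) for v in self._affine("c", z)]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def generate_story_vectors(sentence_vecs, words_per_sentence, sent_lstm, word_lstm):
    """Bottom LSTM tracks sentence-level context; top LSTM emits word vectors."""
    h_s = [0.0] * sent_lstm.hidden_size
    c_s = [0.0] * sent_lstm.hidden_size
    story = []
    for sent_vec in sentence_vecs:
        # Bottom level: carry state across the sentences of the story.
        h_s, c_s = sent_lstm.step(sent_vec, h_s, c_s)
        # Top level: restart per sentence, conditioned on the bottom state
        # (an assumption; the abstract only says it takes input from the bottom LSTM).
        h_w = [0.0] * word_lstm.hidden_size
        c_w = [0.0] * word_lstm.hidden_size
        words = []
        for _ in range(words_per_sentence):
            h_w, c_w = word_lstm.step(h_s, h_w, c_w)
            words.append(h_w)
        story.append(words)
    return story


rng = random.Random(0)
SENT_DIM, HID, WORD_DIM = 8, 6, 5  # hypothetical sizes, far smaller than BERT's
sent_lstm = LSTMCell(SENT_DIM, HID, rng)
word_lstm = LSTMCell(HID, WORD_DIM, rng)
# Random stand-ins for the sentence embeddings of a five-image sequence.
sentence_vecs = [[rng.uniform(-1, 1) for _ in range(SENT_DIM)] for _ in range(5)]
story = generate_story_vectors(sentence_vecs, 4, sent_lstm, word_lstm)
print(len(story), len(story[0]), len(story[0][0]))  # 5 4 5
```

Because the bottom LSTM's state persists across sentences, the word vectors produced for one image depend on the sentences that came before it, which is the sentence-level dependency the abstract says word-level generators miss.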

Original language	English
Article number	101169
Journal	Computer Speech and Language
Volume	67
DOIs	https://doi.org/10.1016/j.csl.2020.101169
Publication status	Published - May 2021
Externally published	Yes

Keywords

BERT
Hierarchical LSTMs
Sentence vector
Visual storytelling

Access to Document

10.1016/j.csl.2020.101169

Cite this

@article{ecb3abf1c4e84fd8999f98f6fe247827,
  title = "BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling",
  abstract = "Visual storytelling is a creative and challenging task, aiming to automatically generate a story-like description for a sequence of images. The descriptions generated by previous visual storytelling approaches lack coherence because they use word-level sequence generation methods and do not adequately consider sentence-level dependencies. To tackle this problem, we propose a novel hierarchical visual storytelling framework which separately models sentence-level and word-level semantics. We use the transformer-based BERT to obtain embeddings for sentences and words. We then employ a hierarchical LSTM network: the bottom LSTM receives as input the sentence vector representation from BERT, to learn the dependencies between the sentences corresponding to images, and the top LSTM is responsible for generating the corresponding word vector representations, taking input from the bottom LSTM. Experimental results demonstrate that our model outperforms most closely related baselines under automatic evaluation metrics BLEU and CIDEr, and also show the effectiveness of our method with human evaluation.",
  keywords = "BERT, Hierarchical LSTMs, Sentence vector, Visual storytelling",
  author = "Jing Su and Qingyun Dai and Frank Guerin and Mian Zhou",
  note = "Publisher Copyright: {\textcopyright} 2020 Elsevier Ltd",
  year = "2021",
  month = may,
  doi = "10.1016/j.csl.2020.101169",
  language = "English",
  volume = "67",
  journal = "Computer Speech and Language",
  issn = "0885-2308",
}

TY  - JOUR
T1  - BERT-hLSTMs
T2  - BERT and hierarchical LSTMs for visual storytelling
AU  - Su, Jing
AU  - Dai, Qingyun
AU  - Guerin, Frank
AU  - Zhou, Mian
PY  - 2021/5
Y1  - 2021/5
N2  - Visual storytelling is a creative and challenging task, aiming to automatically generate a story-like description for a sequence of images. The descriptions generated by previous visual storytelling approaches lack coherence because they use word-level sequence generation methods and do not adequately consider sentence-level dependencies. To tackle this problem, we propose a novel hierarchical visual storytelling framework which separately models sentence-level and word-level semantics. We use the transformer-based BERT to obtain embeddings for sentences and words. We then employ a hierarchical LSTM network: the bottom LSTM receives as input the sentence vector representation from BERT, to learn the dependencies between the sentences corresponding to images, and the top LSTM is responsible for generating the corresponding word vector representations, taking input from the bottom LSTM. Experimental results demonstrate that our model outperforms most closely related baselines under automatic evaluation metrics BLEU and CIDEr, and also show the effectiveness of our method with human evaluation.
AB  - Visual storytelling is a creative and challenging task, aiming to automatically generate a story-like description for a sequence of images. The descriptions generated by previous visual storytelling approaches lack coherence because they use word-level sequence generation methods and do not adequately consider sentence-level dependencies. To tackle this problem, we propose a novel hierarchical visual storytelling framework which separately models sentence-level and word-level semantics. We use the transformer-based BERT to obtain embeddings for sentences and words. We then employ a hierarchical LSTM network: the bottom LSTM receives as input the sentence vector representation from BERT, to learn the dependencies between the sentences corresponding to images, and the top LSTM is responsible for generating the corresponding word vector representations, taking input from the bottom LSTM. Experimental results demonstrate that our model outperforms most closely related baselines under automatic evaluation metrics BLEU and CIDEr, and also show the effectiveness of our method with human evaluation.
KW  - BERT
KW  - Hierarchical LSTMs
KW  - Sentence vector
KW  - Visual storytelling
UR  - http://www.scopus.com/inward/record.url?scp=85097174541&partnerID=8YFLogxK
U2  - 10.1016/j.csl.2020.101169
DO  - 10.1016/j.csl.2020.101169
M3  - Article
AN  - SCOPUS:85097174541
SN  - 0885-2308
VL  - 67
JO  - Computer Speech and Language
JF  - Computer Speech and Language
M1  - 101169
ER  -
