Towards Accurate Alignment and Sufficient Context in Scene Text Recognition

Yijie Hu; Bin Dong; Qiufeng Wang; Lei Ding; Xiaobo Jin; Kaizhu Huang

doi:10.1007/978-3-031-30111-7_59

Corresponding author: Qiufeng Wang

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

1 Citation (Scopus)

Abstract

The encoder-decoder framework has recently become the cutting-edge approach in scene text recognition (STR), where most decoder networks consist of two parts: an attention model that aligns visual features from the encoder with each character, and a linear or LSTM-based model that predicts the label sequence. However, it is difficult for these attention models to obtain accurate alignment, and a linear or LSTM model usually captures limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable-attention-based model to accurately align the visual features of each character. In this alignment model, we explicitly learn the spatial coordinate information of each character from the input reading-order sequence and optimize it with learnable sampling offsets in the attention block to obtain accurately aligned features. To address the lack of context, we explore a transformer-based decoder that captures global context through multi-head attention, with a mask matrix integrated to keep the attention weights focused on the relevant context during decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR: it outperforms other language-free methods with significant improvements on most benchmark STR datasets, and obtains state-of-the-art performance on several datasets when a language model is integrated.

Original language	English
Title of host publication	International Conference on Neural Information Processing (ICONIP), 2023
Editors	Mohammad Tanveer, Sonali Agarwal, Seiichi Ozawa, Asif Ekbal, Adam Jatowt
Pages	705-717
Number of pages	13
DOIs	https://doi.org/10.1007/978-3-031-30111-7_59
Publication status	Published - 2023

Keywords

Character alignment
Deformable attention
Mask attention
Scene text recognition

Cite this

@inproceedings{f198feaacec94ce3bbc395f271f8aee5,
  title = "Towards Accurate Alignment and Sufficient Context in Scene Text Recognition",
  abstract = "Encoder-decoder framework has recently become cutting-edge in scene text recognition (STR), where most decoder networks consist of two parts: an attention model to align visual features from the encoder for each character, and a linear or LSTM-based model to predict label sequence. However, it is difficult for these attention models to obtain accurate alignment, and linear or LSTM model usually captures limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable attention based model to accurately align visual features of each character. In this alignment model, we explicitly learn the spatial coordinate information of each character from the input reading order sequence and optimize it with learnable sampled offsets in the attention block to obtain accurate aligned features. To address the lack of context, we explore transformer-based decoder to capture global context by multi-head attention, where a mask matrix is integrated to keep attention weights focused on the relevant context during the decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR, achieving better performance than other language free methods with significant improvement on most benchmark STR datasets, and obtain the state-of-the-art performance on several datasets by integrating a language model.",
  keywords = "Character alignment, Deformable attention, Mask attention, Scene text recognition",
  author = "Yijie Hu and Bin Dong and Qiufeng Wang and Lei Ding and Xiaobo Jin and Kaizhu Huang",
  note = "Publisher Copyright: {\textcopyright} 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.",
  year = "2023",
  doi = "10.1007/978-3-031-30111-7_59",
  language = "English",
  isbn = "9783031301100",
  pages = "705--717",
  editor = "Mohammad Tanveer and Sonali Agarwal and Seiichi Ozawa and Asif Ekbal and Adam Jatowt",
  booktitle = "International Conference on Neural Information Processing (ICONIP), 2023",
}

Towards Accurate Alignment and Sufficient Context in Scene Text Recognition. / Hu, Yijie; Dong, Bin; Wang, Qiufeng et al.
International Conference on Neural Information Processing (ICONIP), 2023. ed. / Mohammad Tanveer; Sonali Agarwal; Seiichi Ozawa; Asif Ekbal; Adam Jatowt. 2023. p. 705-717.

TY  - GEN
T1  - Towards Accurate Alignment and Sufficient Context in Scene Text Recognition
AU  - Hu, Yijie
AU  - Dong, Bin
AU  - Wang, Qiufeng
AU  - Ding, Lei
AU  - Jin, Xiaobo
AU  - Huang, Kaizhu
PY  - 2023
Y1  - 2023
N2  - Encoder-decoder framework has recently become cutting-edge in scene text recognition (STR), where most decoder networks consist of two parts: an attention model to align visual features from the encoder for each character, and a linear or LSTM-based model to predict label sequence. However, it is difficult for these attention models to obtain accurate alignment, and linear or LSTM model usually captures limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable attention based model to accurately align visual features of each character. In this alignment model, we explicitly learn the spatial coordinate information of each character from the input reading order sequence and optimize it with learnable sampled offsets in the attention block to obtain accurate aligned features. To address the lack of context, we explore transformer-based decoder to capture global context by multi-head attention, where a mask matrix is integrated to keep attention weights focused on the relevant context during the decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR, achieving better performance than other language free methods with significant improvement on most benchmark STR datasets, and obtain the state-of-the-art performance on several datasets by integrating a language model.
AB  - Encoder-decoder framework has recently become cutting-edge in scene text recognition (STR), where most decoder networks consist of two parts: an attention model to align visual features from the encoder for each character, and a linear or LSTM-based model to predict label sequence. However, it is difficult for these attention models to obtain accurate alignment, and linear or LSTM model usually captures limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable attention based model to accurately align visual features of each character. In this alignment model, we explicitly learn the spatial coordinate information of each character from the input reading order sequence and optimize it with learnable sampled offsets in the attention block to obtain accurate aligned features. To address the lack of context, we explore transformer-based decoder to capture global context by multi-head attention, where a mask matrix is integrated to keep attention weights focused on the relevant context during the decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR, achieving better performance than other language free methods with significant improvement on most benchmark STR datasets, and obtain the state-of-the-art performance on several datasets by integrating a language model.
KW  - Character alignment
KW  - Deformable attention
KW  - Mask attention
KW  - Scene text recognition
UR  - http://www.scopus.com/inward/record.url?scp=85161713315&partnerID=8YFLogxK
U2  - 10.1007/978-3-031-30111-7_59
DO  - 10.1007/978-3-031-30111-7_59
M3  - Conference Proceeding
AN  - SCOPUS:85161713315
SN  - 9783031301100
SP  - 705
EP  - 717
BT  - International Conference on Neural Information Processing (ICONIP), 2023
A2  - Tanveer, Mohammad
A2  - Agarwal, Sonali
A2  - Ozawa, Seiichi
A2  - Ekbal, Asif
A2  - Jatowt, Adam
ER  - 
