Towards Accurate Alignment and Sufficient Context in Scene Text Recognition

Yijie Hu, Bin Dong, Qiufeng Wang*, Lei Ding, Xiaobo Jin, Kaizhu Huang

*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review



The encoder-decoder framework has recently become the cutting-edge approach to scene text recognition (STR). Most decoder networks consist of two parts: an attention model that aligns the encoder's visual features with each character, and a linear or LSTM-based model that predicts the label sequence. However, these attention models struggle to obtain accurate alignment, and linear or LSTM models usually capture only limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable-attention-based model to accurately align the visual features of each character. In this alignment model, we explicitly learn the spatial coordinates of each character from the input reading-order sequence and refine them with learnable sampling offsets in the attention block to obtain accurately aligned features. To address the lack of context, we explore a transformer-based decoder that captures global context through multi-head attention, with a mask matrix integrated to keep the attention weights focused on the relevant context during decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR: it outperforms other language-free methods, with significant improvements on most benchmark STR datasets, and obtains state-of-the-art performance on several datasets when a language model is integrated.
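The two mechanisms named in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The abstract does not specify the architecture, so every function name, the bilinear sampling scheme, the fixed offsets and weights (learned in a real model), and the boolean form of the mask are assumptions made only for illustration. The first part samples a feature map at a reference point shifted by offsets, which is the general idea behind deformable attention. The second part applies a mask inside scaled dot-product attention so that disallowed positions receive zero weight.

```python
import math


def bilinear_sample(fmap, x, y):
    """Sample a 2D grid fmap[row][col] at continuous (x, y), clamped to the border."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def deformable_sample(fmap, ref, offsets, weights):
    """Weighted sum of samples taken at ref + offset.

    Assumption: in a trained model the offsets and weights would be predicted
    per character; here they are plain inputs.
    """
    rx, ry = ref
    return sum(
        wt * bilinear_sample(fmap, rx + ox, ry + oy)
        for (ox, oy), wt in zip(offsets, weights)
    )


def softmax(scores):
    """Numerically stable softmax; -inf scores map to exactly 0.0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(query, keys, values, mask):
    """Single-head scaled dot-product attention for one query vector.

    mask[j] is True when position j may be attended to (hypothetical convention).
    """
    if not any(mask):
        raise ValueError("mask must allow at least one position")
    scale = math.sqrt(len(query))
    scores = []
    for key, allowed in zip(keys, mask):
        score = sum(q * k for q, k in zip(query, key)) / scale
        scores.append(score if allowed else float("-inf"))
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(wt * v[i] for wt, v in zip(weights, values)) for i in range(dim)]
    return output, weights


if __name__ == "__main__":
    fmap = [[0.0, 1.0], [2.0, 3.0]]
    print(deformable_sample(fmap, (0.0, 0.0), [(0.0, 0.0), (1.0, 1.0)], [0.5, 0.5]))

    out, w = masked_attention(
        [1.0, 0.0],
        [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
        [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
        [True, True, False],
    )
    print(out, w)
```

In the second call, the third key would dominate without the mask. With the mask its weight is exactly zero, so the output mixes only the first two values.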

Original language: English
Title of host publication: Neural Information Processing - 29th International Conference, ICONIP 2022, Proceedings
Editors: Mohammad Tanveer, Sonali Agarwal, Seiichi Ozawa, Asif Ekbal, Adam Jatowt
Publisher: Springer Science and Business Media Deutschland GmbH
Number of pages: 13
ISBN (Print): 9783031301100
Publication status: Published - 2023
Event: 29th International Conference on Neural Information Processing, ICONIP 2022 - Virtual, Online
Duration: 22 Nov 2022 to 26 Nov 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13625 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 29th International Conference on Neural Information Processing, ICONIP 2022
City: Virtual, Online


  • Character alignment
  • Deformable attention
  • Mask attention
  • Scene text recognition


