Towards Accurate Alignment and Sufficient Context in Scene Text Recognition

Yijie Hu, Bin Dong, Qiufeng Wang*, Lei Ding, Xiaobo Jin, Kaizhu Huang

*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

1 Citation (Scopus)

Abstract

The encoder-decoder framework has recently become the cutting-edge approach in scene text recognition (STR), where most decoder networks consist of two parts: an attention model that aligns visual features from the encoder with each character, and a linear or LSTM-based model that predicts the label sequence. However, it is difficult for these attention models to obtain accurate alignment, and a linear or LSTM model usually captures only limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable-attention-based model to accurately align the visual features of each character. In this alignment model, we explicitly learn the spatial coordinate information of each character from the input reading-order sequence and optimize it with learnable sampled offsets in the attention block to obtain accurately aligned features. To address the lack of context, we explore a transformer-based decoder that captures global context through multi-head attention, where a mask matrix is integrated to keep attention weights focused on the relevant context during decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR: it outperforms other language-free methods by a significant margin on most benchmark STR datasets, and obtains state-of-the-art performance on several datasets when a language model is integrated.
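
As a rough illustration of the masked attention idea mentioned in the abstract, the sketch below applies a boolean mask matrix to raw attention scores before a row-wise softmax, so that masked positions receive zero weight. The abstract does not specify the form of the mask; the lower-triangular (causal) mask used here, along with the function names `masked_attention_weights` and `causal_mask`, is an assumption made purely for illustration and is not the authors' implementation.

```python
import math


def masked_attention_weights(scores, mask):
    """Row-wise softmax over `scores`, ignoring positions where `mask` is False.

    scores: list of rows of raw attention scores (one row per query).
    mask:   list of rows of booleans with the same shape as `scores`.
    """
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        kept = [s for s, m in zip(row_scores, row_mask) if m]
        if not kept:
            raise ValueError("each query must attend to at least one position")
        peak = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def causal_mask(n):
    """Assumed mask shape (not from the paper): position i sees positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy scores for three decoding positions.
scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
weights = masked_attention_weights(scores, causal_mask(3))
```

In this toy case the first position can only attend to itself, and the large score at the masked third position of the second row has no effect on the resulting weights. Other mask layouts would plug into the same function unchanged.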

Original language: English
Title of host publication: International Conference on Neural Information Processing (ICONIP), 2023
Editors: Mohammad Tanveer, Sonali Agarwal, Seiichi Ozawa, Asif Ekbal, Adam Jatowt
Pages: 705-717
Number of pages: 13
Publication status: Published - 2023

Keywords

  • Character alignment
  • Deformable attention
  • Mask attention
  • Scene text recognition
