Abstract
The encoder-decoder framework has recently become the leading approach in scene text recognition (STR). Most decoder networks in this framework consist of two parts: an attention model that aligns visual features from the encoder with each character, and a linear or LSTM-based model that predicts the label sequence. However, these attention models struggle to obtain accurate alignment, and linear or LSTM-based models usually capture only limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable-attention-based model that accurately aligns the visual features of each character. In this alignment model, we explicitly learn the spatial coordinates of each character from the input reading-order sequence and refine them with learnable sampling offsets in the attention block to obtain accurately aligned features. To address the lack of context, we explore a transformer-based decoder that captures global context through multi-head attention, where a mask matrix is integrated to keep the attention weights focused on the relevant context during decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR: it outperforms other language-free methods, with significant improvements on most benchmark STR datasets, and obtains state-of-the-art performance on several datasets when a language model is integrated.
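The role of a mask matrix in attention can be illustrated with a minimal sketch. The abstract does not specify how the mask is built or applied, so everything below is an assumption for illustration only: the function name `masked_attention`, the boolean mask convention (True means "may attend"), and the single-query, single-head form are hypothetical and are not taken from the paper.

```python
import math


def masked_attention(query, keys, values, mask):
    """Scaled dot-product attention for a single query vector.

    Illustrative assumption, not the paper's implementation:
    mask[j] is True where position j may be attended to. Masked
    positions get a score of -inf and therefore zero weight.
    At least one mask entry must be True.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        if keep else -math.inf
        for key, keep in zip(keys, mask)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) is 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(dim)
    ]
    return weights, output
```

In this sketch, a masked position contributes nothing to the output no matter how similar its key is to the query, which is one common way to keep attention weights restricted to a chosen context.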
| Original language | English |
|---|---|
| Title of host publication | International Conference on Neural Information Processing (ICONIP), 2023 |
| Editors | Mohammad Tanveer, Sonali Agarwal, Seiichi Ozawa, Asif Ekbal, Adam Jatowt |
| Pages | 705-717 |
| Number of pages | 13 |
| DOIs | |
| Publication status | Published - 2023 |
Keywords
- Character alignment
- Deformable attention
- Mask attention
- Scene text recognition
Fingerprint
Dive into the research topics of 'Towards Accurate Alignment and Sufficient Context in Scene Text Recognition'. Together they form a unique fingerprint.