TY - GEN
T1 - Towards Accurate Alignment and Sufficient Context in Scene Text Recognition
AU - Hu, Yijie
AU - Dong, Bin
AU - Wang, Qiufeng
AU - Ding, Lei
AU - Jin, Xiaobo
AU - Huang, Kaizhu
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - The encoder-decoder framework has recently become cutting-edge in scene text recognition (STR), where most decoder networks consist of two parts: an attention model that aligns visual features from the encoder for each character, and a linear or LSTM-based model that predicts the label sequence. However, it is difficult for these attention models to obtain accurate alignment, and linear or LSTM models usually capture only limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable-attention-based model to accurately align the visual features of each character. In this alignment model, we explicitly learn the spatial coordinate information of each character from the input reading-order sequence and optimize it with learnable sampled offsets in the attention block to obtain accurately aligned features. To address the lack of context, we explore a transformer-based decoder that captures global context via multi-head attention, where a mask matrix is integrated to keep the attention weights focused on the relevant context during decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR: it outperforms other language-free methods with significant improvements on most benchmark STR datasets, and achieves state-of-the-art performance on several datasets when integrated with a language model.
AB - The encoder-decoder framework has recently become cutting-edge in scene text recognition (STR), where most decoder networks consist of two parts: an attention model that aligns visual features from the encoder for each character, and a linear or LSTM-based model that predicts the label sequence. However, it is difficult for these attention models to obtain accurate alignment, and linear or LSTM models usually capture only limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable-attention-based model to accurately align the visual features of each character. In this alignment model, we explicitly learn the spatial coordinate information of each character from the input reading-order sequence and optimize it with learnable sampled offsets in the attention block to obtain accurately aligned features. To address the lack of context, we explore a transformer-based decoder that captures global context via multi-head attention, where a mask matrix is integrated to keep the attention weights focused on the relevant context during decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR: it outperforms other language-free methods with significant improvements on most benchmark STR datasets, and achieves state-of-the-art performance on several datasets when integrated with a language model.
KW - Character alignment
KW - Deformable attention
KW - Mask attention
KW - Scene text recognition
UR - http://www.scopus.com/inward/record.url?scp=85161713315&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-30111-7_59
DO - 10.1007/978-3-031-30111-7_59
M3 - Conference Proceeding
AN - SCOPUS:85161713315
SN - 9783031301100
SP - 705
EP - 717
BT - International Conference on Neural Information Processing (ICONIP), 2023
A2 - Tanveer, Mohammad
A2 - Agarwal, Sonali
A2 - Ozawa, Seiichi
A2 - Ekbal, Asif
A2 - Jatowt, Adam
ER -