Towards Accurate Alignment and Sufficient Context in Scene Text Recognition

Yijie Hu; Bin Dong; Qiufeng Wang; Lei Ding; Xiaobo Jin; Kaizhu Huang

doi:10.1007/978-3-031-30111-7_59

Yijie Hu, Bin Dong, Qiufeng Wang^*, Lei Ding, Xiaobo Jin, Kaizhu Huang

^*Corresponding author for this work

Department of Intelligent Science

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

1 Citation (Scopus)

Abstract

The encoder-decoder framework has recently become the cutting-edge approach in scene text recognition (STR), where most decoder networks consist of two parts: an attention model that aligns visual features from the encoder with each character, and a linear or LSTM-based model that predicts the label sequence. However, it is difficult for these attention models to obtain accurate alignment, and linear or LSTM models usually capture limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable-attention-based model to accurately align the visual features of each character. In this alignment model, we explicitly learn the spatial coordinate information of each character from the input reading-order sequence and optimize it with learnable sampled offsets in the attention block to obtain accurately aligned features. To address the lack of context, we explore a transformer-based decoder that captures global context through multi-head attention, where a mask matrix is integrated to keep the attention weights focused on the relevant context during decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR: it achieves better performance than other language-free methods, with significant improvements on most benchmark STR datasets, and obtains state-of-the-art performance on several datasets when a language model is integrated.
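
The two mechanisms named in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the abstract does not specify the form of the mask matrix or how offsets are predicted, so the band-shaped mask (`band_mask`), the scalar feature map, and the fixed offsets below are all assumptions chosen for illustration. The sketch only shows the general idea of (a) aggregating features sampled at a reference point plus offsets and (b) zeroing attention weights outside a mask.

```python
import math


def softmax(scores):
    """Numerically stable softmax; scores of -inf receive weight 0."""
    peak = max(scores)  # assumes each row has at least one unmasked position
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def band_mask(length, width):
    """Hypothetical mask: position i may attend to j only if |i - j| <= width."""
    return [[abs(i - j) <= width for j in range(length)] for i in range(length)]


def masked_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / scale if mask[i][j] else float("-inf")
            for j, k in enumerate(keys)
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append([
            sum(row[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ])
    return outputs, weights


def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a 2D scalar map at (x, y), clamped to the border."""
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def deformable_align(feature_map, reference, offsets, attn_weights):
    """Weighted sum of features sampled at reference + offset (offsets fixed here,
    learned in a real model)."""
    rx, ry = reference
    return sum(
        w * bilinear_sample(feature_map, rx + ox, ry + oy)
        for (ox, oy), w in zip(offsets, attn_weights)
    )


# Toy usage: a 2x2 feature map and four 2-dimensional character embeddings.
feature_map = [[0.0, 1.0], [2.0, 3.0]]
aligned = deformable_align(feature_map, (0.0, 0.0), [(0.5, 0.5), (1.0, 0.0)], [0.5, 0.5])

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
outputs, weights = masked_attention(tokens, tokens, tokens, band_mask(4, 1))
```

In this toy setup each character position attends only to itself and its immediate neighbours, so the weights for positions outside the band are exactly zero. A real model would use learned projections, multiple heads, and offsets predicted by the network.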

Original language	English
Title of host publication	Neural Information Processing - 29th International Conference, ICONIP 2022, Proceedings
Editors	Mohammad Tanveer, Sonali Agarwal, Seiichi Ozawa, Asif Ekbal, Adam Jatowt
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	705-717
Number of pages	13
ISBN (Print)	9783031301100
DOIs	https://doi.org/10.1007/978-3-031-30111-7_59
Publication status	Published - 2023
Event	29th International Conference on Neural Information Processing, ICONIP 2022 - Virtual, Online; Duration: 22 Nov 2022 → 26 Nov 2022

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	13625 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	29th International Conference on Neural Information Processing, ICONIP 2022
City	Virtual, Online
Period	22/11/22 → 26/11/22

Keywords

Character alignment
Deformable attention
Mask attention
Scene text recognition

Cite this

Hu, Y., Dong, B., Wang, Q., Ding, L., Jin, X., & Huang, K. (2023). Towards Accurate Alignment and Sufficient Context in Scene Text Recognition. In M. Tanveer, S. Agarwal, S. Ozawa, A. Ekbal, & A. Jatowt (Eds.), Neural Information Processing - 29th International Conference, ICONIP 2022, Proceedings (pp. 705-717). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13625 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-30111-7_59

Hu, Yijie ; Dong, Bin ; Wang, Qiufeng et al. / Towards Accurate Alignment and Sufficient Context in Scene Text Recognition. Neural Information Processing - 29th International Conference, ICONIP 2022, Proceedings. editor / Mohammad Tanveer ; Sonali Agarwal ; Seiichi Ozawa ; Asif Ekbal ; Adam Jatowt. Springer Science and Business Media Deutschland GmbH, 2023. pp. 705-717 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{f198feaacec94ce3bbc395f271f8aee5,

title = "Towards Accurate Alignment and Sufficient Context in Scene Text Recognition",

abstract = "The encoder-decoder framework has recently become the cutting-edge approach in scene text recognition (STR), where most decoder networks consist of two parts: an attention model that aligns visual features from the encoder with each character, and a linear or LSTM-based model that predicts the label sequence. However, it is difficult for these attention models to obtain accurate alignment, and linear or LSTM models usually capture limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable-attention-based model to accurately align the visual features of each character. In this alignment model, we explicitly learn the spatial coordinate information of each character from the input reading-order sequence and optimize it with learnable sampled offsets in the attention block to obtain accurately aligned features. To address the lack of context, we explore a transformer-based decoder that captures global context through multi-head attention, where a mask matrix is integrated to keep the attention weights focused on the relevant context during decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR: it achieves better performance than other language-free methods, with significant improvements on most benchmark STR datasets, and obtains state-of-the-art performance on several datasets when a language model is integrated.",

keywords = "Character alignment, Deformable attention, Mask attention, Scene text recognition",

author = "Yijie Hu and Bin Dong and Qiufeng Wang and Lei Ding and Xiaobo Jin and Kaizhu Huang",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.; 29th International Conference on Neural Information Processing, ICONIP 2022 ; Conference date: 22-11-2022 Through 26-11-2022",

year = "2023",

doi = "10.1007/978-3-031-30111-7_59",

language = "English",

isbn = "9783031301100",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "705--717",

editor = "Mohammad Tanveer and Sonali Agarwal and Seiichi Ozawa and Asif Ekbal and Adam Jatowt",

booktitle = "Neural Information Processing - 29th International Conference, ICONIP 2022, Proceedings",

}

Hu, Y, Dong, B, Wang, Q, Ding, L, Jin, X & Huang, K 2023, Towards Accurate Alignment and Sufficient Context in Scene Text Recognition. in M Tanveer, S Agarwal, S Ozawa, A Ekbal & A Jatowt (eds), Neural Information Processing - 29th International Conference, ICONIP 2022, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13625 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 705-717, 29th International Conference on Neural Information Processing, ICONIP 2022, Virtual, Online, 22/11/22. https://doi.org/10.1007/978-3-031-30111-7_59

Towards Accurate Alignment and Sufficient Context in Scene Text Recognition. / Hu, Yijie; Dong, Bin; Wang, Qiufeng et al.
Neural Information Processing - 29th International Conference, ICONIP 2022, Proceedings. ed. / Mohammad Tanveer; Sonali Agarwal; Seiichi Ozawa; Asif Ekbal; Adam Jatowt. Springer Science and Business Media Deutschland GmbH, 2023. p. 705-717 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13625 LNCS).

TY - GEN

T1 - Towards Accurate Alignment and Sufficient Context in Scene Text Recognition

AU - Hu, Yijie

AU - Dong, Bin

AU - Wang, Qiufeng

AU - Ding, Lei

AU - Jin, Xiaobo

AU - Huang, Kaizhu

PY - 2023

Y1 - 2023

N2 - The encoder-decoder framework has recently become the cutting-edge approach in scene text recognition (STR), where most decoder networks consist of two parts: an attention model that aligns visual features from the encoder with each character, and a linear or LSTM-based model that predicts the label sequence. However, it is difficult for these attention models to obtain accurate alignment, and linear or LSTM models usually capture limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable-attention-based model to accurately align the visual features of each character. In this alignment model, we explicitly learn the spatial coordinate information of each character from the input reading-order sequence and optimize it with learnable sampled offsets in the attention block to obtain accurately aligned features. To address the lack of context, we explore a transformer-based decoder that captures global context through multi-head attention, where a mask matrix is integrated to keep the attention weights focused on the relevant context during decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR: it achieves better performance than other language-free methods, with significant improvements on most benchmark STR datasets, and obtains state-of-the-art performance on several datasets when a language model is integrated.

KW - Character alignment

KW - Deformable attention

KW - Mask attention

KW - Scene text recognition

UR - http://www.scopus.com/inward/record.url?scp=85161713315&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-30111-7_59

DO - 10.1007/978-3-031-30111-7_59

M3 - Conference Proceeding

AN - SCOPUS:85161713315

SN - 9783031301100

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 705

EP - 717

BT - Neural Information Processing - 29th International Conference, ICONIP 2022, Proceedings

A2 - Tanveer, Mohammad

A2 - Agarwal, Sonali

A2 - Ozawa, Seiichi

A2 - Ekbal, Asif

A2 - Jatowt, Adam

PB - Springer Science and Business Media Deutschland GmbH

T2 - 29th International Conference on Neural Information Processing, ICONIP 2022

Y2 - 22 November 2022 through 26 November 2022

ER -

Hu Y, Dong B, Wang Q, Ding L, Jin X, Huang K. Towards Accurate Alignment and Sufficient Context in Scene Text Recognition. In Tanveer M, Agarwal S, Ozawa S, Ekbal A, Jatowt A, editors, Neural Information Processing - 29th International Conference, ICONIP 2022, Proceedings. Springer Science and Business Media Deutschland GmbH. 2023. p. 705-717. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-031-30111-7_59