TY - JOUR
T1 - Scene Text Recognition via Dual-path Network with Shape-driven Attention Alignment
AU - Hu, Yijie
AU - Dong, Bin
AU - Huang, Kaizhu
AU - Ding, Lei
AU - Wang, Wei
AU - Huang, Xiaowei
AU - Wang, Qiu Feng
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/4/30
Y1 - 2024/4/30
N2 - Scene text recognition (STR), one typical sequence-to-sequence problem, has drawn much attention recently in multimedia applications. To guarantee good performance, it is essential for STR to obtain aligned character-wise features from the whole-image feature maps. While most present works adopt fully data-driven attention-based alignment, such practice ignores specific character geometric information. In this article, built upon a group of learnable geometric points, we propose a novel shape-driven attention alignment method that is able to obtain character-wise features. Concretely, we first design a corner detector to generate a shape map to guide the attention alignments explicitly, where a series of points can be learned to represent character-wise features flexibly. We then propose a dual-path network with a mutual learning and cooperating strategy that successfully combines CNN with a ViT-based model, leading to further accuracy improvement. We conduct extensive experiments to evaluate the proposed method on various scene text benchmarks, including six popular regular and irregular datasets, two more challenging datasets (i.e., WordArt and OST), and three Chinese datasets. Experimental results indicate that our method can achieve superior performance with a comparable model size against many state-of-the-art models.
AB - Scene text recognition (STR), one typical sequence-to-sequence problem, has drawn much attention recently in multimedia applications. To guarantee good performance, it is essential for STR to obtain aligned character-wise features from the whole-image feature maps. While most present works adopt fully data-driven attention-based alignment, such practice ignores specific character geometric information. In this article, built upon a group of learnable geometric points, we propose a novel shape-driven attention alignment method that is able to obtain character-wise features. Concretely, we first design a corner detector to generate a shape map to guide the attention alignments explicitly, where a series of points can be learned to represent character-wise features flexibly. We then propose a dual-path network with a mutual learning and cooperating strategy that successfully combines CNN with a ViT-based model, leading to further accuracy improvement. We conduct extensive experiments to evaluate the proposed method on various scene text benchmarks, including six popular regular and irregular datasets, two more challenging datasets (i.e., WordArt and OST), and three Chinese datasets. Experimental results indicate that our method can achieve superior performance with a comparable model size against many state-of-the-art models.
KW - OCR
KW - attention alignment
KW - deformable attention
KW - dual path network
KW - scene text recognition
UR - http://www.scopus.com/inward/record.url?scp=85182598099&partnerID=8YFLogxK
U2 - 10.1145/3633517
DO - 10.1145/3633517
M3 - Article
SN - 1551-6857
VL - 20
JO - ACM Transactions on Multimedia Computing, Communications and Applications
JF - ACM Transactions on Multimedia Computing, Communications and Applications
IS - 4
M1 - 107
ER -