Scene Text Recognition via Dual-path Network with Shape-driven Attention Alignment

Yijie Hu, Bin Dong, Kaizhu Huang, Lei Ding, Wei Wang, Xiaowei Huang, Qiu Feng Wang^*

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Scene text recognition (STR), one typical sequence-to-sequence problem, has drawn much attention recently in multimedia applications. To guarantee good performance, it is essential for STR to obtain aligned character-wise features from the whole-image feature maps. While most present works adopt fully data-driven attention-based alignment, such practice ignores specific character geometric information. In this article, built upon a group of learnable geometric points, we propose a novel shape-driven attention alignment method that is able to obtain character-wise features. Concretely, we first design a corner detector to generate a shape map to guide the attention alignments explicitly, where a series of points can be learned to represent character-wise features flexibly. We then propose a dual-path network with a mutual learning and cooperating strategy that successfully combines CNN with a ViT-based model, leading to further accuracy improvement. We conduct extensive experiments to evaluate the proposed method on various scene text benchmarks, including six popular regular and irregular datasets, two more challenging datasets (i.e., WordArt and OST), and three Chinese datasets. Experimental results indicate that our method can achieve superior performance with a comparable model size against many state-of-the-art models.

Original language	English
Article number	107
Journal	ACM Transactions on Multimedia Computing, Communications and Applications
Volume	20
Issue number	4
Early online date	11 Jan 2024
DOIs	https://doi.org/10.1145/3633517
Publication status	Published - 30 Apr 2024

Keywords

OCR
attention alignment
deformable attention
dual path network
scene text recognition

Access to Document

https://doi.org/10.1145/3633517

Cite this

@article{1a7f8d77130349b3b7aed8d4864582c9,

title = "Scene Text Recognition via Dual-path Network with Shape-driven Attention Alignment",

abstract = "Scene text recognition (STR), one typical sequence-to-sequence problem, has drawn much attention recently in multimedia applications. To guarantee good performance, it is essential for STR to obtain aligned character-wise features from the whole-image feature maps. While most present works adopt fully data-driven attention-based alignment, such practice ignores specific character geometric information. In this article, built upon a group of learnable geometric points, we propose a novel shape-driven attention alignment method that is able to obtain character-wise features. Concretely, we first design a corner detector to generate a shape map to guide the attention alignments explicitly, where a series of points can be learned to represent character-wise features flexibly. We then propose a dual-path network with a mutual learning and cooperating strategy that successfully combines CNN with a ViT-based model, leading to further accuracy improvement. We conduct extensive experiments to evaluate the proposed method on various scene text benchmarks, including six popular regular and irregular datasets, two more challenging datasets (i.e., WordArt and OST), and three Chinese datasets. Experimental results indicate that our method can achieve superior performance with a comparable model size against many state-of-the-art models.",

keywords = "OCR, attention alignment, deformable attention, dual path network, scene text recognition",

author = "Yijie Hu and Bin Dong and Kaizhu Huang and Lei Ding and Wei Wang and Xiaowei Huang and Wang, {Qiu Feng}",

note = "Publisher Copyright: {\textcopyright} 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.",

year = "2024",

month = apr,

day = "30",

doi = "10.1145/3633517",

language = "English",

volume = "20",

journal = "ACM Transactions on Multimedia Computing, Communications and Applications",

issn = "1551-6857",

publisher = "Association for Computing Machinery",

number = "4",

}

TY - JOUR

T1 - Scene Text Recognition via Dual-path Network with Shape-driven Attention Alignment

AU - Hu, Yijie

AU - Dong, Bin

AU - Huang, Kaizhu

AU - Ding, Lei

AU - Wang, Wei

AU - Huang, Xiaowei

AU - Wang, Qiu Feng

PY - 2024/4/30

Y1 - 2024/4/30

AB - Scene text recognition (STR), one typical sequence-to-sequence problem, has drawn much attention recently in multimedia applications. To guarantee good performance, it is essential for STR to obtain aligned character-wise features from the whole-image feature maps. While most present works adopt fully data-driven attention-based alignment, such practice ignores specific character geometric information. In this article, built upon a group of learnable geometric points, we propose a novel shape-driven attention alignment method that is able to obtain character-wise features. Concretely, we first design a corner detector to generate a shape map to guide the attention alignments explicitly, where a series of points can be learned to represent character-wise features flexibly. We then propose a dual-path network with a mutual learning and cooperating strategy that successfully combines CNN with a ViT-based model, leading to further accuracy improvement. We conduct extensive experiments to evaluate the proposed method on various scene text benchmarks, including six popular regular and irregular datasets, two more challenging datasets (i.e., WordArt and OST), and three Chinese datasets. Experimental results indicate that our method can achieve superior performance with a comparable model size against many state-of-the-art models.

KW - OCR

KW - attention alignment

KW - deformable attention

KW - dual path network

KW - scene text recognition

UR - http://www.scopus.com/inward/record.url?scp=85182598099&partnerID=8YFLogxK

U2 - 10.1145/3633517

DO - 10.1145/3633517

M3 - Article

SN - 1551-6857

VL - 20

JO - ACM Transactions on Multimedia Computing, Communications and Applications

JF - ACM Transactions on Multimedia Computing, Communications and Applications

IS - 4

M1 - 107

ER -
