STR Transformer: A Cross-domain Transformer for Scene Text Recognition

Xing Wu^*, Bin Tang, Ming Zhao, Jianjia Wang, Yike Guo

^*Corresponding author for this work

doi:10.1007/s10489-022-03728-5

Research output: Contribution to journal › Article › peer-review

10 Citations (Scopus)

Abstract

Scene text recognition is an indispensable part of computer vision, which aims to extract text information from an image. However, effective extraction of texts following spelling rules remains a challenge for scene text recognition. We propose a cross-domain Transformer, called STR Transformer (STRT), which can not only extract texts from an image but also correct characters effectively according to their spelling rules. Specifically, we propose a Spline Transformer to extract hierarchical features of images without the convolution layers, which has the flexibility to build models with various scales and has linear computational complexity with respect to image size. Furthermore, an iterative Text Transformer is designed to predict the probability distribution of current character in the character sequence, which can effectively reduce the impact of noise. Extensive experiments demonstrate that the proposed STRT outperforms state-of-the-art methods on various benchmark datasets of scene text recognition. The qualitative and quantitative analysis proves the effectiveness and efficiency of the proposed STRT method.

Original language	English
Pages (from-to)	3444-3458
Number of pages	15
Journal	Applied Intelligence
Volume	53
Issue number	3
DOIs	https://doi.org/10.1007/s10489-022-03728-5
Publication status	Published - Feb 2023
Externally published	Yes

Keywords

Cross-domain
Hierarchical feature
Scene text recognition
Transformer

Cite this

@article{6add204cfea04e8184919ae7c2e76369,
  title = "STR Transformer: A Cross-domain Transformer for Scene Text Recognition",
  abstract = "Scene text recognition is an indispensable part of computer vision, which aims to extract text information from an image. However, effective extraction of texts following spelling rules remains a challenge for scene text recognition. We propose a cross-domain Transformer, called STR Transformer (STRT), which can not only extract texts from an image but also correct characters effectively according to their spelling rules. Specifically, we propose a Spline Transformer to extract hierarchical features of images without the convolution layers, which has the flexibility to build models with various scales and has linear computational complexity with respect to image size. Furthermore, an iterative Text Transformer is designed to predict the probability distribution of current character in the character sequence, which can effectively reduce the impact of noise. Extensive experiments demonstrate that the proposed STRT outperforms state-of-the-art methods on various benchmark datasets of scene text recognition. The qualitative and quantitative analysis proves the effectiveness and efficiency of the proposed STRT method.",
  keywords = "Cross-domain, Hierarchical feature, Scene text recognition, Transformer",
  author = "Xing Wu and Bin Tang and Ming Zhao and Jianjia Wang and Yike Guo",
  note = "Publisher Copyright: {\textcopyright} 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.",
  year = "2023",
  month = feb,
  doi = "10.1007/s10489-022-03728-5",
  language = "English",
  volume = "53",
  pages = "3444--3458",
  journal = "Applied Intelligence",
  issn = "0924-669X",
  number = "3",
}

TY  - JOUR
T1  - STR Transformer
T2  - A Cross-domain Transformer for Scene Text Recognition
AU  - Wu, Xing
AU  - Tang, Bin
AU  - Zhao, Ming
AU  - Wang, Jianjia
AU  - Guo, Yike
PY  - 2023/2
Y1  - 2023/2
N2  - Scene text recognition is an indispensable part of computer vision, which aims to extract text information from an image. However, effective extraction of texts following spelling rules remains a challenge for scene text recognition. We propose a cross-domain Transformer, called STR Transformer (STRT), which can not only extract texts from an image but also correct characters effectively according to their spelling rules. Specifically, we propose a Spline Transformer to extract hierarchical features of images without the convolution layers, which has the flexibility to build models with various scales and has linear computational complexity with respect to image size. Furthermore, an iterative Text Transformer is designed to predict the probability distribution of current character in the character sequence, which can effectively reduce the impact of noise. Extensive experiments demonstrate that the proposed STRT outperforms state-of-the-art methods on various benchmark datasets of scene text recognition. The qualitative and quantitative analysis proves the effectiveness and efficiency of the proposed STRT method.
AB  - Scene text recognition is an indispensable part of computer vision, which aims to extract text information from an image. However, effective extraction of texts following spelling rules remains a challenge for scene text recognition. We propose a cross-domain Transformer, called STR Transformer (STRT), which can not only extract texts from an image but also correct characters effectively according to their spelling rules. Specifically, we propose a Spline Transformer to extract hierarchical features of images without the convolution layers, which has the flexibility to build models with various scales and has linear computational complexity with respect to image size. Furthermore, an iterative Text Transformer is designed to predict the probability distribution of current character in the character sequence, which can effectively reduce the impact of noise. Extensive experiments demonstrate that the proposed STRT outperforms state-of-the-art methods on various benchmark datasets of scene text recognition. The qualitative and quantitative analysis proves the effectiveness and efficiency of the proposed STRT method.
KW  - Cross-domain
KW  - Hierarchical feature
KW  - Scene text recognition
KW  - Transformer
UR  - http://www.scopus.com/inward/record.url?scp=85131092410&partnerID=8YFLogxK
U2  - 10.1007/s10489-022-03728-5
DO  - 10.1007/s10489-022-03728-5
M3  - Article
AN  - SCOPUS:85131092410
SN  - 0924-669X
VL  - 53
SP  - 3444
EP  - 3458
JO  - Applied Intelligence
JF  - Applied Intelligence
IS  - 3
ER  - 
