HGR-ViT: Hand Gesture Recognition with Vision Transformer

Chun Keat Tan; Kian Ming Lim; Roy Kwang Yang Chang; Chin Poo Lee; Ali Alqahtani

doi:10.3390/s23125555

Chun Keat Tan, Kian Ming Lim*, Roy Kwang Yang Chang, Chin Poo Lee, Ali Alqahtani

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

15 Citations (Scopus)

Abstract

Hand gesture recognition (HGR) is a crucial area of research that enhances communication by overcoming language barriers and facilitating human-computer interaction. Although previous works in HGR have employed deep neural networks, they fail to encode the orientation and position of the hand in the image. To address this issue, this paper proposes HGR-ViT, a Vision Transformer (ViT) model with an attention mechanism for hand gesture recognition. A hand gesture image is first split into fixed-size patches, and each patch is mapped to an embedding. Positional embeddings are added to these patch embeddings to form learnable vectors that capture the positional information of the hand patches. The resulting sequence of vectors is then fed to a standard Transformer encoder to obtain the hand gesture representation. A multilayer perceptron head is added to the output of the encoder to classify the hand gesture into the correct class. The proposed HGR-ViT obtains accuracies of 99.98%, 99.36%, and 99.85% on the American Sign Language (ASL) dataset, the ASL with Digits dataset, and the National University of Singapore (NUS) hand gesture dataset, respectively.
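The first two steps described in the abstract (patch splitting and adding positional embeddings) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy 4x4 single-channel image, the patch size of 2, the embedding width of 3, and the random values standing in for learned weights are all assumptions made for illustration. The class token, Transformer encoder, and multilayer perceptron head are omitted.

```python
import random

def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tiles, row-major."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches

def embed_patches(patches, weights, pos_embed):
    """Linearly project each flattened patch, then add that patch's positional embedding."""
    tokens = []
    for patch_vec, pos in zip(patches, pos_embed):
        # One dot product per output dimension (each row of `weights`).
        projected = [sum(x * w for x, w in zip(patch_vec, row)) for row in weights]
        tokens.append([p + e for p, e in zip(projected, pos)])
    return tokens

# Toy 4x4 image with pixel values 0..15; patch size 2 gives 4 patches of 4 pixels.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)

rng = random.Random(0)
dim = 3  # hypothetical embedding width
# Random values stand in for the learned projection and positional embeddings.
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(dim)]
pos_embed = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in patches]

# `tokens` is the sequence a Transformer encoder would consume.
tokens = embed_patches(patches, weights, pos_embed)
```

Because each patch receives its own positional vector, two patches with identical pixel content at different locations produce different tokens, which is how the position of the hand in the image can reach the encoder.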

Original language	English
Article number	5555
Journal	Sensors
Volume	23
Issue number	12
DOIs	https://doi.org/10.3390/s23125555
Publication status	Published - Jun 2023
Externally published	Yes

Keywords

attention
hand gesture recognition
sign language recognition
vision transformer
ViT

Cite this

@article{0df155ff58c14d3299d79b2ba8f72b93,

title = "HGR-ViT: Hand Gesture Recognition with Vision Transformer",

abstract = "Hand gesture recognition (HGR) is a crucial area of research that enhances communication by overcoming language barriers and facilitating human-computer interaction. Although previous works in HGR have employed deep neural networks, they fail to encode the orientation and position of the hand in the image. To address this issue, this paper proposes HGR-ViT, a Vision Transformer (ViT) model with an attention mechanism for hand gesture recognition. Given a hand gesture image, it is first split into fixed size patches. Positional embedding is added to these embeddings to form learnable vectors that capture the positional information of the hand patches. The resulting sequence of vectors are then served as the input to a standard Transformer encoder to obtain the hand gesture representation. A multilayer perceptron head is added to the output of the encoder to classify the hand gesture to the correct class. The proposed HGR-ViT obtains an accuracy of 99.98%, 99.36% and 99.85% for the American Sign Language (ASL) dataset, ASL with Digits dataset, and National University of Singapore (NUS) hand gesture dataset, respectively.",

keywords = "attention, hand gesture recognition, sign language recognition, vision transformer, ViT",

author = "Tan, {Chun Keat} and Lim, {Kian Ming} and Chang, {Roy Kwang Yang} and Lee, {Chin Poo} and Alqahtani, Ali",

note = "Publisher Copyright: {\textcopyright} 2023 by the authors.",

year = "2023",

month = jun,

doi = "10.3390/s23125555",

language = "English",

volume = "23",

journal = "Sensors",

issn = "1424-8220",

publisher = "MDPI (Basel, Switzerland)",

number = "12",

}

TY - JOUR

T1 - HGR-ViT: Hand Gesture Recognition with Vision Transformer

AU - Tan, Chun Keat

AU - Lim, Kian Ming

AU - Chang, Roy Kwang Yang

AU - Lee, Chin Poo

AU - Alqahtani, Ali

PY - 2023/6

Y1 - 2023/6

N2 - Hand gesture recognition (HGR) is a crucial area of research that enhances communication by overcoming language barriers and facilitating human-computer interaction. Although previous works in HGR have employed deep neural networks, they fail to encode the orientation and position of the hand in the image. To address this issue, this paper proposes HGR-ViT, a Vision Transformer (ViT) model with an attention mechanism for hand gesture recognition. Given a hand gesture image, it is first split into fixed size patches. Positional embedding is added to these embeddings to form learnable vectors that capture the positional information of the hand patches. The resulting sequence of vectors are then served as the input to a standard Transformer encoder to obtain the hand gesture representation. A multilayer perceptron head is added to the output of the encoder to classify the hand gesture to the correct class. The proposed HGR-ViT obtains an accuracy of 99.98%, 99.36% and 99.85% for the American Sign Language (ASL) dataset, ASL with Digits dataset, and National University of Singapore (NUS) hand gesture dataset, respectively.

AB - Hand gesture recognition (HGR) is a crucial area of research that enhances communication by overcoming language barriers and facilitating human-computer interaction. Although previous works in HGR have employed deep neural networks, they fail to encode the orientation and position of the hand in the image. To address this issue, this paper proposes HGR-ViT, a Vision Transformer (ViT) model with an attention mechanism for hand gesture recognition. Given a hand gesture image, it is first split into fixed size patches. Positional embedding is added to these embeddings to form learnable vectors that capture the positional information of the hand patches. The resulting sequence of vectors are then served as the input to a standard Transformer encoder to obtain the hand gesture representation. A multilayer perceptron head is added to the output of the encoder to classify the hand gesture to the correct class. The proposed HGR-ViT obtains an accuracy of 99.98%, 99.36% and 99.85% for the American Sign Language (ASL) dataset, ASL with Digits dataset, and National University of Singapore (NUS) hand gesture dataset, respectively.

KW - attention

KW - hand gesture recognition

KW - sign language recognition

KW - vision transformer

KW - ViT

UR - http://www.scopus.com/inward/record.url?scp=85163935721&partnerID=8YFLogxK

U2 - 10.3390/s23125555

DO - 10.3390/s23125555

M3 - Article

C2 - 37420722

AN - SCOPUS:85163935721

SN - 1424-8220

VL - 23

JO - Sensors

JF - Sensors

IS - 12

M1 - 5555

ER -
