Gait-ViT: Gait Recognition with Vision Transformer

Jashila Nair Mogan; Chin Poo Lee; Kian Ming Lim; Kalaiarasi Sonai Muthu

doi:10.3390/s22197362

Jashila Nair Mogan, Chin Poo Lee^*, Kian Ming Lim, Kalaiarasi Sonai Muthu

^*Corresponding author for this work

Multimedia University

Research output: Contribution to journal › Article › peer-review

24 Citations (Scopus)

Abstract

Biometric recognition identifies an individual based on their physical or behavioral characteristics. Gait is one of the most reliable biometrics because it can be perceived at a long distance and is difficult to replicate. Existing works mostly leverage Convolutional Neural Networks (CNNs) for gait recognition. CNNs perform well in image recognition tasks; however, they lack an attention mechanism to emphasize the significant regions of the image. The attention mechanism encodes information in the image patches, which helps the model learn the substantial features in specific regions. In light of this, this work employs the Vision Transformer (ViT) with an attention mechanism for gait recognition, referred to as Gait-ViT. In the proposed Gait-ViT, the gait energy image is first obtained by averaging the series of images over the gait cycle. The images are then split into patches and transformed into sequences by flattening and patch embedding. Position embedding is applied together with patch embedding to the sequence of patches to restore the positional information of the patches. Subsequently, the sequence of vectors is fed to the Transformer encoder to produce the final gait representation. For classification, the first element of the sequence is sent to the multi-layer perceptron to predict the class label. The proposed method obtained 99.93% on CASIA-B, 100% on OU-ISIR D and 99.51% on OU-LP, which demonstrates the ability of the Vision Transformer model to outperform the state-of-the-art methods.
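
The preprocessing steps named in the abstract (gait energy image, patch splitting, patch and position embedding) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the patch size, the embedding dimension, and the random values standing in for learned parameters are all assumptions made for illustration, and the Transformer encoder and multi-layer perceptron head are omitted.

```python
import random


def gait_energy_image(frames):
    """Average equally sized 2-D silhouette frames pixel by pixel."""
    count = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(frame[r][c] for frame in frames) / count for c in range(width)]
        for r in range(height)
    ]


def split_into_patches(image, patch):
    """Cut an image into non-overlapping patch x patch tiles, each flattened."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


def embed_patches(patches, dim, seed=0):
    """Project patches to `dim` values, prepend a class token, add positions.

    Assumption: random numbers stand in for parameters that a real model
    would learn during training.
    """
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

    projection = rand_matrix(len(patches[0]), dim)
    class_token = rand_matrix(1, dim)[0]
    position = rand_matrix(len(patches) + 1, dim)

    projected = [
        [sum(v * projection[k][d] for k, v in enumerate(p)) for d in range(dim)]
        for p in patches
    ]
    tokens = [class_token] + projected
    # Element-wise sum of each token with its position embedding.
    return [[t + q for t, q in zip(tok, pos)] for tok, pos in zip(tokens, position)]


if __name__ == "__main__":
    # Two toy 4x4 binary silhouettes standing in for one gait cycle.
    cycle = [
        [[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]],
        [[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0]],
    ]
    gei = gait_energy_image(cycle)
    sequence = embed_patches(split_into_patches(gei, patch=2), dim=8)
    print(len(sequence), "tokens of length", len(sequence[0]))
```

In this sketch the first element of the returned sequence corresponds to the class token, which is the element the abstract describes as being passed to the multi-layer perceptron after the encoder.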

Original language	English
Article number	7362
Journal	Sensors
Volume	22
Issue number	19
DOIs	https://doi.org/10.3390/s22197362
Publication status	Published - Oct 2022
Externally published	Yes

Keywords

attention
deep learning
gait
gait recognition
transformers
vision transformer
vit

Cite this

@article{fd15b83faefb4b6e9ab57275bf2a13ea,

title = "Gait-ViT: Gait Recognition with Vision Transformer",

keywords = "attention, deep learning, gait, gait recognition, transformers, vision transformer, vit",

author = "Mogan, {Jashila Nair} and Lee, {Chin Poo} and Lim, {Kian Ming} and Muthu, {Kalaiarasi Sonai}",

note = "Publisher Copyright: {\textcopyright} 2022 by the authors.",

year = "2022",

month = oct,

doi = "10.3390/s22197362",

language = "English",

volume = "22",

journal = "Sensors",

issn = "1424-8220",

publisher = "MDPI (Basel, Switzerland)",

number = "19",

}

TY - JOUR

T1 - Gait-ViT: Gait Recognition with Vision Transformer

T2 - Gait Recognition with Vision Transformer

AU - Mogan, Jashila Nair

AU - Lee, Chin Poo

AU - Lim, Kian Ming

AU - Muthu, Kalaiarasi Sonai

PY - 2022/10

Y1 - 2022/10

KW - attention

KW - deep learning

KW - gait

KW - gait recognition

KW - transformers

KW - vision transformer

KW - vit

UR - http://www.scopus.com/inward/record.url?scp=85139812381&partnerID=8YFLogxK

U2 - 10.3390/s22197362

DO - 10.3390/s22197362

M3 - Article

C2 - 36236462

AN - SCOPUS:85139812381

SN - 1424-8220

VL - 22

JO - Sensors

JF - Sensors

IS - 19

M1 - 7362

ER -
