FLAG: A Framework with Explicit Learning Based on Appearance and Gait for Video-Based Clothes-Changing Person Re-Identification

Hengjie Lu, Yilin Gao, Shugong Xu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Person re-identification (ReID) aims to retrieve a target person across non-overlapping surveillance cameras. Video-based clothes-changing person re-identification (VCC-ReID) has become an essential branch of ReID due to the rich spatial and temporal information in videos and its broad application scenarios. Appearance and gait are discriminative features in video-based ReID, but appearance information becomes unreliable when clothes change, which makes VCC-ReID challenging. To address this challenge, we propose a Framework with explicit Learning based on Appearance and Gait (FLAG), which explicitly extracts these two types of information and can be combined with most existing video-based ReID methods. FLAG includes a multi-modal and multi-granularity Architecture (MGA), a large model, and a Cross-Modal Knowledge Distillation Scheme (CMKDS), which yields a small model; the two can be deployed on devices with different computing resources. The MGA simultaneously takes the visible-light and silhouette modalities as input to explicitly learn the appearance and gait features, respectively. The silhouette modality is provided at several levels of granularity to model global and local gait features, and each granularity serves as an independent input to the MGA. An embedding-based parallel fusion module is proposed to efficiently fuse the appearance feature with the multi-granularity gait features. The CMKDS is presented to distill the MGA into a small single-modal model that takes only the visible-light modality as input; embedding-based direct and indirect distillation strategies are designed for the CMKDS. Experimental results demonstrate that FLAG significantly improves the performance of existing video-based ReID methods when combined with them. In addition, when FLAG is combined with the AP3D method, the MGA outperforms the state-of-the-art accuracy by 4.2%.
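The abstract gives no implementation details, so the following is only a minimal PyTorch-style sketch of what an embedding-based parallel fusion of one appearance embedding with several per-granularity gait embeddings, plus an embedding-based direct distillation loss, could look like. All module names, variable names, and the additive fusion choice (ParallelFusion, direct_distillation_loss, etc.) are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelFusion(nn.Module):
    """Hypothetical sketch: fuse an appearance embedding with several
    gait embeddings (one per silhouette granularity) in parallel."""

    def __init__(self, dim: int, num_granularities: int):
        super().__init__()
        # One projection per gait granularity, applied in parallel.
        self.gait_projs = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_granularities)
        )
        self.app_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, app_emb, gait_embs):
        # app_emb: (B, dim) embedding from the visible-light branch.
        # gait_embs: list of (B, dim) embeddings, one per granularity.
        fused = self.app_proj(app_emb)
        for proj, gait_emb in zip(self.gait_projs, gait_embs):
            # Additive parallel fusion: each branch contributes independently.
            fused = fused + proj(gait_emb)
        return self.out_proj(F.relu(fused))


def direct_distillation_loss(student_emb, teacher_emb):
    # One plausible "embedding-based direct distillation": pull the small
    # single-modal student's embedding toward the frozen teacher embedding.
    return F.mse_loss(student_emb, teacher_emb.detach())


# Usage with dummy tensors: one appearance and three gait granularities.
if __name__ == "__main__":
    fusion = ParallelFusion(dim=256, num_granularities=3)
    app = torch.randn(8, 256)
    gaits = [torch.randn(8, 256) for _ in range(3)]
    teacher_emb = fusion(app, gaits)           # teacher (MGA-like) output
    student_emb = torch.randn(8, 256)          # stand-in for the student
    loss = direct_distillation_loss(student_emb, teacher_emb)
    print(loss.item())
```

The indirect distillation strategy mentioned in the abstract is not specified there; a common variant would match pairwise similarity structure between teacher and student embeddings rather than the embeddings themselves.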

Original language: English
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Publication status: Accepted/In press - 2024
Externally published: Yes

Keywords

  • Clothes-changing person re-identification
  • Knowledge distillation
  • Multi-modal learning
  • Video-based person re-identification
