Transformer-Based Language-Person Search with Multiple Region Slicing

Hui Li; Jimin Xiao; Mingjie Sun; Eng Gee Lim; Yao Zhao

doi:10.1109/TCSVT.2021.3073718

Hui Li, Jimin Xiao^*, Mingjie Sun, Eng Gee Lim, Yao Zhao

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

29 Citations (Scopus)

Abstract

Language-person search is an essential technique for applications like criminal searching, where it is more feasible for a witness to provide a language description of a suspect than to provide a photo. Most existing works treat the language-person pair as a black box, considering neither the inner structure of a person picture nor the correlations between image regions and referring words. In this work, we propose a transformer-based language-person search framework with matching conducted between words and image regions, where a person picture is vertically separated into multiple regions in two different ways: overlapped slicing and key-point-based slicing. The co-attention between linguistic referring words and visual features is evaluated via transformer blocks. Besides achieving outstanding search performance, the proposed method provides interpretability by visualizing the co-attention between image parts in the person picture and the corresponding referring words. Without bells and whistles, we achieve state-of-the-art performance on the CUHK-PEDES dataset with a Rank-1 score of 57.67% and on the PA100K dataset with an mAP of 22.88%, with a simple yet elegant design. Code is available at https://github.com/detectiveli/T-MRS.
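The overlapped slicing mentioned in the abstract can be pictured with a minimal sketch. It is not taken from the paper or the T-MRS repository: the helper names `overlapped_slices` and `slice_rows`, the `num_regions` and `overlap` parameters, and the reading of "vertically separated" as stripes stacked along the image height are all assumptions made for illustration.

```python
def overlapped_slices(height, num_regions, overlap):
    """Return (top, bottom) row ranges for stripes stacked along the height axis.

    Hypothetical helper: `num_regions` and `overlap` (the fraction of a
    stripe's height shared with its neighbour) are illustrative parameters,
    not values taken from the paper.
    """
    if num_regions < 1 or not 0.0 <= overlap < 1.0:
        raise ValueError("need num_regions >= 1 and 0 <= overlap < 1")
    # Stripes of equal height must cover the image exactly:
    # stripe + (num_regions - 1) * step = height, with step = stripe * (1 - overlap).
    stripe = height / (1 + (num_regions - 1) * (1 - overlap))
    step = stripe * (1 - overlap)
    bounds = []
    for i in range(num_regions):
        top = round(i * step)
        bottom = min(height, round(i * step + stripe))
        bounds.append((top, bottom))
    return bounds


def slice_rows(image_rows, num_regions, overlap):
    """Cut an image, given as a list of pixel rows, into overlapping stripes."""
    height = len(image_rows)
    return [image_rows[t:b] for t, b in overlapped_slices(height, num_regions, overlap)]
```

Under these assumptions, `overlapped_slices(100, 4, 0.5)` gives four 40-row stripes that each share half their height with the next one: `[(0, 40), (20, 60), (40, 80), (60, 100)]`. Setting `overlap` to `0.0` falls back to plain non-overlapping stripes.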

Original language	English
Pages (from-to)	1624-1633
Number of pages	10
Journal	IEEE Transactions on Circuits and Systems for Video Technology
Volume	32
Issue number	3
DOIs	https://doi.org/10.1109/TCSVT.2021.3073718
Publication status	Published - 1 Mar 2022

Keywords

Transformer
language-person search

Cite this

@article{9c3adb20043240b8ab1a64e14a994fe4,

title = "Transformer-Based Language-Person Search with Multiple Region Slicing",

abstract = "Language-person search is an essential technique for applications like criminal searching, where it is more feasible for a witness to provide a language description of a suspect than to provide a photo. Most existing works treat the language-person pair as a black box, considering neither the inner structure of a person picture nor the correlations between image regions and referring words. In this work, we propose a transformer-based language-person search framework with matching conducted between words and image regions, where a person picture is vertically separated into multiple regions in two different ways: overlapped slicing and key-point-based slicing. The co-attention between linguistic referring words and visual features is evaluated via transformer blocks. Besides achieving outstanding search performance, the proposed method provides interpretability by visualizing the co-attention between image parts in the person picture and the corresponding referring words. Without bells and whistles, we achieve state-of-the-art performance on the CUHK-PEDES dataset with a Rank-1 score of 57.67\% and on the PA100K dataset with an mAP of 22.88\%, with a simple yet elegant design. Code is available at https://github.com/detectiveli/T-MRS.",

keywords = "Transformer, language-person search",

author = "Hui Li and Jimin Xiao and Mingjie Sun and Lim, {Eng Gee} and Yao Zhao",

note = "Publisher Copyright: {\textcopyright} 1991-2012 IEEE.",

year = "2022",

month = mar,

day = "1",

doi = "10.1109/TCSVT.2021.3073718",

language = "English",

volume = "32",

pages = "1624--1633",

journal = "IEEE Transactions on Circuits and Systems for Video Technology",

issn = "1051-8215",

number = "3",

}

TY - JOUR

T1 - Transformer-Based Language-Person Search with Multiple Region Slicing

AU - Li, Hui

AU - Xiao, Jimin

AU - Sun, Mingjie

AU - Lim, Eng Gee

AU - Zhao, Yao

PY - 2022/3/1

Y1 - 2022/3/1

N2 - Language-person search is an essential technique for applications like criminal searching, where it is more feasible for a witness to provide a language description of a suspect than to provide a photo. Most existing works treat the language-person pair as a black box, considering neither the inner structure of a person picture nor the correlations between image regions and referring words. In this work, we propose a transformer-based language-person search framework with matching conducted between words and image regions, where a person picture is vertically separated into multiple regions in two different ways: overlapped slicing and key-point-based slicing. The co-attention between linguistic referring words and visual features is evaluated via transformer blocks. Besides achieving outstanding search performance, the proposed method provides interpretability by visualizing the co-attention between image parts in the person picture and the corresponding referring words. Without bells and whistles, we achieve state-of-the-art performance on the CUHK-PEDES dataset with a Rank-1 score of 57.67% and on the PA100K dataset with an mAP of 22.88%, with a simple yet elegant design. Code is available at https://github.com/detectiveli/T-MRS.

AB - Language-person search is an essential technique for applications like criminal searching, where it is more feasible for a witness to provide a language description of a suspect than to provide a photo. Most existing works treat the language-person pair as a black box, considering neither the inner structure of a person picture nor the correlations between image regions and referring words. In this work, we propose a transformer-based language-person search framework with matching conducted between words and image regions, where a person picture is vertically separated into multiple regions in two different ways: overlapped slicing and key-point-based slicing. The co-attention between linguistic referring words and visual features is evaluated via transformer blocks. Besides achieving outstanding search performance, the proposed method provides interpretability by visualizing the co-attention between image parts in the person picture and the corresponding referring words. Without bells and whistles, we achieve state-of-the-art performance on the CUHK-PEDES dataset with a Rank-1 score of 57.67% and on the PA100K dataset with an mAP of 22.88%, with a simple yet elegant design. Code is available at https://github.com/detectiveli/T-MRS.

KW - Transformer

KW - language-person search

UR - http://www.scopus.com/inward/record.url?scp=85104672321&partnerID=8YFLogxK

U2 - 10.1109/TCSVT.2021.3073718

DO - 10.1109/TCSVT.2021.3073718

M3 - Article

AN - SCOPUS:85104672321

SN - 1051-8215

VL - 32

SP - 1624

EP - 1633

JO - IEEE Transactions on Circuits and Systems for Video Technology

JF - IEEE Transactions on Circuits and Systems for Video Technology

IS - 3

ER -
