TY - JOUR
T1 - Transformer-Based Language-Person Search with Multiple Region Slicing
AU - Li, Hui
AU - Xiao, Jimin
AU - Sun, Mingjie
AU - Lim, Eng Gee
AU - Zhao, Yao
N1 - Publisher Copyright: © 1991-2012 IEEE.
PY - 2022/3/1
Y1 - 2022/3/1
AB - Language-person search is an essential technique for applications such as criminal searching, where it is more feasible for a witness to provide a language description of a suspect than a photo. Most existing works treat the language-person pair as a black box, considering neither the inner structure of a person picture nor the correlations between image regions and referring words. In this work, we propose a transformer-based language-person search framework in which matching is conducted between words and image regions. A person picture is vertically separated into multiple regions in two different ways: overlapped slicing and key-point-based slicing. The co-attention between linguistic referring words and visual features is evaluated via transformer blocks. Besides achieving outstanding search performance, the proposed method provides interpretability by visualizing the co-attention between image parts in the person picture and the corresponding referring words. Without bells and whistles, and with a simple yet elegant design, we achieve state-of-the-art performance on the CUHK-PEDES dataset with a Rank-1 score of 57.67% and on the PA100K dataset with an mAP of 22.88%. Code is available at https://github.com/detectiveli/T-MRS.
KW - Transformer
KW - language-person search
UR - http://www.scopus.com/inward/record.url?scp=85104672321&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2021.3073718
DO - 10.1109/TCSVT.2021.3073718
M3 - Article
AN - SCOPUS:85104672321
SN - 1051-8215
VL - 32
SP - 1624
EP - 1633
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 3
ER -