TY - JOUR
T1 - Cognitive Disentanglement for Referring Multi-Object Tracking
AU - Liang, Shaofeng
AU - Guan, Runwei
AU - Lian, Wangwang
AU - Liu, Daizong
AU - Sun, Xiaolou
AU - Wu, Dongming
AU - Yue, Yutao
AU - Ding, Weiping
AU - Xiong, Hui
N1 - Publisher Copyright: © 2025 Elsevier B.V.
PY - 2025/12
Y1 - 2025/12
N2 - As a significant application of multi-source information fusion in intelligent transportation perception systems, Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language references. However, existing RMOT approaches often treat language descriptions as holistic embeddings and struggle to effectively integrate the rich semantic information contained in language expressions with visual features. This limitation is especially apparent in complex scenes requiring comprehensive understanding of both static object attributes and spatial motion information. In this paper, we propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework that addresses these challenges. It adapts the “what” and “where” pathways from the human visual processing system to RMOT tasks. Specifically, our framework first establishes cross-modal connections while preserving modality-specific characteristics. It then disentangles language descriptions and hierarchically injects them into object queries, refining object understanding from coarse to fine-grained semantic levels. Finally, we reconstruct language representations based on visual features, ensuring that tracked objects faithfully reflect the referring expression. Extensive experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods, with average gains of 6.0% in HOTA score on Refer-KITTI and 3.2% on Refer-KITTI-V2. Our approach advances the state of the art in RMOT while simultaneously providing new insights into multi-source information fusion.
KW - Human-centric perception
KW - Human-visual-inspired neural network
KW - Referring multi-object tracking
KW - Vision–language fusion
UR - http://www.scopus.com/inward/record.url?scp=105007298110&partnerID=8YFLogxK
U2 - 10.1016/j.inffus.2025.103349
DO - 10.1016/j.inffus.2025.103349
M3 - Article
AN - SCOPUS:105007298110
SN - 1566-2535
VL - 124
JO - Information Fusion
JF - Information Fusion
M1 - 103349
ER -