Cognitive Disentanglement for Referring Multi-Object Tracking

Shaofeng Liang; Runwei Guan; Wangwang Lian; Daizong Liu; Xiaolou Sun; Dongming Wu; Yutao Yue; Weiping Ding; Hui Xiong

doi:10.1016/j.inffus.2025.103349

Shaofeng Liang, Runwei Guan*, Wangwang Lian, Daizong Liu, Xiaolou Sun, Dongming Wu, Yutao Yue, Weiping Ding, Hui Xiong

*Corresponding author for this work

Xi'an Jiaotong-Liverpool University

Research output: Contribution to journal › Article › peer-review

Abstract

As a significant application of multi-source information fusion in intelligent transportation perception systems, Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language references. However, existing RMOT approaches often treat language descriptions as holistic embeddings and struggle to effectively integrate the rich semantic information contained in language expressions with visual features. This limitation is especially apparent in complex scenes requiring comprehensive understanding of both static object attributes and spatial motion information. In this paper, we propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework that addresses these challenges. It adapts the “what” and “where” pathways from the human visual processing system to RMOT tasks. Specifically, our framework first establishes cross-modal connections while preserving modality-specific characteristics. It then disentangles language descriptions and hierarchically injects them into object queries, refining object understanding from coarse to fine-grained semantic levels. Finally, we reconstruct language representations based on visual features, ensuring that tracked objects faithfully reflect the referring expression. Extensive experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods, with average gains of 6.0% in HOTA score on Refer-KITTI and 3.2% on Refer-KITTI-V2. Our approach advances the state-of-the-art in RMOT while simultaneously providing new insights into multi-source information fusion.

Original language	English
Article number	103349
Journal	Information Fusion
Volume	124
DOIs	https://doi.org/10.1016/j.inffus.2025.103349
Publication status	Published - Dec 2025

Keywords

Human-centric perception
Human-visual-inspired neural network
Referring multi-object tracking
Vision–language fusion

Cite this

@article{85a8490da2504b398a8eb0388e6e34e2,

title = "Cognitive Disentanglement for Referring Multi-Object Tracking",

abstract = "As a significant application of multi-source information fusion in intelligent transportation perception systems, Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language references. However, existing RMOT approaches often treat language descriptions as holistic embeddings and struggle to effectively integrate the rich semantic information contained in language expressions with visual features. This limitation is especially apparent in complex scenes requiring comprehensive understanding of both static object attributes and spatial motion information. In this paper, we propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework that addresses these challenges. It adapts the ``what'' and ``where'' pathways from the human visual processing system to RMOT tasks. Specifically, our framework first establishes cross-modal connections while preserving modality-specific characteristics. It then disentangles language descriptions and hierarchically injects them into object queries, refining object understanding from coarse to fine-grained semantic levels. Finally, we reconstruct language representations based on visual features, ensuring that tracked objects faithfully reflect the referring expression. Extensive experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods, with average gains of 6.0\% in HOTA score on Refer-KITTI and 3.2\% on Refer-KITTI-V2. Our approach advances the state-of-the-art in RMOT while simultaneously providing new insights into multi-source information fusion.",

keywords = "Human-centric perception, Human-visual-inspired neural network, Referring multi-object tracking, Vision–language fusion",

author = "Shaofeng Liang and Runwei Guan and Wangwang Lian and Daizong Liu and Xiaolou Sun and Dongming Wu and Yutao Yue and Weiping Ding and Hui Xiong",

note = "Publisher Copyright: {\textcopyright} 2025 Elsevier B.V.",

year = "2025",

month = dec,

doi = "10.1016/j.inffus.2025.103349",

language = "English",

volume = "124",

journal = "Information Fusion",

issn = "1566-2535",

}

TY - JOUR

T1 - Cognitive Disentanglement for Referring Multi-Object Tracking

AU - Liang, Shaofeng

AU - Guan, Runwei

AU - Lian, Wangwang

AU - Liu, Daizong

AU - Sun, Xiaolou

AU - Wu, Dongming

AU - Yue, Yutao

AU - Ding, Weiping

AU - Xiong, Hui

PY - 2025/12

Y1 - 2025/12

N2 - As a significant application of multi-source information fusion in intelligent transportation perception systems, Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language references. However, existing RMOT approaches often treat language descriptions as holistic embeddings and struggle to effectively integrate the rich semantic information contained in language expressions with visual features. This limitation is especially apparent in complex scenes requiring comprehensive understanding of both static object attributes and spatial motion information. In this paper, we propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework that addresses these challenges. It adapts the “what” and “where” pathways from the human visual processing system to RMOT tasks. Specifically, our framework first establishes cross-modal connections while preserving modality-specific characteristics. It then disentangles language descriptions and hierarchically injects them into object queries, refining object understanding from coarse to fine-grained semantic levels. Finally, we reconstruct language representations based on visual features, ensuring that tracked objects faithfully reflect the referring expression. Extensive experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods, with average gains of 6.0% in HOTA score on Refer-KITTI and 3.2% on Refer-KITTI-V2. Our approach advances the state-of-the-art in RMOT while simultaneously providing new insights into multi-source information fusion.

KW - Human-centric perception

KW - Human-visual-inspired neural network

KW - Referring multi-object tracking

KW - Vision–language fusion

UR - http://www.scopus.com/inward/record.url?scp=105007298110&partnerID=8YFLogxK

U2 - 10.1016/j.inffus.2025.103349

DO - 10.1016/j.inffus.2025.103349

M3 - Article

AN - SCOPUS:105007298110

SN - 1566-2535

VL - 124

JO - Information Fusion

JF - Information Fusion

M1 - 103349

ER -
