An end-to-end tracking framework via multi-view and temporal feature aggregation

Yihan Yang, Ming Xu*, Jason F. Ralph, Yuchen Ling, Xiaonan Pan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Multi-view pedestrian tracking has frequently been used to cope with the occlusion and limited fields of view that challenge single-view tracking. However, there are few end-to-end methods in this field. Many existing algorithms detect pedestrians in individual views, cluster the projected detections in a top view and then track them; others track pedestrians in individual views and then associate the projected tracklets in a top view. In this paper, an end-to-end framework is proposed for multi-view tracking, in which feature maps are aggregated both across views and over time. The multi-view aggregation projects the per-view feature maps to a top view, applies a transformer encoder to produce encoded feature maps and then uses a CNN to compute a pedestrian occupancy map. The temporal aggregation uses another CNN to estimate position offsets from the encoded feature maps of consecutive frames. Our experiments demonstrate that this end-to-end framework outperforms state-of-the-art online algorithms for multi-view pedestrian tracking.
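To make the two aggregation steps concrete, the following is a minimal PyTorch sketch of the pipeline as the abstract describes it, not the authors' implementation. It assumes per-view feature maps have already been projected (e.g. by homography warping) onto a common top-view grid; the module name, layer sizes, view-wise token layout and pooling choice are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiViewTemporalAggregator(nn.Module):
    """Sketch of the described pipeline: a transformer encoder fuses
    per-view top-view features; one CNN head decodes an occupancy map,
    another regresses offsets between consecutive frames."""

    def __init__(self, channels: int = 64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=4,
                                           batch_first=True)
        self.view_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Encoded features -> single-channel pedestrian occupancy heatmap.
        self.occupancy_head = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1))
        # Encoded features of two consecutive frames -> (dx, dy) per cell.
        self.offset_head = nn.Sequential(
            nn.Conv2d(2 * channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 1))

    def fuse_views(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, V, C, H, W) -- V per-view maps on the top-view grid.
        # Treat the V view features at each grid cell as a token sequence.
        B, V, C, H, W = feats.shape
        tokens = feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, V, C)
        fused = self.view_encoder(tokens).mean(dim=1)  # pool over views
        return fused.reshape(B, H, W, C).permute(0, 3, 1, 2)  # (B,C,H,W)

    def forward(self, feats_t, feats_prev):
        enc_t = self.fuse_views(feats_t)        # encoded maps, frame t
        enc_prev = self.fuse_views(feats_prev)  # encoded maps, frame t-1
        occupancy = self.occupancy_head(enc_t)  # (B, 1, H, W)
        offsets = self.offset_head(torch.cat([enc_t, enc_prev], dim=1))
        return occupancy, offsets
```

A hypothetical usage, with seven cameras and a small top-view grid:

```python
model = MultiViewTemporalAggregator()
f_t = torch.randn(1, 7, 64, 30, 90)     # (B, V, C, H, W), frame t
f_prev = torch.randn(1, 7, 64, 30, 90)  # frame t-1
occ, off = model(f_t, f_prev)           # (1, 1, 30, 90), (1, 2, 30, 90)
```

Peaks in the occupancy map give candidate pedestrian positions, and the offsets at those positions link detections across frames, which is what makes the framework trainable end to end rather than relying on a separate clustering or tracklet-association stage.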
Original language: English
Article number: 104203
Number of pages: 10
Journal: Computer Vision and Image Understanding
Volume: 249
DOIs
Publication status: Published - 10 Oct 2024
