MonoTCM: Semantic-Depth Fusion Transformer for Monocular 3D Object Detection with Token Clustering and Merging

Changyu Zeng, Zimu Wang, Jimin Xiao, Anh Nguyen, Kaizhu Huang, Wei Wang*, Yutao Yue*

*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding, Conference Proceeding, peer-review

Abstract

Monocular 3D object detection presents significant challenges due to the inherent absence of depth and geometric information, rendering it more complex than 2D detection. This paper introduces MonoTCM, a Semantic-Depth Fusion Transformer that leverages a Token Clustering and Merging (TCM) module to enhance the efficiency and accuracy of monocular 3D object detection. The TCM module aggregates multi-scale grid-based tokens into clustering-based tokens, dynamically adjusting their shapes and sizes based on local density and distance metrics. This allows for finer granularity in critical areas while consolidating less informative regions. The aggregated tokens are subsequently decomposed into semantic and depth features, processed through dedicated transformer-based encoders, and integrated using a semantic-depth fusion decoder modeled after DETR. This approach enhances the model’s ability to capture implicit global geometric information and provides a cost-effective solution for real-time intelligent driving applications. Experimental results demonstrate the superiority of MonoTCM in enhancing detection performance compared to other advanced methods, highlighting its potential to advance the field of monocular 3D object detection.
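The abstract does not say which clustering algorithm the TCM module uses. As a rough illustration only, the sketch below assumes a density-peaks style selection (often called DPC-kNN): each token gets a local density from its k nearest neighbours and a distance to the nearest denser token, the tokens scoring highest on both become cluster centres, and the members of each cluster are averaged into one merged token. The function name `cluster_and_merge`, the parameter `k`, and the plain averaging are hypothetical choices for this example, not the authors' implementation.

```python
import math


def cluster_and_merge(tokens, num_clusters, k=2):
    """Merge N token vectors into num_clusters tokens.

    Illustrative density-peaks (DPC-kNN-style) sketch; an assumption,
    not the method described in the paper.
    """
    n = len(tokens)
    if n < 2 or not 0 < num_clusters <= n:
        raise ValueError("need at least 2 tokens and 1 <= num_clusters <= N")

    # Pairwise Euclidean distances between token feature vectors.
    dist = [[math.dist(a, b) for b in tokens] for a in tokens]

    # Local density: close to 1 when the k nearest neighbours are close.
    density = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        density.append(math.exp(-sum(d * d for d in nearest) / len(nearest)))

    # Distance indicator: distance to the closest token of higher density.
    # A token with no denser neighbour gets the largest pairwise distance.
    max_dist = max(max(row) for row in dist)
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if density[j] > density[i]]
        delta.append(min(higher) if higher else max_dist)

    # Tokens that are both dense and far from denser tokens become centres.
    score = [density[i] * delta[i] for i in range(n)]
    centres = sorted(range(n), key=lambda i: score[i], reverse=True)
    centres = centres[:num_clusters]

    # Assign every token to its nearest centre.
    assign = [
        min(range(num_clusters), key=lambda c: dist[i][centres[c]])
        for i in range(n)
    ]
    # Each centre always belongs to its own cluster, so none is empty.
    for c, idx in enumerate(centres):
        assign[idx] = c

    # Merge step: average the features of the tokens in each cluster.
    merged = []
    for c in range(num_clusters):
        members = [tokens[i] for i in range(n) if assign[i] == c]
        merged.append([sum(col) / len(members) for col in zip(*members)])
    return merged, assign
```

Calling `cluster_and_merge(tokens, num_clusters=2)` on a list of feature vectors returns the merged tokens and the cluster index of each input token. Because the centres follow the data, dense regions of feature space can keep more, smaller clusters while sparse regions collapse into a few large ones, which matches the behaviour the abstract describes in general terms.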

Original language: English
Title of host publication: Neural Information Processing - 31st International Conference, ICONIP 2024, Proceedings
Editors: Mufti Mahmud, Maryam Doborjeh, Zohreh Doborjeh, Kevin Wong, Andrew Chi Sing Leung, M. Tanveer
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 332-346
Number of pages: 15
ISBN (Print): 9789819670352
Publication status: Published - 2026
Event: 31st International Conference on Neural Information Processing, ICONIP 2024 - Auckland, New Zealand
Duration: 2 Dec 2024 to 6 Dec 2024

Publication series

Name: Communications in Computer and Information Science
Volume: 2297 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 31st International Conference on Neural Information Processing, ICONIP 2024
Country/Territory: New Zealand
City: Auckland
Period: 2/12/24 to 6/12/24

Keywords

  • Computer Vision
  • Depth Estimation
  • Monocular 3D Object Detection
