StereoDETR: Stereo-based Transformer for 3D Object Detection

  • Shiyi Mu
  • Zichong Gu
  • Zhiqi Ai
  • Anqi Liu
  • Yilin Gao
  • Shugong Xu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Compared to monocular 3D object detection, stereo-based 3D methods offer significantly higher accuracy but still suffer from high computational overhead and latency. The state-of-the-art stereo 3D detection method achieves twice the accuracy of monocular approaches, yet its inference speed is only half as fast. In this paper, we propose StereoDETR, an efficient stereo 3D object detection framework based on DETR. StereoDETR consists of two branches: a monocular DETR branch and a stereo branch. The DETR branch is built upon 2D DETR with additional channels for predicting object scale, orientation, and sampling points. The stereo branch leverages low-cost multi-scale disparity features to predict object-level depth maps. The two branches are coupled solely through a differentiable depth sampling strategy. To handle occlusion, we introduce a constrained supervision strategy for the sampling points that requires no extra annotations. Compared with existing published monocular and stereo 3D detection methods, StereoDETR breaks the trade-off between speed and accuracy: with a concise framework, it achieves stereo-level accuracy while maintaining monocular-level inference speed.
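The abstract's "differentiable depth sampling" coupling can be illustrated with a small sketch. The details below are assumptions, not the paper's implementation: we suppose each object query predicts fractional-pixel sampling points, and the object's depth is read off the stereo branch's depth map by bilinear interpolation (so gradients flow to both the point locations and the depth map). The function names `bilinear_sample` and `object_depth` are hypothetical.

```python
import numpy as np

def bilinear_sample(depth_map, points):
    """Bilinearly sample a dense depth map of shape (H, W) at
    fractional pixel coordinates points of shape (N, 2), given as (x, y)."""
    H, W = depth_map.shape
    # Clamp just inside the image so the x0+1 / y0+1 neighbors stay in bounds.
    x = np.clip(points[:, 0], 0.0, W - 1 - 1e-4)
    y = np.clip(points[:, 1], 0.0, H - 1 - 1e-4)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    # Blend the four neighboring depth values with bilinear weights.
    top = depth_map[y0, x0] * (1 - wx) + depth_map[y0, x1] * wx
    bot = depth_map[y1, x0] * (1 - wx) + depth_map[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def object_depth(depth_map, sampling_points):
    # Average the depths at one object's predicted sampling points.
    # The bilinear weights make this differentiable w.r.t. the points
    # in an autodiff framework; NumPy is used here only for clarity.
    return bilinear_sample(depth_map, sampling_points).mean()
```

In a real detector this would run on framework tensors (e.g. via a `grid_sample`-style op) so that supervision on the sampled depth trains both branches jointly.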

Original language: English
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Publication status: Accepted/In press - 2025

Keywords

  • 3D Object Detection
  • Autonomous Driving
  • Binocular Images
  • Stereo Matching
