TY - JOUR
T1 - StereoDETR
T2 - Stereo-based Transformer for 3D Object Detection
AU - Mu, Shiyi
AU - Gu, Zichong
AU - Ai, Zhiqi
AU - Liu, Anqi
AU - Gao, Yilin
AU - Xu, Shugong
N1 - Publisher Copyright:
© 2025 IEEE. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Compared to monocular 3D object detection, stereo-based 3D methods offer significantly higher accuracy but still suffer from high computational overhead and latency. The state-of-the-art stereo 3D detection method achieves twice the accuracy of monocular approaches, yet runs at only half their inference speed. In this paper, we propose StereoDETR, an efficient stereo 3D object detection framework based on DETR. StereoDETR consists of two branches: a monocular DETR branch and a stereo branch. The DETR branch is built upon 2D DETR with additional channels for predicting object scale, orientation, and sampling points. The stereo branch leverages low-cost multi-scale disparity features to predict object-level depth maps. The two branches are coupled solely through a differentiable depth sampling strategy. To handle occlusion, we introduce a constrained supervision strategy for sampling points that requires no extra annotations. Compared with existing published monocular and binocular 3D detection methods, StereoDETR breaks the trade-off between speed and accuracy: with a concise framework, it achieves binocular-level accuracy while maintaining monocular-level inference speed.
AB - Compared to monocular 3D object detection, stereo-based 3D methods offer significantly higher accuracy but still suffer from high computational overhead and latency. The state-of-the-art stereo 3D detection method achieves twice the accuracy of monocular approaches, yet runs at only half their inference speed. In this paper, we propose StereoDETR, an efficient stereo 3D object detection framework based on DETR. StereoDETR consists of two branches: a monocular DETR branch and a stereo branch. The DETR branch is built upon 2D DETR with additional channels for predicting object scale, orientation, and sampling points. The stereo branch leverages low-cost multi-scale disparity features to predict object-level depth maps. The two branches are coupled solely through a differentiable depth sampling strategy. To handle occlusion, we introduce a constrained supervision strategy for sampling points that requires no extra annotations. Compared with existing published monocular and binocular 3D detection methods, StereoDETR breaks the trade-off between speed and accuracy: with a concise framework, it achieves binocular-level accuracy while maintaining monocular-level inference speed.
KW - 3D Object Detection
KW - Autonomous Driving
KW - Binocular Images
KW - Stereo Matching
UR - https://www.scopus.com/pages/publications/105023194475
U2 - 10.1109/TCSVT.2025.3636925
DO - 10.1109/TCSVT.2025.3636925
M3 - Article
AN - SCOPUS:105023194475
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -