A streamlined framework for BEV-based 3D object detection with prior masking

Qinglin Tong; Junjie Zhang; Chenggang Yan; Dan Zeng

doi:10.1016/j.imavis.2024.105229

Corresponding author: Dan Zeng

Research output: Contribution to journal › Article › peer-review

Abstract

In the field of autonomous driving, perception tasks based on Bird's-Eye-View (BEV) have attracted considerable research attention due to their numerous benefits. Despite recent advancements in performance, efficiency remains a challenge for real-world implementation. In this study, we propose an efficient and effective framework that constructs a spatio-temporal BEV feature from multi-camera inputs and leverages it for 3D object detection. Specifically, the success of our network is primarily attributed to the design of the lifting strategy and a tailored BEV encoder. The lifting strategy is tasked with the conversion of 2D features into 3D representations. In the absence of depth information in the images, we innovatively introduce a prior mask for the BEV feature, which can assess the significance of the feature along the camera ray at a low cost. Moreover, we design a lightweight BEV encoder, which significantly boosts the capacity of this physical-interpretation representation. In the encoder, we investigate the spatial relationships of the BEV feature and retain rich residual information from upstream. To further enhance performance, we establish a 2D object detection auxiliary head to delve into insights offered by 2D object detection and leverage the 4D information to explore the cues within the sequence. Benefiting from all these designs, our network can capture abundant semantic information from 3D scenes and strikes a balanced trade-off between efficiency and performance.

Original language	English
Article number	105229
Journal	Image and Vision Computing
Volume	150
DOIs	https://doi.org/10.1016/j.imavis.2024.105229
Publication status	Published - Oct 2024
Externally published	Yes

Keywords

3D object detection
Autonomous driving
bird's-eye-view (BEV) representation
Multi-camera

Cite this

@article{b0e4f30208fe4595a4991b56ab438ef4,

title = "A streamlined framework for {BEV}-based {3D} object detection with prior masking",

abstract = "In the field of autonomous driving, perception tasks based on Bird's-Eye-View (BEV) have attracted considerable research attention due to their numerous benefits. Despite recent advancements in performance, efficiency remains a challenge for real-world implementation. In this study, we propose an efficient and effective framework that constructs a spatio-temporal BEV feature from multi-camera inputs and leverages it for 3D object detection. Specifically, the success of our network is primarily attributed to the design of the lifting strategy and a tailored BEV encoder. The lifting strategy is tasked with the conversion of 2D features into 3D representations. In the absence of depth information in the images, we innovatively introduce a prior mask for the BEV feature, which can assess the significance of the feature along the camera ray at a low cost. Moreover, we design a lightweight BEV encoder, which significantly boosts the capacity of this physical-interpretation representation. In the encoder, we investigate the spatial relationships of the BEV feature and retain rich residual information from upstream. To further enhance performance, we establish a 2D object detection auxiliary head to delve into insights offered by 2D object detection and leverage the 4D information to explore the cues within the sequence. Benefiting from all these designs, our network can capture abundant semantic information from 3D scenes and strikes a balanced trade-off between efficiency and performance.",

keywords = "3D object detection, Autonomous driving, bird's-eye-view (BEV) representation, Multi-camera",

author = "Qinglin Tong and Junjie Zhang and Chenggang Yan and Dan Zeng",

note = "Publisher Copyright: {\textcopyright} 2024",

year = "2024",

month = oct,

doi = "10.1016/j.imavis.2024.105229",

language = "English",

volume = "150",

journal = "Image and Vision Computing",

issn = "0262-8856",

}

TY - JOUR

T1 - A streamlined framework for BEV-based 3D object detection with prior masking

AU - Tong, Qinglin

AU - Zhang, Junjie

AU - Yan, Chenggang

AU - Zeng, Dan

PY - 2024/10

Y1 - 2024/10

AB - In the field of autonomous driving, perception tasks based on Bird's-Eye-View (BEV) have attracted considerable research attention due to their numerous benefits. Despite recent advancements in performance, efficiency remains a challenge for real-world implementation. In this study, we propose an efficient and effective framework that constructs a spatio-temporal BEV feature from multi-camera inputs and leverages it for 3D object detection. Specifically, the success of our network is primarily attributed to the design of the lifting strategy and a tailored BEV encoder. The lifting strategy is tasked with the conversion of 2D features into 3D representations. In the absence of depth information in the images, we innovatively introduce a prior mask for the BEV feature, which can assess the significance of the feature along the camera ray at a low cost. Moreover, we design a lightweight BEV encoder, which significantly boosts the capacity of this physical-interpretation representation. In the encoder, we investigate the spatial relationships of the BEV feature and retain rich residual information from upstream. To further enhance performance, we establish a 2D object detection auxiliary head to delve into insights offered by 2D object detection and leverage the 4D information to explore the cues within the sequence. Benefiting from all these designs, our network can capture abundant semantic information from 3D scenes and strikes a balanced trade-off between efficiency and performance.

KW - 3D object detection

KW - Autonomous driving

KW - bird's-eye-view (BEV) representation

KW - Multi-camera

UR - http://www.scopus.com/inward/record.url?scp=85202722760&partnerID=8YFLogxK

U2 - 10.1016/j.imavis.2024.105229

DO - 10.1016/j.imavis.2024.105229

M3 - Article

AN - SCOPUS:85202722760

SN - 0262-8856

VL - 150

JO - Image and Vision Computing

JF - Image and Vision Computing

M1 - 105229

ER -
