Abstract
Perceiving and understanding complex 3D environments is paramount for the safety and efficiency of intelligent transportation systems. 3D occupancy perception provides a comprehensive representation of the surroundings by encoding geometric volumes alongside semantic labels. However, existing approaches often struggle to preserve fine-grained geometric details due to the inherent constraints of camera-based image features, while 3D perception sensors such as LiDAR are limited by the sparsity of the resulting point clouds (voxels). To address these limitations, this paper introduces M2DG-TPV, a model that employs a cross-attention-based module to integrate image and voxel features. It lifts 2D image features to a simplified 3D representation through an efficient matrix-based view transformation and enhances voxel features in occluded regions via a masked 2D Gaussian recovery method. Experimental evaluations on the nuScenes dataset show that M2DG-TPV achieves an absolute improvement of 2.0 mIoU over state-of-the-art methods and surpasses existing approaches in 9 of 16 semantic classes, while using fewer model parameters and preserving finer geometric structures. Additional evaluation on the SemanticKITTI dataset further demonstrates its cross-dataset generalizability.
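The abstract names the cross-attention fusion module only at a high level. As an illustration, the following is a minimal PyTorch sketch of one way such image-voxel cross-attention could look, with flattened voxel tokens as queries attending to image tokens as keys and values. The class name, dimensions, and residual design are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the paper's released code) of cross-attention fusion
# between lifted image features and voxel features.
import torch
import torch.nn as nn

class ImageVoxelCrossAttention(nn.Module):
    """Voxel tokens attend to image tokens; all names and dims are illustrative."""
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxel_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (B, N_vox, C) queries; image_feats: (B, N_img, C) keys/values.
        fused, _ = self.attn(voxel_feats, image_feats, image_feats)
        # Residual connection keeps the original (possibly sparse) voxel signal
        # while injecting dense image context.
        return self.norm(voxel_feats + fused)

# Example: 8 voxel tokens attending to 16 image tokens, channel dim 128.
fusion = ImageVoxelCrossAttention()
out = fusion(torch.randn(2, 8, 128), torch.randn(2, 16, 128))
print(out.shape)  # torch.Size([2, 8, 128])
```

Treating voxels as queries matches the stated goal of enhancing sparse voxel features with dense camera context; the actual module, view transformation, and masked Gaussian recovery are detailed in the paper itself.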
| Original language | English |
|---|---|
| Pages (from-to) | 21105-21118 |
| Number of pages | 14 |
| Journal | IEEE Access |
| Volume | 14 |
| DOIs | |
| Publication status | Published - 2026 |
Keywords
- attention
- autonomous driving
- multi-sensor fusion
- multi-view camera perception
- occupancy prediction