M2DG-TPV: A Matrix-Based Tri-Perspective View 3-D Perception Model With 2-D Gaussian Recovered Voxels

Research output: Contribution to journal › Article › peer-review

Abstract

Perceiving and understanding complex 3D environments is paramount for ensuring the safety and efficiency of intelligent transportation systems. 3D occupancy perception provides a comprehensive representation of the surroundings by encoding geometric volumes alongside semantic labels. However, existing approaches often struggle to preserve fine-grained geometric details due to the inherent constraints of camera-based image features, while 3D perception sensors such as LiDAR are limited by the sparsity of the resulting point clouds (voxels). To address these limitations, this paper introduces M2DG-TPV, a model that employs a cross-attention-based module to integrate image and voxel features. It lifts 2D image features to a simplified 3D representation through an efficient matrix-based view transformation and enhances voxel features in occluded regions via a masked 2D Gaussian recovery method. Experimental evaluations on the nuScenes dataset show that M2DG-TPV achieves an absolute improvement of 2.0 mIoU over state-of-the-art methods and surpasses existing approaches in 9 out of 16 semantic classes, while using fewer model parameters and preserving finer geometric structures. Additional evaluation on the SemanticKITTI dataset further demonstrates the model's cross-dataset generalizability.
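As a rough illustration of the fusion idea described in the abstract (not the authors' implementation), the sketch below shows a minimal cross-attention block in PyTorch in which sparse voxel tokens query image-derived tokens; all module names, tensor shapes, and hyperparameters here are assumptions for illustration only.

```python
# Minimal sketch (assumption, not the paper's code): voxel features act as
# queries and image-derived features as keys/values in a cross-attention block.
import torch
import torch.nn as nn


class VoxelImageCrossAttention(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxel_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (B, N_vox, C) sparse voxel tokens (e.g. from LiDAR)
        # image_feats: (B, N_img, C) flattened image / view-plane tokens
        fused, _ = self.attn(query=voxel_feats, key=image_feats, value=image_feats)
        return self.norm(voxel_feats + fused)  # residual connection


if __name__ == "__main__":
    block = VoxelImageCrossAttention(dim=128, num_heads=4)
    vox = torch.randn(2, 1024, 128)   # 1024 voxel tokens per sample
    img = torch.randn(2, 4096, 128)   # 4096 image-feature tokens per sample
    print(block(vox, img).shape)      # torch.Size([2, 1024, 128])
```

The residual connection keeps the original voxel evidence intact while the attention output injects complementary image context, which is one common way such camera-LiDAR fusion modules are structured.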

Original language: English
Pages (from-to): 21105-21118
Number of pages: 14
Journal: IEEE Access
Volume: 14
DOIs
Publication status: Published - 2026

Keywords

  • attention
  • autonomous driving
  • multi-sensor fusion
  • multi-view camera perception
  • occupancy prediction
