3DBench: A scalable benchmark for object and scene-level instruction-tuning of 3D large language models

Tianci Hu, Junjie Zhang*, Yutao Rao, Dan Zeng, Hongwen Yu, Xiaoshui Huang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Recent evaluations of Multi-Modal Large Language Models (MLLMs) have been extensive. However, a detailed benchmark that pairs point cloud data with language for MLLMs remains absent, leading to superficial comparisons that obscure progress in the nuanced capabilities of such models. Current benchmarks typically feature object-level classification, scene-level captioning, and visual grounding (VG) tasks. These tasks neither adequately capture the spatial perception and logical reasoning abilities of MLLMs nor permit a fair and comprehensive assessment of MLLMs with varied architectures. To address these gaps, we propose 3DBench, a novel fine-grained benchmark specifically designed for MLLMs. It encompasses ten tasks spanning object and scene levels and organizes them into three evaluative categories: expression, perception, and reasoning. Additionally, we present a scalable approach for constructing 3D instruction-tuning datasets from simulation environments, yielding a dataset with over 239k question–answer pairs covering twelve tasks and their respective point clouds. Using this high-quality dataset, we introduce the Bench-model, which integrates advanced detection models to significantly enhance MLLM performance. We compare the Bench-model against open-source 3D LLMs, analyzing the impact of different model architectures, training protocols, and public datasets. These experimental results offer crucial perspectives on the limitations of existing research and suggest promising directions for future investigation. Code and datasets are available at https://github.com/Inshsang/3DBench.
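
To make the dataset structure concrete, the sketch below shows what a single instruction-tuning record pairing a point cloud with a question–answer pair might look like. This is a hypothetical illustration only: the field names, task label, and file layout are assumptions for exposition, not the actual 3DBench schema.

```python
# Minimal sketch of one question-answer record in a 3D instruction-tuning
# dataset. All field names and values are illustrative assumptions, not the
# released 3DBench format.
import json

record = {
    "point_cloud": "scenes/scene_0001.npy",  # hypothetical path to an (N, 6) xyz+rgb array
    "level": "scene",                        # "object" or "scene"
    "category": "reasoning",                 # "expression", "perception", or "reasoning"
    "task": "spatial_relation",              # one of the benchmark's fine-grained tasks
    "question": "Which object is closest to the sofa?",
    "answer": "The coffee table is closest to the sofa.",
}

# Records like this could be serialized one per line (JSONL) for training.
print(json.dumps(record, indent=2))
```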

Original language: English
Article number: 107566
Journal: Neural Networks
Volume: 189
Publication status: Published - Sept 2025

Keywords

  • Evaluation metrics
  • Fine-grained tasks
  • Instruction-tuning datasets
  • Multi-modal large language models
  • Point clouds
