TY - JOUR
T1 - 3DBench
T2 - A scalable benchmark for object and scene-level instruction-tuning of 3D large language models
AU - Hu, Tianci
AU - Zhang, Junjie
AU - Rao, Yutao
AU - Zeng, Dan
AU - Yu, Hongwen
AU - Huang, Xiaoshui
N1 - Publisher Copyright:
© 2025 Elsevier Ltd
PY - 2025/9
Y1 - 2025/9
N2 - Recent assessments of Multi-Modal Large Language Models (MLLMs) have been thorough. However, a detailed benchmark that integrates point cloud data with language for MLLMs remains absent, leading to superficial comparisons that obscure advancements in the nuanced capabilities of such models. Current benchmarks typically feature object-level classification, scene-level captioning, and visual grounding (VG) tasks. These tasks fail to adequately capture the spatial perception and logical reasoning skills of MLLMs, nor do they permit a fair and comprehensive assessment of MLLMs with varied architectures. To address these gaps, we propose 3DBench, a novel fine-grained benchmark specifically designed for MLLMs. It encompasses ten tasks spanning object and scene levels and organizes these tasks into three evaluative categories: expression, perception, and reasoning. Additionally, we present a scalable approach for constructing 3D instruction-tuning datasets derived from simulation environments, resulting in a dataset with over 239k question–answer pairs covering twelve tasks and their respective point clouds. Using this high-quality dataset, we introduce the Bench-model, which integrates advanced detection models to significantly enhance MLLM performance. We compare the Bench-model against open-sourced 3D LLMs, analyzing the impact of different model architectures, training protocols, and public datasets. These experimental outcomes provide crucial perspectives on the limitations of existing research and suggest promising directions for future investigation. Codes and datasets are available at https://github.com/Inshsang/3DBench.
AB - Recent assessments of Multi-Modal Large Language Models (MLLMs) have been thorough. However, a detailed benchmark that integrates point cloud data with language for MLLMs remains absent, leading to superficial comparisons that obscure advancements in the nuanced capabilities of such models. Current benchmarks typically feature object-level classification, scene-level captioning, and visual grounding (VG) tasks. These tasks fail to adequately capture the spatial perception and logical reasoning skills of MLLMs, nor do they permit a fair and comprehensive assessment of MLLMs with varied architectures. To address these gaps, we propose 3DBench, a novel fine-grained benchmark specifically designed for MLLMs. It encompasses ten tasks spanning object and scene levels and organizes these tasks into three evaluative categories: expression, perception, and reasoning. Additionally, we present a scalable approach for constructing 3D instruction-tuning datasets derived from simulation environments, resulting in a dataset with over 239k question–answer pairs covering twelve tasks and their respective point clouds. Using this high-quality dataset, we introduce the Bench-model, which integrates advanced detection models to significantly enhance MLLM performance. We compare the Bench-model against open-sourced 3D LLMs, analyzing the impact of different model architectures, training protocols, and public datasets. These experimental outcomes provide crucial perspectives on the limitations of existing research and suggest promising directions for future investigation. Codes and datasets are available at https://github.com/Inshsang/3DBench.
KW - Evaluation metrics
KW - Fine-grained tasks
KW - Instruction-tuning datasets
KW - Multi-modal large language models
KW - Point clouds
UR - http://www.scopus.com/inward/record.url?scp=105004871884&partnerID=8YFLogxK
U2 - 10.1016/j.neunet.2025.107566
DO - 10.1016/j.neunet.2025.107566
M3 - Article
AN - SCOPUS:105004871884
SN - 0893-6080
VL - 189
JO - Neural Networks
JF - Neural Networks
M1 - 107566
ER -