Paper Reading AI Learner

3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset

2024-04-23 02:06:10
Junjie Zhang, Tianci Hu, Xiaoshui Huang, Yongshun Gong, Dan Zeng

Abstract

Evaluating the performance of Multi-modal Large Language Models (MLLMs) that integrate point clouds and language presents significant challenges. The lack of a comprehensive assessment makes it difficult to determine whether these models truly represent advancements, thereby impeding further progress in the field. Current evaluations rely heavily on classification and captioning tasks and fall short of providing a thorough assessment of MLLMs. A more sophisticated evaluation method is urgently needed, one capable of thoroughly analyzing the spatial understanding and expressive capabilities of these models. To address these issues, we introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset, known as 3DBench, providing an extensible platform for a comprehensive evaluation of MLLMs. Specifically, we establish a benchmark that spans a wide range of spatial and semantic scales, from the object level to the scene level, addressing both perception and planning tasks. Furthermore, we present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total. Thorough experiments evaluating trending MLLMs, comparisons against existing datasets, and variations of training protocols demonstrate the superiority of 3DBench, offering valuable insights into current limitations and potential research directions.
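To make the dataset concept concrete: an instruction-tuning dataset of QA pairs over 3D scenes is typically stored as one record per pair, tying a point-cloud identifier and task label to a question and reference answer. The following is a minimal, hypothetical sketch of such a record; the field names and helper are illustrative assumptions, not 3DBench's actual schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical record for one QA pair in a 3D instruction-tuning dataset.
# Field names are assumptions for illustration, not the paper's schema.
@dataclass
class QAPair:
    scene_id: str   # identifier of the point-cloud object or scene
    task: str       # one of the multi-modal tasks (e.g. "object_detection")
    question: str   # natural-language instruction or question
    answer: str     # reference answer used for evaluation

def make_qa_pair(scene_id: str, task: str, question: str, answer: str) -> dict:
    """Build a single QA record as a plain dict (e.g. for JSONL export)."""
    return asdict(QAPair(scene_id, task, question, answer))

record = make_qa_pair("scene_0001", "object_detection",
                      "How many chairs are in the room?", "4")
```

A pipeline like the one described in the abstract would emit hundreds of thousands of such records across its 10 tasks, which a benchmark harness can then group by `task` to score perception and planning abilities separately.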


URL

https://arxiv.org/abs/2404.14678

PDF

https://arxiv.org/pdf/2404.14678.pdf

