Abstract
Evaluating the performance of Multi-modal Large Language Models (MLLMs) that integrate point clouds and language presents significant challenges. The lack of a comprehensive assessment makes it difficult to determine whether these models truly represent advancements, thereby impeding further progress in the field. Current evaluations rely heavily on classification and captioning tasks and fall short of providing a thorough assessment of MLLMs. A pressing need exists for a more sophisticated evaluation method capable of thoroughly analyzing the spatial understanding and expressive capabilities of these models. To address these issues, we introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset known as 3DBench, providing an extensible platform for the comprehensive evaluation of MLLMs. Specifically, we establish a benchmark that spans a wide range of spatial and semantic scales, from object level to scene level, addressing both perception and planning tasks. Furthermore, we present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total. Thorough experiments evaluating trending MLLMs, comparisons against existing datasets, and variations in training protocols demonstrate the superiority of 3DBench, offering valuable insights into current limitations and potential research directions.
URL
https://arxiv.org/abs/2404.14678