Abstract
Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or from 3D scenes reconstructed by auxiliary models. Text-image aligned 2D features from CLIP are then lifted onto these point clouds, which serve as inputs for LLMs. However, this pipeline does not establish 3D point-to-point connections, so spatial structure information is lost. Moreover, because the geometric and semantic representations of the scene are never integrated into a unified form, 3D scene understanding suffers. In this paper, we demonstrate the importance of a unified scene representation and reconstruction framework for LLMs operating in 3D scenes. Specifically, we introduce Uni3DR^2, which extracts 3D geometric and semantic-aware representation features via frozen pre-trained 2D foundation models (e.g., CLIP and SAM) and a multi-scale aggregation 3D decoder. The learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs. Experimental results validate that Uni3DR^2 yields convincing gains over the baseline on the 3D reconstruction dataset ScanNet (increasing F-Score by +1.8%). When applied to LLMs, our Uni3DR^2-LLM outperforms the baseline on the 3D vision-language understanding dataset ScanQA (increasing BLEU-1 by +4.0% and +4.2% on the val and test sets, respectively), and it also surpasses the state-of-the-art method that uses additional GT point clouds on both ScanQA and 3DMV-VQA.
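To make the pipeline the abstract describes concrete, here is a minimal PyTorch sketch of the overall data flow: frozen 2D encoders produce features, which are lifted into a voxel volume and passed through a multi-scale 3D decoder to yield a unified geometric and semantic representation for both reconstruction and an LLM. Everything here is an illustrative assumption: the module names (`Frozen2DEncoder`, `MultiScale3DDecoder`, `lift_to_voxels`), the shapes, and the naive feature lifting are placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of the Uni3DR^2-style pipeline described in the abstract.
# All architecture details below are assumptions for illustration only.
import torch
import torch.nn as nn

class Frozen2DEncoder(nn.Module):
    """Stand-in for a frozen 2D foundation model (e.g., a CLIP/SAM image encoder)."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # toy patch embedding
        for p in self.parameters():
            p.requires_grad = False  # frozen, as in the abstract

    def forward(self, images):           # (B, 3, H, W)
        return self.backbone(images)     # (B, C, H/16, W/16) 2D feature map

class MultiScale3DDecoder(nn.Module):
    """Toy multi-scale aggregation decoder over a voxel feature volume."""
    def __init__(self, dim=256):
        super().__init__()
        self.coarse = nn.Conv3d(dim, dim, 3, stride=2, padding=1)
        self.up = nn.ConvTranspose3d(dim, dim, 2, stride=2)
        self.fuse = nn.Conv3d(2 * dim, dim, 1)
        self.occ_head = nn.Conv3d(dim, 1, 1)   # geometry head (occupancy-like output)

    def forward(self, vol):                          # (B, C, D, H, W)
        c = self.up(self.coarse(vol))                # coarse scale, upsampled back
        feat = self.fuse(torch.cat([vol, c], dim=1))  # aggregate the two scales
        return feat, self.occ_head(feat)             # unified features + geometry

def lift_to_voxels(feats2d, grid_size=(16, 16, 16)):
    """Placeholder 'lifting': broadcast pooled 2D features into every voxel.
    A real system would back-project along camera rays using poses/intrinsics."""
    b, c = feats2d.shape[:2]
    pooled = feats2d.mean(dim=(2, 3))                # (B, C)
    d, h, w = grid_size
    return pooled.view(b, c, 1, 1, 1).expand(b, c, d, h, w).contiguous()

encoder, decoder = Frozen2DEncoder(), MultiScale3DDecoder()
images = torch.randn(2, 3, 224, 224)                 # a batch of posed RGB views
with torch.no_grad():
    feats2d = encoder(images)
volume = lift_to_voxels(feats2d)                     # (2, 256, 16, 16, 16)
scene_feat, occupancy = decoder(volume)
llm_tokens = scene_feat.flatten(2).transpose(1, 2)   # (B, N_voxels, C) tokens for an LLM
print(llm_tokens.shape, occupancy.shape)
```

The key design point the sketch mirrors is that a single feature volume feeds both heads: the geometry output supervises reconstruction, while the same features are flattened into token sequences consumable by an LLM, rather than keeping geometric and semantic representations separate.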
URL
https://arxiv.org/abs/2404.13044