Abstract
3D visual grounding is a challenging task that often requires direct and dense supervision, notably the semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting, which learns from only 3D scene and QA pairs and in which prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach is based on two core insights: first, language constraints (e.g., a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; second, we can query large language models to distill such constraints from language properties. We show that LARC improves the performance of prior works on naturally supervised 3D visual grounding, and demonstrates a wide range of 3D visual reasoning capabilities, from zero-shot composition to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors, for learning in settings without dense supervision.
URL
https://arxiv.org/abs/2404.19696