Abstract
Humans describe the physical world using natural language, referring to specific 3D locations based on a vast range of properties: visual appearance, semantics, abstract associations, or actionable affordances. In this work we propose Language Embedded Radiance Fields (LERFs), a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF, which enables these kinds of open-ended language queries in 3D. LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays, supervising these embeddings across training views to provide multi-view consistency and smooth the underlying language field. After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real time, with potential use cases in robotics, understanding vision-language models, and interacting with 3D scenes. LERF enables pixel-aligned, zero-shot queries on the distilled 3D CLIP embeddings without relying on region proposals or masks, supporting long-tail open-vocabulary queries hierarchically across the volume. The project website can be found at this https URL.
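The core operations the abstract describes, volume rendering a language embedding along a ray and scoring it against a text prompt, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the helper names (`render_embedding`, `relevancy`), the toy dimensions, and the use of plain cosine similarity as the relevancy score are all assumptions for illustration (LERF's actual relevancy score also normalizes against canonical negative phrases).

```python
import numpy as np

def render_embedding(embeddings, weights):
    """Volume-render per-sample language embeddings along one ray (sketch).

    embeddings: (S, D) unit-norm CLIP-like embeddings at S ray samples
    weights:    (S,) volume-rendering (alpha-compositing) weights
    Returns one (D,) rendered embedding, re-normalized to the unit sphere,
    mirroring how color is alpha-composited in NeRF.
    """
    rendered = weights @ embeddings  # weighted sum over samples along the ray
    return rendered / np.linalg.norm(rendered)

def relevancy(rendered, text_embedding):
    """Cosine similarity between a rendered embedding and a text query."""
    return float(rendered @ text_embedding / np.linalg.norm(text_embedding))

# Toy example: 4 ray samples with 8-dim embeddings (real CLIP uses 512+ dims)
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
w = np.array([0.1, 0.5, 0.3, 0.05])     # rendering weights from NeRF density
query = rng.normal(size=8)              # stand-in for a CLIP text embedding
score = relevancy(render_embedding(emb, w), query)  # cosine score in [-1, 1]
```

Rendering the embedding per ray and then comparing to the text query (rather than querying per 3D sample) is what lets the same supervised field answer arbitrary prompts at query time without region proposals or masks.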
URL
https://arxiv.org/abs/2303.09553