Abstract
Household robots operate in the same space for years. Such robots incrementally build dynamic maps that can be used for tasks requiring remote object localization. However, benchmarks in robot learning often test generalization through inference on tasks in unobserved environments. In an observed environment, locating an object reduces to choosing among all object proposals in the environment, which may number in the hundreds of thousands. Armed with this intuition, using only a generic vision-language scoring model with minor modifications for 3D encoding and operating in an embodied environment, we demonstrate an absolute performance gain of 9.84% on remote object grounding over state-of-the-art models on REVERIE and of 5.04% on FAO. When allowed to pre-explore an environment, we also exceed the previous state-of-the-art pre-exploration method on REVERIE. Additionally, we demonstrate our model on a real-world TurtleBot platform, highlighting the simplicity and usefulness of the approach. Our analysis outlines a "bag of tricks" essential for accomplishing this task, from utilizing 3D coordinates and context to generalizing vision-language models to large 3D search spaces.
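The framing of grounding as selection over stored proposals can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the `Proposal` record, the toy three-dimensional embeddings, and cosine similarity as the scorer are hypothetical stand-ins, not the paper's model, and the sketch does not show how the paper encodes 3D coordinates inside its scoring model.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Proposal:
    """Hypothetical record for one object proposal stored in the robot's map."""
    object_id: str
    feature: List[float]                   # stand-in visual embedding
    position: Tuple[float, float, float]   # 3D coordinates in the map frame


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, used here as a placeholder scoring function."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ground(query: List[float], proposals: List[Proposal]) -> Proposal:
    """Score every stored proposal against the query and return the best one."""
    return max(proposals, key=lambda p: cosine(query, p.feature))


# Toy map with three proposals; a real map may hold hundreds of thousands.
proposals = [
    Proposal("mug_kitchen", [0.9, 0.1, 0.0], (1.0, 2.0, 0.8)),
    Proposal("pillow_bedroom", [0.0, 0.2, 0.9], (6.5, 1.0, 0.5)),
    Proposal("lamp_hall", [0.1, 0.9, 0.1], (3.0, 4.0, 1.6)),
]

query = [0.8, 0.2, 0.1]  # stand-in for an encoded language instruction
best = ground(query, proposals)
print(best.object_id, best.position)
```

Because each proposal carries its map position, the selected object directly yields a remote location the robot could navigate to; the exhaustive `max` over all proposals is the simplest possible search and would likely need batching or indexing at the scale the abstract mentions.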
URL
https://arxiv.org/abs/2301.12614