Abstract
We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability of embodied cognition that is essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4×4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full six-degrees-of-freedom (6-DoF) reasoning. The dataset is publicly available to support further research. This work lays the groundwork for embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
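To make the data format concrete, the sketch below shows one plausible way a dataset instance could be represented and how the Z-axis distance used as the supervision target would be read off the ground-truth 4×4 pose matrix. All field names and values are illustrative assumptions, not the dataset's actual schema.

```python
import numpy as np

# Hypothetical record layout for one instance; field names and values are
# assumptions for illustration, not taken from the released dataset.
record = {
    "image_path": "scene_0001.png",           # RGB render from NVIDIA Omniverse
    "description": "A red cube sits on the table in front of the robot.",
    "pose": [                                 # ground-truth 4x4 transformation matrix
        [1.0, 0.0, 0.0, 0.00],
        [0.0, 1.0, 0.0, 0.00],
        [0.0, 0.0, 1.0, 2.35],                # translation along Z (distance to the object)
        [0.0, 0.0, 0.0, 1.00],
    ],
}

# For a homogeneous transform T = [[R, t], [0, 1]], the Z-axis distance is
# the third component of the translation vector t, i.e. T[2, 3].
T = np.array(record["pose"])
z_distance = T[2, 3]
print(f"Z-axis distance: {z_distance:.2f} m")
```

Extending from this single scalar target to full 6-DoF reasoning would amount to supervising on the complete rotation and translation of the same matrix rather than one translation component.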
URL
https://arxiv.org/abs/2505.14366