Abstract
We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot's physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot's latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.
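The latency-mitigation idea in the abstract, rendering the operator's VR view in real time from a locally held 3D scene while that scene is refreshed asynchronously from the robot's observations, can be illustrated with a small concurrency sketch. This is not the authors' implementation; the class and function names (SharedScene, scene_updater, vr_render_loop), the update and render rates, and the scene payload are all illustrative assumptions.

```python
# Minimal sketch (hypothetical, not the ViA codebase): decouple a steady
# operator-side render loop from slower, asynchronous robot scene updates.
import threading
import time


class SharedScene:
    """Holds the latest 3D scene representation; swapped atomically by the updater."""

    def __init__(self):
        self._lock = threading.Lock()
        self._scene = {"timestamp": 0.0, "points": []}  # placeholder payload

    def update(self, scene):
        with self._lock:
            self._scene = scene

    def snapshot(self):
        with self._lock:
            return self._scene


def robot_observation_stream():
    """Stand-in for the robot's latency-prone observation pipeline."""
    while True:
        time.sleep(0.2)  # e.g., ~5 Hz scene updates arriving from the robot
        yield {"timestamp": time.time(), "points": []}


def scene_updater(shared: SharedScene):
    """Asynchronously fold each new robot observation into the shared scene."""
    for obs in robot_observation_stream():
        shared.update(obs)


def vr_render_loop(shared: SharedScene, duration_s: float = 2.0):
    """Render the operator's view at a steady rate, independent of robot latency."""
    end = time.time() + duration_s
    while time.time() < end:
        scene = shared.snapshot()
        # A real system would render `scene` from the headset's current pose here.
        print(f"render frame from scene @ {scene['timestamp']:.3f}")
        time.sleep(1 / 60)  # ~60 Hz rendering keeps the headset view responsive


if __name__ == "__main__":
    shared = SharedScene()
    threading.Thread(target=scene_updater, args=(shared,), daemon=True).start()
    vr_render_loop(shared)
```

The design point this mirrors is that headset frames never block on the robot: stale-but-smooth views are served from the intermediate scene representation, which is the property the abstract credits with reducing VR motion sickness.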
Abstract (translated)
We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perception strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA uses a simple yet effective 6-DoF robotic neck that gives the robot's head more flexible, human-like movement. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space in which the robot and the human operator work together. To avoid the VR motion sickness caused by latency in the robot's physical movements, the interface adopts an intermediate 3D scene representation: views are rendered in real time on the operator's side while the scene is updated asynchronously with the robot's latest observations. Together, these design elements allow ViA to learn robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.
URL
https://arxiv.org/abs/2506.15666