Vision in Action: Learning Active Perception from Human Demonstrations

2025-06-18 17:43:55
Haoyu Xiong, Xiaomeng Xu, Jimmy Wu, Yifan Hou, Jeannette Bohg, Shuran Song

Abstract

We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot's physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot's latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.
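The latency-mitigation idea is a systems-level decoupling: the operator's headset renders immediately from its current head pose against a locally held 3D scene, while that scene is refreshed asynchronously whenever a new robot observation arrives. Below is a minimal Python sketch of this decoupling, assuming a point-cloud scene representation; `fetch_robot_pointcloud`, `get_operator_head_pose`, and `render_view` are illustrative stubs, not the paper's implementation.

```python
import random
import threading
import time

# --- Illustrative stubs (not the authors' code) ----------------------------

def fetch_robot_pointcloud():
    """Stand-in for the robot's observation stream; a real system would
    receive a fused 3D scene (e.g., a point cloud) over the network."""
    time.sleep(0.3)  # simulate robot-side sensing/transmission latency
    return [(random.random(), random.random(), random.random())
            for _ in range(1_000)]

def get_operator_head_pose():
    """Stand-in for the tracked VR headset pose."""
    return (0.0, 0.0, 0.0)

def render_view(points, pose):
    """Stand-in renderer: project the scene from the operator's pose."""
    return f"frame: {len(points)} points from pose {pose}"

# --- The decoupling itself --------------------------------------------------

class SharedScene:
    """Latest-observation buffer guarded by a lock, so the render loop
    always reads a complete scene while updates arrive asynchronously."""
    def __init__(self):
        self._lock = threading.Lock()
        self._points = []

    def update(self, points):
        with self._lock:
            self._points = points

    def snapshot(self):
        with self._lock:
            return self._points

def scene_updater(scene, stop):
    """Slow path: refresh the scene whenever a new robot observation lands."""
    while not stop.is_set():
        scene.update(fetch_robot_pointcloud())

def render_loop(scene, stop, fps=72):
    """Fast path: render at VR rate from the operator's *current* head pose,
    so head motion gets instant visual feedback even if the scene is stale."""
    period = 1.0 / fps
    while not stop.is_set():
        render_view(scene.snapshot(), get_operator_head_pose())
        time.sleep(period)  # a real loop would vsync to the headset display

if __name__ == "__main__":
    scene, stop = SharedScene(), threading.Event()
    threading.Thread(target=scene_updater, args=(scene, stop), daemon=True).start()
    threading.Thread(target=render_loop, args=(scene, stop), daemon=True).start()
    time.sleep(2.0)
    stop.set()
```

The latest-only buffer means a slow robot update can never stall the render loop, which is what keeps head-motion-to-photon latency low on the operator side and, per the abstract, mitigates VR motion sickness.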

URL

https://arxiv.org/abs/2506.15666

PDF

https://arxiv.org/pdf/2506.15666.pdf

