Paper Reading AI Learner

ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

2026-01-13 08:29:07
Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, Yanwei Fu

Abstract

Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models cannot adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance on long-horizon tasks and in fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that equips robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm with two stages: (1) Critical region localization. ActiveVLA renders 3D inputs into multi-view 2D projections, identifies critical 3D regions, and supports dynamic spatial awareness. (2) Active perception optimization. Guided by the localized critical regions, ActiveVLA applies an active view selection strategy that chooses viewpoints to maximize amodal relevance and diversity while minimizing occlusion, and performs a 3D zoom-in to increase resolution in the key areas. Together, these steps enable finer-grained active perception for precise manipulation. Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks. Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.
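The abstract states the view-selection objective only qualitatively (maximize relevance and diversity, minimize occlusion) and does not give a scoring function. The sketch below is a minimal Python illustration of one plausible greedy trade-off, plus a simple radius crop standing in for the 3D zoom-in; all names, weights, and the occlusion_ratio placeholder are hypothetical and not taken from the paper.

    import numpy as np

    def occlusion_ratio(cam, region_center):
        """Placeholder visibility check (hypothetical). A real system would
        ray-cast from the camera toward the region through the scene's
        point cloud or mesh and return the fraction of blocked rays; here
        we assume free space."""
        return 0.0

    def select_active_views(candidates, region_center, k=3,
                            w_rel=1.0, w_div=0.5, w_occ=1.0):
        """Greedily pick k camera positions that best observe a localized
        critical region, trading off relevance (proximity to the region),
        diversity (distance to already chosen views), and occlusion."""
        candidates = np.asarray(candidates, dtype=float)
        region_center = np.asarray(region_center, dtype=float)
        chosen, remaining = [], list(range(len(candidates)))
        while len(chosen) < k and remaining:
            best_i, best_score = None, -np.inf
            for i in remaining:
                cam = candidates[i]
                # Relevance: views closer to the critical region score higher.
                rel = 1.0 / (1.0 + np.linalg.norm(cam - region_center))
                # Diversity: distance to the nearest already-selected view,
                # capped so it acts as a bounded bonus (first pick gets 1.0).
                div = min((np.linalg.norm(cam - candidates[j]) for j in chosen),
                          default=1.0)
                div = min(div, 1.0)
                occ = occlusion_ratio(cam, region_center)
                score = w_rel * rel + w_div * div - w_occ * occ
                if score > best_score:
                    best_i, best_score = i, score
            chosen.append(best_i)
            remaining.remove(best_i)
        return candidates[chosen]

    def zoom_in_3d(points, region_center, radius=0.15):
        """Crop a point cloud to a ball around the critical region so it
        can be re-rendered at higher effective resolution (a hypothetical
        stand-in for the paper's 3D zoom-in)."""
        points = np.asarray(points, dtype=float)
        mask = np.linalg.norm(points - np.asarray(region_center), axis=1) <= radius
        return points[mask]

    # Example: six candidate poses on a ring around a region at the origin.
    ring = [(np.cos(t), np.sin(t), 0.5)
            for t in np.linspace(0, 2 * np.pi, 6, endpoint=False)]
    print(select_active_views(ring, region_center=(0, 0, 0), k=3))

A greedy argmax keeps selection at O(kN) score evaluations; the paper may well use a different objective or a learned scorer, so treat these weights purely as illustration.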

URL

https://arxiv.org/abs/2601.08325

PDF

https://arxiv.org/pdf/2601.08325.pdf

