From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems

2025-05-21 16:01:11
Xiuchao Sui, Daiying Tian, Qi Sun, Ruirui Chen, Dongkyu Choi, Kenneth Kwok, Soujanya Poria

Abstract

Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored, particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, and modular pipelines incorporating either vision-language models (VLMs) or multimodal large language models (MLLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA finetuning. Our experiments in zero-shot and few-shot settings reveal trade-offs in generalization and data efficiency. By exploring performance limits, we distill design implications for developing language-driven physical agents and outline emerging challenges and opportunities for FM-powered robotics in real-world conditions.
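To make the three paradigms concrete, the sketch below contrasts them as interchangeable policies behind one observation/action interface. This is a minimal illustration, not code from the paper: the class names, the Observation/Action containers, and the placeholder grounding and planning outputs are all hypothetical stand-ins for real VLA, VLM, and multimodal-LLM backends.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Observation:
    image: bytes       # raw camera frame
    instruction: str   # natural-language command


@dataclass
class Action:
    pose: tuple[float, ...]  # end-effector target, e.g. (x, y, z, roll, pitch, yaw)
    gripper_open: bool


class Policy(Protocol):
    def act(self, obs: Observation) -> Action: ...


class EndToEndVLA:
    """Paradigm 1: a single VLA model maps (image, instruction) -> action."""

    def act(self, obs: Observation) -> Action:
        # A finetuned VLA would decode action tokens here; we return a
        # fixed placeholder pose.
        return Action(pose=(0.0,) * 6, gripper_open=True)


class VLMPipeline:
    """Paradigm 2: a VLM grounds the instruction to an image region, then a
    hand-written skill turns that region into a motion target."""

    def act(self, obs: Observation) -> Action:
        box = self._ground(obs)       # VLM: instruction + image -> bounding box
        return self._pick_skill(box)  # scripted motion primitive

    def _ground(self, obs: Observation) -> tuple[int, int, int, int]:
        return (40, 60, 120, 140)     # placeholder box from the assumed VLM

    def _pick_skill(self, box: tuple[int, int, int, int]) -> Action:
        cx = (box[0] + box[2]) / 2    # approach the centre of the box
        cy = (box[1] + box[3]) / 2
        return Action(pose=(cx, cy, 0.05, 0.0, 0.0, 0.0), gripper_open=False)


class MLLMPlanner:
    """Paradigm 3: a multimodal LLM decomposes the instruction into named
    skills executed by lower-level controllers."""

    def act(self, obs: Observation) -> Action:
        plan = self._plan(obs.instruction)  # MLLM: instruction -> skill names
        step = plan[0]                      # execute only the first step here
        return Action(pose=(0.0,) * 6, gripper_open=(step == "release"))

    def _plan(self, instruction: str) -> list[str]:
        return ["pick", "place", "release"]  # placeholder plan from the assumed MLLM


if __name__ == "__main__":
    obs = Observation(image=b"", instruction="put the red mug on the shelf")
    for policy in (EndToEndVLA(), VLMPipeline(), MLLMPlanner()):
        print(type(policy).__name__, policy.act(obs))
```

A shared interface like this mirrors the comparison the paper draws: the paradigms differ in where language understanding ends and motor control begins, not in their observation/action contract, which is what makes zero-shot and few-shot evaluation across them possible.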

URL

https://arxiv.org/abs/2505.15685

PDF

https://arxiv.org/pdf/2505.15685.pdf

