
Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

2025-06-18 17:58:17
Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang

Abstract

AI agents today are mostly siloed: they either retrieve and reason over vast amounts of digital information and knowledge obtained online, or interact with the physical world through embodied perception, planning, and action, but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation, all requiring coordinated reasoning across physical and digital realms, for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, code, and websites are publicly available on our project page.

Abstract (translated)

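Today's AI agents mostly operate in isolation: they either retrieve and reason over large amounts of digital information and knowledge obtained from the internet, or interact with the physical world through embodied perception, planning, and action, but rarely do both at once. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce a new paradigm for AI agents, Embodied Web Agents, which fluidly combines embodiment with web-scale reasoning. To realize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. On this basis, we construct and release the Embodied Web Agents Benchmark, which covers diverse tasks including cooking, navigation, shopping, tourism, and geolocation, all of which require coordinated reasoning across the physical and digital domains, enabling systematic evaluation of cross-domain intelligence. Experimental results show a significant performance gap between state-of-the-art AI systems and human capabilities, revealing the challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, code, and websites for our project are publicly available on our project page: [URL]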

URL

https://arxiv.org/abs/2506.15677

PDF

https://arxiv.org/pdf/2506.15677.pdf

