Paper Reading AI Learner

COMBO: Compositional World Models for Embodied Multi-Agent Cooperation

2024-04-16 17:59:11
Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Yilun Du, Chuang Gan

Abstract

In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only partial egocentric views of the world. To plan effectively in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on the actions of an arbitrary number of agents, given only partial egocentric visual observations of the world. To address this issue of partial observability, we first train generative models to estimate the overall world state from partial egocentric observations. To accurately simulate multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation that factorizes the naturally composable joint actions of multiple agents and generates the video compositionally. Leveraging this compositional world model, in combination with Vision Language Models that infer the actions of other agents, we use a tree search procedure to integrate these modules and facilitate online cooperative planning. To evaluate the efficacy of our methods, we create two challenging embodied multi-agent long-horizon cooperation tasks using the ThreeDWorld simulator and conduct experiments with 2-4 agents. The results show that our compositional world model is effective and that the framework enables embodied agents to cooperate efficiently with different partners across various tasks and with an arbitrary number of agents, demonstrating the promise of our proposed framework. More videos can be found at this https URL.
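The planning loop described above, a world model that composes per-agent action effects plus a tree search over joint actions, can be sketched in miniature. This is only an illustrative toy, not the paper's method: COMBO's world model is a generative video model over egocentric observations, whereas here a plain dictionary state and a hypothetical additive per-agent transition stand in for it, and the search is exhaustive rather than guided by a Vision Language Model.

```python
from itertools import product

def compose_step(state, joint_action):
    """Apply each agent's action independently and compose the results,
    mirroring the idea of factorizing naturally composable joint actions.
    (Toy additive dynamics; the paper generates video instead.)"""
    new_state = dict(state)
    for agent, action in joint_action.items():
        new_state[agent] = new_state[agent] + action  # per-agent dynamics
    return new_state

def tree_search(state, action_space, agents, depth, score):
    """Depth-limited exhaustive search over joint actions using the composed
    world model; returns (best_value, best_first_joint_action)."""
    if depth == 0:
        return score(state), None
    best_value, best_action = float("-inf"), None
    for combo in product(action_space, repeat=len(agents)):
        joint = dict(zip(agents, combo))
        value, _ = tree_search(compose_step(state, joint), action_space,
                               agents, depth - 1, score)
        if value > best_value:
            best_value, best_action = value, joint
    return best_value, best_action

# Toy usage: two agents on a line cooperate so that both reach position 3.
agents = ["a1", "a2"]
state = {"a1": 0, "a2": 1}
score = lambda s: -sum(abs(s[a] - 3) for a in agents)
value, first_action = tree_search(state, [-1, 0, 1], agents, depth=2, score=score)
```

Note the joint action space grows exponentially with the number of agents, which is one reason the paper pairs the compositional model with inferred intents of other agents rather than enumerating everything.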

URL

https://arxiv.org/abs/2404.10775

PDF

https://arxiv.org/pdf/2404.10775.pdf
