Paper Reading AI Learner

COMBO: Compositional World Models for Embodied Multi-Agent Cooperation

2024-04-16 17:59:11
Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Yilun Du, Chuang Gan

Abstract

In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only partial egocentric views of the world. To plan effectively in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on the actions of an arbitrary number of agents, given only partial egocentric visual observations of the world. To address this issue of partial observability, we first train generative models to estimate the overall world state from partial egocentric observations. To accurately simulate multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation that factorizes the naturally composable joint actions of multiple agents and generates the video compositionally. Leveraging this compositional world model, in combination with Vision Language Models that infer the actions of other agents, we use a tree search procedure to integrate these modules and facilitate online cooperative planning. To evaluate the efficacy of our methods, we create two challenging embodied multi-agent long-horizon cooperation tasks using the ThreeDWorld simulator and conduct experiments with 2-4 agents. The results show that our compositional world model is effective and that the framework enables embodied agents to cooperate efficiently with different partners across various tasks and with an arbitrary number of agents, demonstrating the promise of our proposed framework. More videos can be found at this https URL.
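The planning loop described above, a world model that composes per-agent action effects plus a tree search over joint actions, can be sketched in miniature. This is only an illustrative toy, not the paper's method: COMBO's world model is a generative video model over egocentric observations, whereas here a plain dictionary state and a hypothetical additive per-agent transition stand in for it, and the search is exhaustive rather than guided by a Vision Language Model.

```python
from itertools import product

def compose_step(state, joint_action):
    """Apply each agent's action independently and compose the results,
    mirroring the idea of factorizing naturally composable joint actions.
    (Toy additive dynamics; the paper generates video instead.)"""
    new_state = dict(state)
    for agent, action in joint_action.items():
        new_state[agent] = new_state[agent] + action  # per-agent dynamics
    return new_state

def tree_search(state, action_space, agents, depth, score):
    """Depth-limited exhaustive search over joint actions using the composed
    world model; returns (best_value, best_first_joint_action)."""
    if depth == 0:
        return score(state), None
    best_value, best_action = float("-inf"), None
    for combo in product(action_space, repeat=len(agents)):
        joint = dict(zip(agents, combo))
        value, _ = tree_search(compose_step(state, joint), action_space,
                               agents, depth - 1, score)
        if value > best_value:
            best_value, best_action = value, joint
    return best_value, best_action

# Toy usage: two agents on a line cooperate so that both reach position 3.
agents = ["a1", "a2"]
state = {"a1": 0, "a2": 1}
score = lambda s: -sum(abs(s[a] - 3) for a in agents)
value, first_action = tree_search(state, [-1, 0, 1], agents, depth=2, score=score)
```

Note the joint action space grows exponentially with the number of agents, which is one reason the paper pairs the compositional model with inferred intents of other agents rather than enumerating everything.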

URL

https://arxiv.org/abs/2404.10775

PDF

https://arxiv.org/pdf/2404.10775.pdf
