Abstract
In this work we consider a generalization of the well-known multivehicle routing problem: given a network, a set of agents occupying a subset of its nodes, and a set of tasks, we seek a minimum cost sequence of movements subject to the constraint that each task is visited by some agent at least once. The classical version of this problem assumes a central computational server that observes the entire state of the system perfectly and directs individual agents according to a centralized control scheme. In contrast, we assume that there is no centralized server and that each agent is an individual processor with no a priori knowledge of the underlying network (including task and agent locations). Moreover, our agents possess strictly local communication and sensing capabilities (restricted to a fixed radius around their respective locations), aligning more closely with several real-world multiagent applications. These restrictions introduce many challenges that are overcome through local information sharing and direct coordination between agents. We present a fully distributed, online, and scalable reinforcement learning algorithm for this problem whereby agents self-organize into local clusters and independently apply a multiagent rollout scheme locally to each cluster. We demonstrate empirically via extensive simulations that there exists a critical sensing radius beyond which the distributed rollout algorithm begins to improve over a greedy base policy. This critical sensing radius grows proportionally to the $\log^*$ function of the size of the network, and is, therefore, a small constant for any relevant network. Our decentralized reinforcement learning algorithm achieves approximately a factor of two cost improvement over the base policy for a range of radii bounded from below and above by two and three times the critical sensing radius, respectively.
Abstract (translated)
在本研究中,我们考虑了著名的多辆车路由问题的一般化:给定一个网络、一群占有其节点一部分的Agents和一组任务,我们寻求一个最小成本序列的运动,受限于每个任务至少被某些Agents访问一次的限制。经典版本的这个问题假设有一个中央计算服务器,完美观察整个系统的状态,并按照一个集中控制Scheme对个体Agents进行指导。相比之下,我们假设没有中央服务器,每个Agent都是一个个体处理器,没有预先知道的底层网络(包括任务和Agents的位置)。此外,我们的Agents具有严格的本地通信和感知能力(限制在各自位置的固定半径内),更贴近几个真实的多Agent应用程序。这些限制引入了许多通过本地信息分享和直接协调Agents之间的合作克服了的挑战。我们提出了一个完全分布式、在线和可扩展的强化学习算法,以便Agents自我组织成本地簇,并独立地在本簇内应用多Agent部署计划方案。我们通过广泛的模拟实验经验证明,存在一个关键感知半径,超过该半径分布式部署算法开始优于贪婪基策略。这个关键感知半径 proportional to the $\log^*$ function of the network size,因此,对于任何相关的网络都是一个小型常数。我们的分散强化学习算法实现了大约两因子的成本改进,对于从下到上包括两个和三个关键感知半径的半径范围,分别实现了这一目标。
URL
https://arxiv.org/abs/2305.15596