Abstract
Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.
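To make the described mechanism concrete, below is a minimal sketch of a CoMAS-style co-evolution loop: agents take turns contributing to a discussion, an LLM-as-a-judge assigns each contribution an intrinsic reward, and each agent's policy is updated with that reward. This is not the authors' implementation; the names `Agent.generate`, `judge_score`, and `policy_update` are hypothetical stand-ins for LLM generation, the judge-based reward model, and an RL update (e.g., a policy-gradient step) respectively.

```python
# Hedged sketch of a CoMAS-style loop, assuming placeholder LLM and RL components.
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    history: list = field(default_factory=list)  # (contribution, reward) experience buffer

    def generate(self, task: str, discussion: list[str]) -> str:
        # Placeholder for an LLM call conditioned on the task and prior discussion turns.
        return f"{self.name}: proposal/critique for '{task}' (turn {len(discussion) + 1})"


def judge_score(task: str, discussion: list[str], contribution: str) -> float:
    # Placeholder LLM-as-a-judge: score how much a contribution advances the discussion.
    return random.uniform(0.0, 1.0)


def policy_update(agent: Agent, contribution: str, reward: float) -> None:
    # Placeholder RL step: in practice, a policy-gradient update of the agent's LLM
    # using the judge-derived intrinsic reward; here we only store the experience.
    agent.history.append((contribution, reward))


def co_evolve(agents: list[Agent], tasks: list[str], turns_per_task: int = 3) -> None:
    for task in tasks:
        discussion: list[str] = []
        for _ in range(turns_per_task):
            for agent in agents:  # agents take turns in the shared discussion
                contribution = agent.generate(task, discussion)
                reward = judge_score(task, discussion, contribution)  # intrinsic reward
                policy_update(agent, contribution, reward)            # decentralized update
                discussion.append(contribution)


if __name__ == "__main__":
    co_evolve([Agent("A"), Agent("B"), Agent("C")], ["solve: 2x + 3 = 7"])
```

The sketch highlights the decentralized nature of the approach: no external ground-truth reward enters the loop, and each agent is optimized independently from the rewards its own discussion contributions receive.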
URL
https://arxiv.org/abs/2510.08529