Abstract
Multi-agent systems based on large language models (LLMs) have demonstrated significant potential in social simulation and complex task resolution. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. We introduce AgentGroupChat-V2, a novel framework that addresses these challenges through three core innovations: (1) a divide-and-conquer, fully parallel architecture that decomposes user queries into hierarchical task forest structures, enabling dependency management and distributed concurrent processing; (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; and (3) agent organization optimization strategies that combine divide-and-conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains: it achieves 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly double that of other methods), and 79.20% pass@1 on HumanEval. The performance advantage grows with task difficulty, particularly on Level 5 MATH problems, where improvements exceed 11 percentage points over state-of-the-art baselines. These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems, with significant advantages in complex reasoning scenarios. Code is available at this https URL.
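The task-forest idea in innovation (1) can be illustrated with a small sketch. This is not the AgentGroupChat-V2 implementation; the abstract gives no API details, so the names `Task`, `solve`, and `run_forest` are hypothetical, and the `asyncio.sleep(0)` call merely stands in for an agent or LLM invocation. The sketch assumes the simplest dependency rule: a task may run only after all of its subtasks have finished, while sibling subtasks and separate trees run concurrently.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Task:
    """One node in a hypothetical task tree; children are its subtasks."""
    name: str
    children: list["Task"] = field(default_factory=list)


async def solve(task: Task, log: list[str]) -> str:
    """Solve a task only after all of its subtasks have finished."""
    # Sibling subtasks do not depend on one another, so they run concurrently.
    parts = await asyncio.gather(*(solve(c, log) for c in task.children))
    await asyncio.sleep(0)  # placeholder for an LLM agent call (assumption)
    log.append(task.name)   # record completion order for inspection
    if parts:
        return f"{task.name}({', '.join(parts)})"
    return task.name


async def run_forest(roots: list[Task]) -> tuple[list[str], list[str]]:
    """Process every tree of the forest concurrently."""
    log: list[str] = []
    results = await asyncio.gather(*(solve(r, log) for r in roots))
    return list(results), log


# A forest with two trees: "plan" depends on two subtasks, "verify" on none.
forest = [
    Task("plan", [Task("parse"), Task("compute")]),
    Task("verify"),
]
results, order = asyncio.run(run_forest(forest))
```

Here `results` holds one combined answer per root, and `order` records completion order, in which a parent always appears after its subtasks. A real system would replace the placeholder with model calls and would likely need richer dependency edges than a strict parent-child rule.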
URL
https://arxiv.org/abs/2506.15451