Abstract
In an era where single large language models have dominated the landscape of artificial intelligence for years, multi-agent systems are emerging as new protagonists in conversational task-solving. While previous studies have showcased their potential in reasoning tasks and creative endeavors, an analysis of their limitations concerning conversational paradigms and the impact of individual agents is missing. It remains unclear how multi-agent discussions perform across tasks of varying complexity and how the structure of these conversations influences the process. To fill that gap, this work systematically evaluates multi-agent systems across various discussion paradigms, assessing their strengths and weaknesses in both generative and question-answering tasks. Alongside the experiments, I propose a taxonomy of 20 multi-agent research studies from 2022 to 2024, followed by a framework for deploying multi-agent LLMs in conversational task-solving. I demonstrate that while multi-agent systems excel in complex reasoning tasks, outperforming a single model by leveraging expert personas, they fail on basic tasks. Concretely, I identify three challenges: 1) While longer discussions enhance reasoning, agents fail to conform to strict task requirements, which leads to problem drift and makes shorter conversations more effective for basic tasks. 2) Prolonged discussions risk alignment collapse, raising new safety concerns for these systems. 3) I show that agents can monopolize a discussion through long generations, raising fairness concerns in decision-making for tasks like summarization. This work uncovers both the potential and the challenges that arise with multi-agent interaction and varying conversational paradigms, providing insights into how future research can improve the efficiency, performance, and safety of multi-agent LLMs.
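The paper's framework is not detailed in the abstract, but the conversational paradigm it studies can be pictured as agents with expert personas taking turns over a shared transcript. Below is a minimal, hypothetical sketch of such a round-robin discussion loop; `generate` is a stand-in for any LLM completion call and is not the paper's actual interface, and the `rounds` parameter reflects the finding that shorter discussions suit basic tasks.

```python
# Minimal sketch of a turn-based multi-agent discussion (hypothetical,
# not the paper's actual framework or API).
from typing import Callable, List

def discuss(
    generate: Callable[[str, str], str],  # (persona, transcript) -> reply; any LLM backend
    personas: List[str],                  # expert personas, one per agent
    task: str,
    rounds: int = 3,                      # fewer rounds for basic tasks to avoid problem drift
) -> str:
    """Run a round-robin discussion over a shared transcript and return it."""
    transcript = f"Task: {task}"
    for _ in range(rounds):
        for persona in personas:
            reply = generate(persona, transcript)
            transcript += f"\n{persona}: {reply}"
    return transcript

if __name__ == "__main__":
    # Toy backend so the sketch runs without an LLM.
    def toy_generate(persona: str, transcript: str) -> str:
        return f"({persona} comments on {len(transcript)} chars of context)"

    print(discuss(toy_generate, ["Mathematician", "Engineer"], "Solve 2+2", rounds=2))
```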
URL
https://arxiv.org/abs/2410.22932