Abstract
Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete tasks showcase direct business value in a popular work environment.
URL
https://arxiv.org/abs/2411.02305