Abstract
We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench and find that the weakest (Llama2-70B) successfully completes as few as 3% of tasks, while the best-performing (GPT-4) completes just 43%. We further find that agents' errors can result in the wrong action being taken, such as an email being sent to the wrong person. WorkBench reveals weaknesses in agents' ability to undertake common business activities, raising questions about their use in high-stakes workplace settings. WorkBench is publicly available as a free resource.
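The idea of outcome-centric evaluation can be illustrated with a minimal sketch: rather than comparing the agent's sequence of tool calls to a reference sequence, compare the database state the agent leaves behind with the state the correct actions would produce. This is not WorkBench's actual code. The `send_email` tool, the dictionary-based sandbox, and the `outcome_matches` helper are all hypothetical names chosen for illustration.

```python
from copy import deepcopy


def send_email(db, recipient, subject):
    """Hypothetical tool: append an email record to the sandbox database."""
    db["emails"].append({"to": recipient, "subject": subject})


def outcome_matches(initial_db, agent_actions, reference_actions):
    """Return True if the agent's actions leave the sandbox in the same
    state as the reference actions (assumed evaluation rule)."""
    agent_db = deepcopy(initial_db)
    reference_db = deepcopy(initial_db)
    for tool, kwargs in agent_actions:
        tool(agent_db, **kwargs)
    for tool, kwargs in reference_actions:
        tool(reference_db, **kwargs)
    # Only the final database contents are compared, not the action traces.
    return agent_db == reference_db


# Hypothetical sandbox with two of the databases, both initially empty.
initial = {"emails": [], "calendar": []}

reference = [(send_email, {"recipient": "sam@example.com", "subject": "Q3 report"})]
correct = [(send_email, {"recipient": "sam@example.com", "subject": "Q3 report"})]
wrong = [(send_email, {"recipient": "alex@example.com", "subject": "Q3 report"})]

print(outcome_matches(initial, correct, reference))
print(outcome_matches(initial, wrong, reference))
```

In this sketch, an email sent to the wrong recipient changes the database in a way that differs from the reference outcome, so the task counts as failed even though the agent called the right tool.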
URL
https://arxiv.org/abs/2405.00823