WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting

2024-05-01 19:07:03
Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, Bertie Vidgen

Abstract

We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing (GPT-4). We further find that agents' errors can result in the wrong action being taken, such as an email being sent to the wrong person. WorkBench reveals weaknesses in agents' ability to undertake common business activities, raising questions about their use in high-stakes workplace settings. WorkBench is publicly available as a free resource.
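
The idea of outcome-centric evaluation described in the abstract can be illustrated with a minimal sketch. This is not the WorkBench implementation: the two-database sandbox, the tool names `send_email` and `delete_event`, and the helpers `run_actions`, `outcome_matches`, and `has_side_effects` are all hypothetical, chosen only to show how comparing final database states gives an automatic pass/fail signal.

```python
# Minimal sketch of outcome-centric evaluation (all names are hypothetical,
# not taken from the WorkBench codebase).

def initial_state():
    """Return a fresh toy sandbox: each 'database' is a list of row dicts."""
    return {
        "email": [],
        "calendar": [
            {"event_id": "1", "name": "Standup", "participant": "sam@example.com"},
        ],
    }

# Hypothetical tools. Each one mutates the sandbox state in place.
def send_email(state, recipient, subject):
    state["email"].append({"recipient": recipient, "subject": subject})

def delete_event(state, event_id):
    state["calendar"] = [e for e in state["calendar"] if e["event_id"] != event_id]

def run_actions(actions):
    """Apply a list of (tool, kwargs) pairs to a fresh sandbox and return it."""
    state = initial_state()
    for tool, kwargs in actions:
        tool(state, **kwargs)
    return state

def outcome_matches(predicted_actions, reference_actions):
    """A task passes only if the final database state equals the reference state."""
    return run_actions(predicted_actions) == run_actions(reference_actions)

def has_side_effects(predicted_actions, reference_actions):
    """True if the agent changed the databases, but not to the expected state."""
    predicted = run_actions(predicted_actions)
    changed = predicted != initial_state()
    return changed and predicted != run_actions(reference_actions)

if __name__ == "__main__":
    reference = [(send_email, {"recipient": "sam@example.com", "subject": "Agenda"})]
    wrong = [(send_email, {"recipient": "olly@example.com", "subject": "Agenda"})]
    print(outcome_matches(reference, reference))
    print(outcome_matches(wrong, reference))
    print(has_side_effects(wrong, reference))
```

Because only the resulting state is compared, any sequence of tool calls that reaches the expected state would count as correct under this sketch, while an email sent to the wrong recipient is flagged as an unintended change rather than a harmless miss.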

URL

https://arxiv.org/abs/2405.00823

PDF

https://arxiv.org/pdf/2405.00823.pdf

