Abstract
As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, offering little insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation), ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework in which AI amplifies human capability through partnership rather than replacement.
URL
https://arxiv.org/abs/2511.12306