UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI

2025-12-12 17:51:50
Darvin Yi, Teng Liu, Mattie Terzolo, Lance Hasson, Ayan Sinha, Pablo Mendes, Andrew Rabinovich

Abstract

As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, offering little insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation), ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework in which AI amplifies human capability through partnership rather than replacement.
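
The rubric-based evaluation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `Criterion` class, the `summarize_rubric` function, the example job, and the choice to aggregate with a simple pass rate are all assumptions made here for illustration, since the abstract does not say how per-criterion judgments are combined.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One acceptance criterion judged by a human expert (hypothetical structure)."""
    description: str
    passed: bool
    feedback: str = ""


def summarize_rubric(criteria):
    """Aggregate per-criterion judgments into a fine-grained summary.

    Assumption: an unweighted pass rate is used; the paper may aggregate differently.
    """
    total = len(criteria)
    met = sum(1 for c in criteria if c.passed)
    return {
        "criteria_met": met,
        "criteria_total": total,
        "pass_rate": met / total if total else 0.0,
        # The binary view that a pass/fail-only benchmark would report.
        "all_passed": total > 0 and met == total,
        # Per-criterion feedback for the criteria that were not met.
        "failed_feedback": [
            (c.description, c.feedback) for c in criteria if not c.passed
        ],
    }


# Hypothetical rubric for an invented logo-design job.
rubric = [
    Criterion("Logo is delivered as an SVG file", True),
    Criterion("Design uses the client's brand colors", False,
              "Palette differs from the brief"),
    Criterion("A monochrome variant is included", True),
]
summary = summarize_rubric(rubric)
print(summary["criteria_met"], "of", summary["criteria_total"], "criteria met")
```

A binary metric would record this submission only as a failure, while the per-criterion view shows that two of three criteria were met and points to the one that was not.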

URL

https://arxiv.org/abs/2511.12306

PDF

https://arxiv.org/pdf/2511.12306.pdf

