
Minority Reports: Balancing Cost and Quality in Ground Truth Data Annotation

2025-04-12 21:04:56
Hsuan Wei Liao, Christopher Klugmann, Daniel Kondermann, Rafid Mahmood

Abstract

High-quality data annotation is an essential but laborious and costly aspect of developing machine learning-based software. We explore the inherent tradeoff between annotation accuracy and cost by detecting and removing minority reports -- instances where annotators provide incorrect responses -- that indicate unnecessary redundancy in task assignments. We propose an approach to prune potentially redundant annotation task assignments before they are executed by estimating the likelihood of an annotator disagreeing with the majority vote for a given task. Our approach is informed by an empirical analysis over computer vision datasets annotated by a professional data annotation platform, which reveals that the likelihood of a minority report event is dependent primarily on image ambiguity, worker variability, and worker fatigue. Simulations over these datasets show that we can reduce the number of annotations required by over 60% with a small compromise in label quality, saving approximately 6.6 days-equivalent of labor. Our approach provides annotation service platforms with a method to balance cost and dataset quality. Machine learning practitioners can tailor annotation accuracy levels according to specific application needs, thereby optimizing budget allocation while maintaining the data quality necessary for critical settings like autonomous driving technology.
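The terms in the abstract can be made concrete with a small sketch. It is illustrative only and is not the authors' implementation: the abstract does not describe the estimator or the exact pruning rule, so the function names, the `p_minority` scores, and the threshold rule below are all hypothetical. It assumes a minority report is a response that differs from the majority-vote label for a task, and that a planned assignment is pruned when its estimated minority-report probability reaches a chosen threshold.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def minority_reports(responses):
    """Return (annotator, label) pairs that disagree with the majority label.

    `responses` maps an annotator id to that annotator's label for one task.
    """
    majority = majority_vote(responses.values())
    return [(who, label) for who, label in responses.items() if label != majority]


def prune_assignments(planned, p_minority, threshold=0.5):
    """Split planned (task, annotator) pairs into kept and pruned lists.

    `p_minority` maps a (task, annotator) pair to an estimated probability
    of a minority report. These scores are a stand-in (assumption) for the
    output of whatever estimator is actually used; pairs without a score
    are kept.
    """
    kept, pruned = [], []
    for pair in planned:
        if p_minority.get(pair, 0.0) >= threshold:
            pruned.append(pair)
        else:
            kept.append(pair)
    return kept, pruned


if __name__ == "__main__":
    # Toy data, invented for illustration.
    responses = {"ann_1": "car", "ann_2": "car", "ann_3": "truck"}
    print(majority_vote(responses.values()))
    print(minority_reports(responses))

    planned = [("img_7", "ann_1"), ("img_7", "ann_3")]
    scores = {("img_7", "ann_1"): 0.1, ("img_7", "ann_3"): 0.8}
    print(prune_assignments(planned, scores, threshold=0.5))
```

In a sketch like this, the threshold is the knob that trades cost against quality: lowering it prunes more assignments and saves labor, at the risk of discarding votes that would have changed the majority label.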

Abstract (translated)

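High-quality data annotation is a critical but time-consuming and expensive part of developing machine learning-based software. We explore the inherent tradeoff between annotation accuracy and cost by detecting and removing minority reports (cases where annotators give incorrect answers), which indicate unnecessary redundancy in task assignments. We propose an approach that prunes potentially redundant annotation task assignments before they are executed, deciding by estimating the probability that an annotator will disagree with the majority vote. Our approach is based on an empirical analysis of computer vision datasets labeled by a professional data annotation platform, which shows that the likelihood of a minority report event depends mainly on image ambiguity, worker variability, and worker fatigue. Simulations on these datasets show that we can reduce the number of annotations required by more than 60% with only a slight compromise in label quality, saving about 6.6 person-days of work. Our approach gives annotation service platforms a way to balance cost and dataset quality. Machine learning practitioners can adjust annotation accuracy levels to specific application needs, optimizing budget allocation while maintaining the data quality required in critical domains such as autonomous driving technology. With this approach, companies and service providers can manage resources more effectively and improve the quality and efficiency of the data used to train complex machine learning models.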

URL

https://arxiv.org/abs/2504.09341

PDF

https://arxiv.org/pdf/2504.09341.pdf

