Abstract
High-quality data annotation is an essential but laborious and costly aspect of developing machine learning-based software. We explore the inherent tradeoff between annotation accuracy and cost by detecting and removing minority reports -- instances where annotators provide incorrect responses -- that indicate unnecessary redundancy in task assignments. We propose an approach to prune potentially redundant annotation task assignments before they are executed by estimating the likelihood of an annotator disagreeing with the majority vote for a given task. Our approach is informed by an empirical analysis over computer vision datasets annotated by a professional data annotation platform, which reveals that the likelihood of a minority report event is dependent primarily on image ambiguity, worker variability, and worker fatigue. Simulations over these datasets show that we can reduce the number of annotations required by over 60% with a small compromise in label quality, saving approximately 6.6 days-equivalent of labor. Our approach provides annotation service platforms with a method to balance cost and dataset quality. Machine learning practitioners can tailor annotation accuracy levels according to specific application needs, thereby optimizing budget allocation while maintaining the data quality necessary for critical settings like autonomous driving technology.
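The abstract describes pruning task assignments by estimating, per assignment, the probability that the annotator would end up disagreeing with the majority vote (a minority report) and skipping assignments where that probability is high. Below is a minimal sketch of what such a pruning rule could look like; the feature names (image_ambiguity, worker_variability, worker_fatigue), the logistic scoring model, and the threshold are illustrative assumptions and not the paper's actual formulation.

```python
# Minimal sketch of the pruning idea from the abstract (hypothetical model).
from dataclasses import dataclass
import math


@dataclass
class Assignment:
    image_ambiguity: float      # e.g. disagreement entropy of labels collected so far, in [0, 1]
    worker_variability: float   # this worker's historical minority-report rate, in [0, 1]
    worker_fatigue: float       # e.g. normalized position within the worker's session, in [0, 1]


def minority_report_probability(a: Assignment,
                                weights=(2.0, 1.5, 1.0),
                                bias=-3.0) -> float:
    """Estimate the probability that this worker would disagree with the
    eventual majority vote on this task (illustrative logistic model)."""
    z = (bias
         + weights[0] * a.image_ambiguity
         + weights[1] * a.worker_variability
         + weights[2] * a.worker_fatigue)
    return 1.0 / (1.0 + math.exp(-z))


def should_prune(a: Assignment, threshold: float = 0.5) -> bool:
    """Prune the assignment before it is executed when a minority report
    is likely, i.e. the annotation would be redundant for the majority vote."""
    return minority_report_probability(a) >= threshold


if __name__ == "__main__":
    # An ambiguous image given to a fatigued, historically inconsistent worker
    # is a likely minority report, so the assignment would be skipped.
    candidate = Assignment(image_ambiguity=0.8, worker_variability=0.6, worker_fatigue=0.7)
    print(should_prune(candidate))  # True under these illustrative weights
```

In practice the threshold would be tuned on held-out annotated data to trade off the annotation budget against label quality, which is the cost-accuracy tradeoff the abstract refers to.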
Abstract (translated)
High-quality data annotation is a critical but time-consuming and expensive part of developing machine learning-based software. We explore the inherent tradeoff between annotation accuracy and cost by detecting and removing minority reports (cases where annotators give incorrect responses), which indicate unnecessary redundancy in task assignments. We propose an approach that prunes potentially redundant annotation task assignments before they are executed by estimating the likelihood that an annotator will disagree with the majority vote for a given task. Our approach is informed by an empirical analysis of computer vision datasets labeled by a professional data annotation platform, which reveals that the likelihood of a minority report event depends primarily on image ambiguity, worker variability, and worker fatigue. Simulations over these datasets show that we can reduce the number of required annotations by more than 60% with only a small compromise in label quality, saving approximately 6.6 days-equivalent of labor. Our approach gives annotation service platforms a method to balance cost and dataset quality. Machine learning practitioners can tailor annotation accuracy to specific application needs, optimizing budget allocation while maintaining the data quality required in critical settings such as autonomous driving technology. In this way, companies and service providers can manage resources more effectively and improve the quality and efficiency of the data used to train complex machine learning models.
URL
https://arxiv.org/abs/2504.09341