Abstract
Class imbalance and label noise are pervasive in large-scale datasets, yet much of machine learning research assumes well-labeled, balanced data, which rarely reflects real-world conditions. Existing approaches typically address either label noise or class imbalance in isolation, leading to suboptimal results when both issues coexist. In this work, we propose Conformal-in-the-Loop (CitL), a novel training framework that addresses both challenges with a conformal prediction-based approach. CitL evaluates sample uncertainty to adjust weights and prune unreliable examples, enhancing model resilience and accuracy with minimal computational cost. Our extensive experiments include a detailed analysis showing how CitL effectively emphasizes impactful data in noisy, imbalanced datasets. Our results show that CitL consistently boosts model performance, achieving up to a 6.1% increase in classification accuracy and a 5.0-point mIoU improvement in segmentation. Our code is publicly available: CitL.
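The abstract only sketches the mechanism, so the following is a minimal, hypothetical illustration of how split conformal prediction can drive sample weighting and pruning; it is not the authors' implementation. The function name `conformal_sample_weights` and the parameters `alpha` and `prune_threshold` are assumptions introduced for this sketch.

```python
import numpy as np

def conformal_sample_weights(cal_probs, cal_labels, train_probs, train_labels,
                             alpha=0.1, prune_threshold=None):
    """Illustrative sketch (not the paper's exact algorithm): weight and prune
    training samples by conformal uncertainty.

    cal_probs:    (n_cal, n_classes) softmax outputs on a held-out calibration split
    cal_labels:   (n_cal,) integer labels
    train_probs:  (n_train, n_classes) softmax outputs on the training set
    train_labels: (n_train,) integer (possibly noisy) labels
    """
    # Nonconformity score: 1 - probability assigned to the labeled class.
    cal_scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]

    # Finite-sample-corrected conformal quantile at miscoverage level alpha.
    n = len(cal_scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(cal_scores, q_level, method="higher")

    # Prediction set size per training sample: classes whose score <= q_hat.
    # Larger sets indicate higher model uncertainty on that sample.
    set_sizes = (1.0 - train_probs <= q_hat).sum(axis=1)

    # Down-weight ambiguous samples (large prediction sets).
    weights = 1.0 / np.maximum(set_sizes, 1)

    # Prune samples whose given label falls outside the prediction set,
    # i.e. the label looks inconsistent with the calibrated model.
    label_scores = 1.0 - train_probs[np.arange(len(train_labels)), train_labels]
    threshold = q_hat if prune_threshold is None else prune_threshold
    keep = label_scores <= threshold
    return weights * keep
```

In this reading, the weights would scale each sample's loss during training, and pruned samples (weight zero) are excluded from the current update; how CitL actually schedules these decisions across epochs is described in the paper itself.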
URL
https://arxiv.org/abs/2411.02281