Abstract
Recent deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance loss relative to training on the original dataset. These results have come from pruning commonly used image-caption datasets collected from the web -- datasets known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification of the recent SemDeDup algorithm that reduces the negative effects we observe. Examining CLIP-style models trained on deduplicated variants of LAION-400M, we find that our proposed FairDeDup algorithm consistently improves fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.
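To make the setting concrete, the SemDeDup-style pipeline the abstract refers to can be sketched as: embed each example, cluster the embeddings, and within each cluster drop examples that are too similar to an already-kept one. The sketch below is illustrative only; the cluster count, similarity threshold, and plain NumPy k-means are stand-ins for the paper's actual configuration, not its implementation.

```python
import numpy as np

def semantic_dedup(embeddings, n_clusters=4, threshold=0.95, seed=0):
    """Cluster L2-normalized embeddings, then within each cluster keep an
    example only if its cosine similarity to every already-kept example
    is below `threshold`. Returns sorted indices of kept examples."""
    rng = np.random.default_rng(seed)
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Simple Lloyd's k-means on the unit sphere (illustrative stand-in
    # for the large-scale clustering used in practice).
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(10):
        assign = np.argmax(X @ centers.T, axis=1)
        for k in range(n_clusters):
            members = X[assign == k]
            if len(members):
                c = members.mean(axis=0)
                centers[k] = c / np.linalg.norm(c)
    kept = []
    for k in range(n_clusters):
        cluster_kept = []
        for i in np.flatnonzero(assign == k):
            sims = [float(X[i] @ X[j]) for j in cluster_kept]
            if not sims or max(sims) < threshold:
                cluster_kept.append(i)
        kept.extend(cluster_kept)
    return sorted(kept)

# Toy demo: planted near-duplicates collapse onto their originals.
rng = np.random.default_rng(1)
base = rng.normal(size=(20, 16))
dups = base[:5] + 0.001 * rng.normal(size=(5, 16))  # near-duplicates
kept = semantic_dedup(np.vstack([base, dups]))
print(len(kept))  # fewer than the 25 input examples survive
```

FairDeDup's modification concerns *which* representative survives within a cluster of near-duplicates; a fairness-aware variant of the keep rule above would bias that choice, but the selection criterion itself is the paper's contribution and is not reproduced here.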
URL
https://arxiv.org/abs/2404.16123