Abstract
Although noise and caption quality are acknowledged as important factors in vision-language contrastive pre-training, we show in this paper that the full potential of improving the training process by addressing them is yet to be realized. Specifically, we first study and analyze two issues affecting training: incorrect assignment of negative pairs, and low caption quality and diversity. We then devise effective solutions to both problems, which essentially require training with multiple true positive pairs. Finally, we propose training with a sigmoid loss to meet this requirement. We show very large gains over the current state-of-the-art for both image recognition ($\sim +6\%$ on average over 11 datasets) and image retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).
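To illustrate why a sigmoid loss suits training with multiple true positive pairs, the following is a minimal NumPy sketch, not the authors' implementation. It scores every image-caption pair independently, so any number of entries per row can be labeled positive, whereas a softmax-based contrastive loss assumes a single positive per row. The function name `multi_positive_sigmoid_loss`, the `pos_mask` argument, and the default `scale` and `bias` values are assumptions made for illustration.

```python
import numpy as np

def multi_positive_sigmoid_loss(img_emb, txt_emb, pos_mask, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss that allows several positives per image.

    img_emb:  (N, D) L2-normalized image embeddings
    txt_emb:  (M, D) L2-normalized caption embeddings
    pos_mask: (N, M) boolean, True where image i and caption j truly match
    scale, bias: illustrative fixed values (assumed; typically learned)
    """
    # Scaled and shifted cosine similarity for every image-caption pair.
    logits = scale * (img_emb @ txt_emb.T) + bias
    # +1 for true positive pairs, -1 for negatives.
    labels = np.where(pos_mask, 1.0, -1.0)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    per_pair = np.logaddexp(0.0, -labels * logits)
    # Sum over all pairs, normalized by the number of images.
    return per_pair.sum() / img_emb.shape[0]
```

In this sketch, a second caption that matches an image (for example, a re-captioned or near-duplicate text) is simply marked `True` in `pos_mask`, rather than being penalized as a negative.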
URL
https://arxiv.org/abs/2405.10286