Abstract
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning, leading to state-of-the-art models for various downstream multimodal tasks. However, recent research has highlighted severe limitations of these models in their ability to perform compositional reasoning over objects, attributes, and relations. Scene graphs have emerged as an effective way to understand images compositionally. These are graph-structured semantic representations of images that contain objects, their attributes, and their relations with other objects in a scene. In this work, we consider the scene graph parsed from text as a proxy for the image scene graph and propose a graph decomposition and augmentation framework, along with a coarse-to-fine contrastive learning objective between images and text that aligns sentences of various complexities to the same image. In addition, we propose novel negative mining techniques in the scene graph space for improving attribute binding and relation understanding. Through extensive experiments, we demonstrate the effectiveness of our approach, which significantly improves attribute binding, relation understanding, systematic generalization, and productivity on multiple recently proposed benchmarks (for example, improvements of up to $18\%$ for systematic generalization and $16.5\%$ for relation understanding over a strong baseline), while achieving similar or better performance than CLIP on various general multimodal tasks.
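To make the core ideas concrete, the sketch below shows one plausible way to represent a text scene graph (objects, per-object attributes, and relation triples) and to mine a hard negative by swapping attributes between two objects, which corrupts attribute binding while leaving the bag of words unchanged. The data structure and the swap heuristic are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch (assumed structures, not the paper's code):
# a text scene graph and an attribute-swap hard negative.
from dataclasses import dataclass


@dataclass
class SceneGraph:
    objects: list      # e.g. ["dog", "frisbee"]
    attributes: dict   # object name -> list of attributes
    relations: list    # (subject, predicate, object) triples


def swap_attribute_negative(g: SceneGraph, a: str, b: str) -> SceneGraph:
    """Build a hard negative by exchanging the attribute lists of two
    objects: same words, wrong attribute binding."""
    attrs = dict(g.attributes)
    attrs[a], attrs[b] = attrs[b], attrs[a]
    return SceneGraph(list(g.objects), attrs, list(g.relations))


# Scene graph parsed from "a brown dog catches a white frisbee"
g = SceneGraph(
    objects=["dog", "frisbee"],
    attributes={"dog": ["brown"], "frisbee": ["white"]},
    relations=[("dog", "catches", "frisbee")],
)

# Negative caption: "a white dog catches a brown frisbee"
neg = swap_attribute_negative(g, "dog", "frisbee")
```

A contrastive objective can then treat the sentence rendered from `neg` as a hard negative for the image, forcing the model to attend to which attribute binds to which object rather than to word co-occurrence alone.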
URL
https://arxiv.org/abs/2305.13812