Paper Reading AI Learner

Improving Referring Image Segmentation using Vision-Aware Text Features

2024-04-12 16:38:48
Hai Nguyen-Truong, E-Ro Nguyen, Tuan-Anh Vu, Minh-Triet Tran, Binh-Son Hua, Sai-Kit Yeung

Abstract

Referring image segmentation is a challenging task that involves generating pixel-wise segmentation masks based on natural language descriptions. Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. This over-reliance on visual features can lead to suboptimal results, especially in complex scenarios where text prompts are ambiguous or context-dependent. To overcome these challenges, we present a novel framework, VATEX, that improves referring image segmentation by enhancing object and context understanding with Vision-Aware Text Features. Our method uses CLIP to derive a CLIP Prior that integrates an object-centric visual heatmap with the text description, which can be used as the initial query in a DETR-based architecture for the segmentation task. Furthermore, observing that there are multiple ways to describe an instance in an image, we enforce feature similarity between text variations referring to the same visual input via two components: a novel Contextual Multimodal Decoder that turns text embeddings into vision-aware text features, and a Meaning Consistency Constraint that further ensures a coherent and consistent interpretation of language expressions given the context understanding obtained from the image. Our method achieves significant performance improvements on three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref. Code is available at: this https URL\_RIS.
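To make the two key ideas in the abstract concrete, here is a minimal NumPy sketch of (a) a CLIP-Prior-style initial query, built by weighting patch features with a text-patch similarity heatmap and fusing the result with the text embedding, and (b) a meaning-consistency penalty between two phrasings of the same expression. This is an illustrative approximation under stated assumptions, not the authors' implementation: the embedding dimension, patch grid, fusion by addition, and all feature values are placeholders standing in for real CLIP outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512        # shared CLIP embedding dimension (assumed)
H = W = 7      # patch grid of the visual encoder (assumed)

# Placeholder features standing in for CLIP image-patch and text outputs.
patch_feats = rng.standard_normal((H * W, D))   # per-patch visual features
text_embed = rng.standard_normal(D)             # sentence-level text feature

# L2-normalize, then cosine similarity per patch -> object-centric heatmap.
patch_feats /= np.linalg.norm(patch_feats, axis=1, keepdims=True)
text_embed /= np.linalg.norm(text_embed)
sim = patch_feats @ text_embed                   # (H*W,)
heatmap = np.exp(sim) / np.exp(sim).sum()        # softmax over patches

# Heatmap-weighted pooling of visual features, fused with the text feature
# to form an initial query for a DETR-style decoder (fusion rule assumed).
visual_prior = heatmap @ patch_feats             # (D,)
clip_prior = visual_prior + text_embed           # initial decoder query, (D,)

# Meaning-consistency sketch: two phrasings of the same referred object
# should map to nearby features; penalize their cosine distance.
text_variant = text_embed + 0.1 * rng.standard_normal(D)  # a paraphrase
text_variant /= np.linalg.norm(text_variant)
consistency_loss = 1.0 - float(text_embed @ text_variant)

print(heatmap.shape, clip_prior.shape, consistency_loss)
```

In practice the patch features would come from CLIP's visual encoder and the text feature from its text encoder; the sketch only shows how a text-conditioned heatmap can turn generic visual features into a query that already points at the referred object.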


URL

https://arxiv.org/abs/2404.08590

PDF

https://arxiv.org/pdf/2404.08590.pdf

