Paper Reading AI Learner

Improving Referring Image Segmentation using Vision-Aware Text Features

2024-04-12 16:38:48
Hai Nguyen-Truong, E-Ro Nguyen, Tuan-Anh Vu, Minh-Triet Tran, Binh-Son Hua, Sai-Kit Yeung

Abstract

Referring image segmentation is a challenging task that involves generating pixel-wise segmentation masks based on natural language descriptions. Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. This over-reliance on visual features can lead to suboptimal results, especially in complex scenarios where text prompts are ambiguous or context-dependent. To overcome these challenges, we present a novel framework, VATEX, that improves referring image segmentation by enhancing object and context understanding with Vision-Aware Text Features. Our method uses CLIP to derive a CLIP Prior that integrates an object-centric visual heatmap with the text description, which can be used as the initial query in a DETR-based architecture for the segmentation task. Furthermore, observing that there are multiple ways to describe an instance in an image, we enforce feature similarity between text variations referring to the same visual input via two components: a novel Contextual Multimodal Decoder that turns text embeddings into vision-aware text features, and a Meaning Consistency Constraint that further ensures a coherent and consistent interpretation of language expressions given the context understanding obtained from the image. Our method achieves significant performance improvements on three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref. Code is available at: this https URL\_RIS.
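To make the two key ideas in the abstract concrete, here is a minimal NumPy sketch of (a) a CLIP-Prior-style initial query, built by weighting patch features with a text-patch similarity heatmap and fusing the result with the text embedding, and (b) a meaning-consistency penalty between two phrasings of the same expression. This is an illustrative approximation under stated assumptions, not the authors' implementation: the embedding dimension, patch grid, fusion by addition, and all feature values are placeholders standing in for real CLIP outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512        # shared CLIP embedding dimension (assumed)
H = W = 7      # patch grid of the visual encoder (assumed)

# Placeholder features standing in for CLIP image-patch and text outputs.
patch_feats = rng.standard_normal((H * W, D))   # per-patch visual features
text_embed = rng.standard_normal(D)             # sentence-level text feature

# L2-normalize, then cosine similarity per patch -> object-centric heatmap.
patch_feats /= np.linalg.norm(patch_feats, axis=1, keepdims=True)
text_embed /= np.linalg.norm(text_embed)
sim = patch_feats @ text_embed                   # (H*W,)
heatmap = np.exp(sim) / np.exp(sim).sum()        # softmax over patches

# Heatmap-weighted pooling of visual features, fused with the text feature
# to form an initial query for a DETR-style decoder (fusion rule assumed).
visual_prior = heatmap @ patch_feats             # (D,)
clip_prior = visual_prior + text_embed           # initial decoder query, (D,)

# Meaning-consistency sketch: two phrasings of the same referred object
# should map to nearby features; penalize their cosine distance.
text_variant = text_embed + 0.1 * rng.standard_normal(D)  # a paraphrase
text_variant /= np.linalg.norm(text_variant)
consistency_loss = 1.0 - float(text_embed @ text_variant)

print(heatmap.shape, clip_prior.shape, consistency_loss)
```

In practice the patch features would come from CLIP's visual encoder and the text feature from its text encoder; the sketch only shows how a text-conditioned heatmap can turn generic visual features into a query that already points at the referred object.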


URL

https://arxiv.org/abs/2404.08590

PDF

https://arxiv.org/pdf/2404.08590.pdf

