Paper Reading AI Learner

OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning

2025-05-22 17:51:48
Zongyan Han, Jiale Cao, Shuo Chen, Tong Wang, Jorma Laaksonen, Rao Muhammad Anwer

Abstract

Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks in a single forward pass, without explicit reasoning or interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings, owing to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual evidence for objects in a coarse-to-fine manner. Based on these reasoning steps, we compose detailed description prompts and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains on all metrics for open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at this https URL.
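To make the pipeline concrete, here is a minimal Python sketch of the reason-then-segment flow the abstract describes. The `ReasoningTriplet` schema, the `lmm.chat` interface, and the `segmentor.segment` call are all illustrative assumptions, not the paper's actual API; consult the linked repository for the real implementation.

```python
from dataclasses import dataclass

# Hypothetical schema for the structured triplets described in the abstract:
# a category paired with generic (class-level) and image-specific reasoning.
# The paper's actual format may differ.
@dataclass
class ReasoningTriplet:
    category: str      # candidate open-vocabulary class name
    generic_cue: str   # generic visual reasoning for the class
    image_cue: str     # image-specific visual reasoning

def generate_reasoning(lmm, image, categories):
    """Ask the LMM for coarse-to-fine visual reasoning per category.

    `lmm` is assumed to expose a `chat(image, prompt) -> str` interface;
    real LMM APIs differ.
    """
    triplets = []
    for cat in categories:
        generic = lmm.chat(
            image, f"What visual traits generally identify a '{cat}'?")
        specific = lmm.chat(
            image, f"Which traits of '{cat}' are visible in this image, and where?")
        triplets.append(ReasoningTriplet(cat, generic, specific))
    return triplets

def compose_prompt(triplet):
    """Fold one reasoning triplet into a detailed description prompt."""
    return (f"{triplet.category}: {triplet.generic_cue} "
            f"In this image: {triplet.image_cue}")

def openseg_r(lmm, segmentor, image, categories):
    """End-to-end sketch: reason step by step first, then segment."""
    triplets = generate_reasoning(lmm, image, categories)
    prompts = [compose_prompt(t) for t in triplets]
    # The segmentor is assumed to accept free-form text prompts per class,
    # as CLIP-style open-vocabulary segmentors typically do.
    return segmentor.segment(image, prompts)
```

The point of the sketch is the ordering: per-category reasoning is produced before segmentation, and the segmentor receives the enriched description prompts rather than bare class names.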

URL

https://arxiv.org/abs/2505.16974

PDF

https://arxiv.org/pdf/2505.16974.pdf

