The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much-studied area with numerous careful, and sometimes complex, approaches and training schemes, including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as its input. In the second, SAM takes RGB as its input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin on both single- and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model outperforms previous methods on multiple video object segmentation benchmarks.
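The second model (flow as a prompt) can be approximated with the public `segment_anything` API. Below is a minimal sketch, assuming a precomputed optical-flow field and using the flow-magnitude peak as a hypothetical point prompt; the paper's actual prompting scheme may differ.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def segment_moving_object(rgb: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """rgb: HxWx3 uint8 frame; flow: HxWx2 optical flow (e.g., from RAFT)."""
    predictor.set_image(rgb)                      # SAM encodes the RGB frame
    magnitude = np.linalg.norm(flow, axis=-1)     # per-pixel motion strength
    y, x = np.unravel_index(magnitude.argmax(), magnitude.shape)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),          # strongest-motion pixel as prompt
        point_labels=np.array([1]),               # 1 = foreground point
        multimask_output=True,
    )
    return masks[scores.argmax()]                 # keep SAM's highest-scoring mask
```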
https://arxiv.org/abs/2404.12389
Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like the Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels, and rectify the noise in the pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: this https URL.
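The self-exploration phase rests on clustering self-supervised features into region pseudo-labels. A minimal sketch of that idea follows, assuming a hypothetical grid of patch embeddings (e.g., DINO patch tokens) and using plain k-means in place of SOHES's actual clustering procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_label_masks(patch_features: np.ndarray, n_regions: int = 8) -> np.ndarray:
    """patch_features: (H, W, D) grid of self-supervised patch embeddings.
    Returns an (H, W) map of cluster ids usable as coarse region pseudo-labels."""
    H, W, D = patch_features.shape
    flat = patch_features.reshape(-1, D).astype(np.float64)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8  # cosine-style clustering
    labels = KMeans(n_clusters=n_regions, n_init=10).fit_predict(flat)
    return labels.reshape(H, W)
```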
https://arxiv.org/abs/2404.12386
With the emergence of large-scale models trained on diverse datasets, in-context learning has emerged as a promising paradigm for multitasking, notably in natural language processing and image processing. However, its application in 3D point cloud tasks remains largely unexplored. In this work, we introduce Point-In-Context (PIC), a novel framework for 3D point cloud understanding via in-context learning. We address the technical challenge of effectively extending masked point modeling to 3D point clouds by introducing a Joint Sampling module and proposing a vanilla version of PIC called Point-In-Context-Generalist (PIC-G). PIC-G is designed as a generalist model for various 3D point cloud tasks, with inputs and outputs modeled as coordinates. In this paradigm, the challenging segmentation task is achieved by assigning label points with XYZ coordinates to each category; the final prediction is then chosen based on the label point closest to the prediction. To break the limitation of the fixed label-coordinate assignment, which generalizes poorly to novel classes, we propose two novel training strategies, In-Context Labeling and In-Context Enhancing, forming an extended version of PIC named Point-In-Context-Segmenter (PIC-S), aimed at improving dynamic context labeling and model training. By utilizing dynamic in-context labels and extra in-context pairs, PIC-S achieves enhanced performance and generalization capability in and across part segmentation datasets. PIC is a general framework, so other tasks or datasets can be seamlessly introduced through a unified data format. We conduct extensive experiments to validate the versatility and adaptability of our proposed methods in handling a wide range of tasks and segmenting multiple datasets. Our PIC-S is capable of generalizing to unseen datasets and performing novel part segmentation by customizing prompts.
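The label-point scheme turns segmentation into coordinate regression: each category is assigned a fixed XYZ "label point", and a predicted point is classified by its nearest label point. A minimal numpy sketch of this decision rule, with illustrative label coordinates rather than the paper's actual values:

```python
import numpy as np

# Hypothetical fixed label points, one XYZ coordinate per category.
LABEL_POINTS = np.array([
    [0.0, 0.0, 0.0],   # category 0
    [1.0, 0.0, 0.0],   # category 1
    [0.0, 1.0, 0.0],   # category 2
])

def classify_predictions(pred_points: np.ndarray) -> np.ndarray:
    """pred_points: (N, 3) regressed coordinates; returns (N,) category ids,
    each chosen as the label point closest to the prediction."""
    dists = np.linalg.norm(pred_points[:, None, :] - LABEL_POINTS[None, :, :], axis=-1)
    return dists.argmin(axis=1)
```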
https://arxiv.org/abs/2404.12352
This paper introduces a new technique to measure the feature dependency of neural network models. The motivation is to better understand a model by querying whether it is using information from human-understandable features, e.g., anatomical shape, volume, or image texture. Our method is based on the principle that if a model is dependent on a feature, then removal of that feature should significantly harm its performance. A targeted feature is "removed" by collapsing the dimension in the data distribution that corresponds to that feature. We perform this by moving data points along the feature dimension to a baseline feature value while staying on the data manifold, as estimated by a deep generative model. Then we observe how the model's performance changes on the modified test data set, with the target feature dimension removed. We test our method on deep neural network models trained on synthetic image data with known ground truth, an Alzheimer's disease prediction task using MRI and hippocampus segmentations from the OASIS-3 dataset, and a cell nuclei classification task using the Lizard dataset.
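The core test, collapsing one feature dimension to a baseline value while staying on the data manifold and then measuring the performance drop, can be sketched as follows, assuming hypothetical `encode`/`decode` functions of a pre-trained deep generative model whose latent axis `feature_dim` aligns with the target feature:

```python
import numpy as np

def feature_dependency(model_predict, encode, decode, X_test, y_test,
                       feature_dim: int, baseline_value: float = 0.0) -> float:
    """Return the accuracy drop after 'removing' one feature dimension."""
    def accuracy(X):
        return float(np.mean(model_predict(X) == y_test))

    z = encode(X_test)                           # project test data onto the manifold
    z_removed = z.copy()
    z_removed[:, feature_dim] = baseline_value   # collapse the target feature dimension
    X_removed = decode(z_removed)                # back to data space, still on-manifold

    return accuracy(X_test) - accuracy(X_removed)  # large drop => strong dependency
```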
https://arxiv.org/abs/2404.12341
Resource-constrained hardware, such as edge devices or cell phones, often relies on cloud servers to provide the required computational resources for inference in deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs, such as JPEG or H.264, is prevalent and required to ensure interoperability. This paper aims to examine the implications of employing standardized codecs within deep vision pipelines. We find that using JPEG and H.264 coding significantly deteriorates the accuracy across a broad range of vision tasks and models. For instance, strong compression rates reduce semantic segmentation accuracy by more than 80% in mIoU. In contrast to previous findings, our analysis extends beyond image and action classification to localization and dense prediction tasks, thus providing a more comprehensive perspective.
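The core experiment, measuring task accuracy as a function of compression strength, is easy to reproduce in outline. A sketch using Pillow's JPEG encoder, with a hypothetical `predict_mask`/`miou` pair standing in for any segmentation model and metric:

```python
import io
import numpy as np
from PIL import Image

def jpeg_roundtrip(img: Image.Image, quality: int) -> Image.Image:
    """Encode and decode once with a standardized codec at the given quality."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def compression_sweep(images, gt_masks, predict_mask, miou):
    """Report mIoU of a segmentation model across JPEG quality levels."""
    for quality in (95, 75, 50, 25, 10):
        preds = [predict_mask(np.asarray(jpeg_roundtrip(im, quality)))
                 for im in images]
        print(f"quality={quality:3d}  mIoU={miou(preds, gt_masks):.3f}")
```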
https://arxiv.org/abs/2404.12330
The Segment Anything Model (SAM) is a deep neural network foundation model designed to perform instance segmentation, which has gained significant popularity given its zero-shot segmentation ability. SAM operates by generating masks based on various input prompts such as text, bounding boxes, points, or masks, introducing a novel methodology to overcome the constraints posed by dataset-specific scarcity. While SAM is trained on an extensive dataset comprising ~11M images, it mostly consists of natural photographic images, with only very limited images from other modalities. While the rapid progress in visual infrared surveillance and X-ray security screening imaging technologies, driven forward by advances in deep learning, has significantly enhanced the ability to detect, classify and segment objects with high accuracy, it is not evident whether the SAM zero-shot capabilities can be transferred to such modalities. This work assesses SAM's capabilities in segmenting objects of interest in the X-ray/infrared modalities. Our approach reuses the pre-trained SAM with three different prompts: bounding box, centroid and random points. We present quantitative/qualitative results to showcase the performance on selected datasets. Our results show that SAM can segment objects in the X-ray modality when given a box prompt, but its performance varies for point prompts. Specifically, SAM performs poorly in segmenting slender objects and organic materials, such as plastic bottles. We find that infrared objects are also challenging to segment with point prompts given the low-contrast nature of this modality. This study shows that while SAM demonstrates outstanding zero-shot capabilities with box prompts, its performance ranges from moderate to poor for point prompts, indicating that special consideration of the cross-modal generalisation of SAM is needed before use on X-ray/infrared imagery.
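The box and centroid prompting modes map directly onto the `segment_anything` API. A sketch on a single X-ray image, where the checkpoint choice and the box coordinates are placeholders:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

def segment_with_prompts(xray_rgb: np.ndarray, box: np.ndarray):
    """xray_rgb: HxWx3 uint8 (grayscale X-ray replicated to 3 channels);
    box: [x0, y0, x1, y1]. Returns masks from a box and a centroid prompt."""
    predictor.set_image(xray_rgb)
    box_mask, _, _ = predictor.predict(box=box, multimask_output=False)
    centroid = np.array([[(box[0] + box[2]) / 2, (box[1] + box[3]) / 2]])
    point_mask, _, _ = predictor.predict(point_coords=centroid,
                                         point_labels=np.array([1]),  # 1 = foreground
                                         multimask_output=False)
    return box_mask[0], point_mask[0]
```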
https://arxiv.org/abs/2404.12285
The recent emergence of deep learning has led to a great deal of work on designing supervised deep semantic segmentation algorithms. Since sufficient pixel-level labels are very difficult to obtain in many tasks, we propose a method that combines a Gaussian mixture model (GMM) with unsupervised deep learning techniques. In the standard GMM, the pixel values within each sub-region are modelled by a Gaussian distribution. In order to identify the different regions, the parameter vector that minimizes the negative log-likelihood (NLL) function of the GMM has to be approximated. For this task, iterative optimization methods such as the expectation-maximization (EM) algorithm are usually used. In this paper, we propose to estimate these parameters directly from the image using a convolutional neural network (CNN). We thus change the iterative procedure in the EM algorithm, replacing the expectation step by a gradient step with respect to the network's parameters. This means that the network is trained to minimize the NLL function of the GMM, which comes with at least two advantages. First, once trained, the network is able to predict label probabilities very quickly compared with time-consuming iterative optimization methods. Secondly, due to the deep image prior, our method is able to partially overcome one of the main disadvantages of GMMs, namely that they assume independence between neighboring pixels and thus do not take their correlation into account. We demonstrate the advantages of our method in various experiments on the example of myocardial infarct segmentation on multi-sequence MRI images.
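The training objective is just the GMM negative log-likelihood, with the CNN predicting per-pixel mixture responsibilities. A minimal PyTorch sketch, assuming the network outputs K per-pixel logits and the Gaussian means and variances are learnable per-component scalars for a single-channel image (the paper's exact parameterization may differ):

```python
import math
import torch
import torch.nn as nn

class GMMNllLoss(nn.Module):
    """NLL of a K-component Gaussian mixture with pixel-wise mixing weights."""
    def __init__(self, k: int):
        super().__init__()
        self.mu = nn.Parameter(torch.linspace(0.0, 1.0, k))   # component means
        self.log_sigma = nn.Parameter(torch.zeros(k))         # component log-stds

    def forward(self, logits: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        """logits: (B, K, H, W) CNN output; image: (B, 1, H, W) intensities."""
        log_pi = torch.log_softmax(logits, dim=1)             # pixel-wise log weights
        mu = self.mu.view(1, -1, 1, 1)
        log_sigma = self.log_sigma.view(1, -1, 1, 1)
        log_gauss = (-0.5 * ((image - mu) / log_sigma.exp()) ** 2
                     - log_sigma - 0.5 * math.log(2 * math.pi))
        # Marginalize over components per pixel, then average the NLL.
        return -torch.logsumexp(log_pi + log_gauss, dim=1).mean()
```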
https://arxiv.org/abs/2404.12252
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) in computer vision has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question whether the fine-tuning performance of extremely simple, small-scale ViTs can also benefit from this pre-training paradigm, a question considerably less studied than the well-established methodology of designing lightweight architectures with sophisticated components. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe different behaviors between MIM and CL with respect to the downstream fine-tuning data scales. Furthermore, we analyze the frozen features under linear probing evaluation, as well as the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training on higher layers, leading to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding naturally guides the choice of appropriate distillation strategies during pre-training to solve the above deterioration problem. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K semantic segmentation task (42.8% mIoU) and the LaSOT visual tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all current SOTA lightweight CPU-realtime trackers.
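Regardless of backbone size, the MIM objective reduces to reconstructing masked patches. A minimal sketch of that loss for a lightweight ViT, assuming the model predicts raw pixel values per patch; the paper's exact reconstruction target and distillation setup are not reproduced here:

```python
import torch

def masked_patch_loss(pred: torch.Tensor, target: torch.Tensor,
                      mask: torch.Tensor) -> torch.Tensor:
    """pred/target: (B, N, P) per-patch pixel values for N patches;
    mask: (B, N) with 1 where a patch was masked out of the encoder input.
    Only masked patches contribute, as in standard MIM pre-training."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)       # MSE per patch
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```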
https://arxiv.org/abs/2404.12210
Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform the task of semantic segmentation effectively. Benchmarking their performance is essential for selecting current models and guiding future model developments for this task. The lack of a standardized benchmark complicates comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under various settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, more advanced decoder and smaller patch size, while reducing training time by more than 13 times. Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used. The code for efficiently fine-tuning VFMs for semantic segmentation can be accessed through the project page at: this https URL.
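The recommended linear-decoder setting is a single 1x1 convolution over the patch tokens followed by upsampling. A minimal PyTorch sketch, assuming ViT-B patch tokens reshaped to a feature grid (dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDecoder(nn.Module):
    """Per-patch linear classification head for semantic segmentation."""
    def __init__(self, embed_dim: int = 768, num_classes: int = 150):
        super().__init__()
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, tokens: torch.Tensor, hw: tuple, out_size: tuple) -> torch.Tensor:
        """tokens: (B, N, D) patch tokens (no CLS); hw: patch grid (h, w)."""
        B, N, D = tokens.shape
        feat = tokens.transpose(1, 2).reshape(B, D, *hw)   # (B, D, h, w)
        logits = self.head(feat)                           # (B, C, h, w)
        return F.interpolate(logits, size=out_size,
                             mode="bilinear", align_corners=False)
```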
https://arxiv.org/abs/2404.12172
As recent advances in mobile camera technology have enabled the capability to capture high-resolution images, such as 4K images, the demand for an efficient deblurring model that handles large motion has increased. In this paper, we discover that the image residual errors, i.e., blur-sharp pixel differences, can be grouped into several categories according to their motion blur type and how complex their neighboring pixels are. Inspired by this, we decompose the deblurring (regression) task into blur pixel discretization (pixel-level blur classification) and discrete-to-continuous conversion (regression with a blur class map) tasks. Specifically, we generate the discretized image residual errors by identifying the blur pixels and then transform them to a continuous form, which is computationally more efficient than naively solving the original regression problem with continuous values. Here, we found that the discretization result, i.e., the blur segmentation map, remarkably exhibits visual similarity to the image residual errors. As a result, our efficient model shows performance comparable to state-of-the-art methods on realistic benchmarks, while being up to 10 times more computationally efficient.
https://arxiv.org/abs/2404.12168
Humans show an innate capability to identify tools to support specific actions. The association between object parts and the actions they facilitate is usually named affordance. Being able to segment object parts depending on the tasks they afford is crucial to enable intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP, which overcomes these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning of models.
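The zero-shot recipe, scoring dense visual features against the CLIP embedding of a free-form action prompt, can be sketched as below. The `dense_visual_features` extractor is a hypothetical stand-in for AffordanceCLIP's actual architecture; only the CLIP text side is real API here.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def affordance_heatmap(dense_visual_features: torch.Tensor, action: str) -> torch.Tensor:
    """dense_visual_features: (H, W, D) image features projected into CLIP's
    embedding space (hypothetical extractor, not part of the CLIP package).
    Returns an (H, W) similarity map for an arbitrary action prompt."""
    tokens = clip.tokenize([f"something to {action}"]).to(device)
    with torch.no_grad():
        text = model.encode_text(tokens).float()                 # (1, D)
    text = text / text.norm(dim=-1, keepdim=True)
    feats = dense_visual_features / dense_visual_features.norm(dim=-1, keepdim=True)
    return feats @ text.squeeze(0)                               # cosine similarity per location
```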
https://arxiv.org/abs/2404.12015
Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions, yet it relies on cost-intensive mask annotations. Weakly supervised RIS thus learns pixel-level semantics from image-text pairs, which is challenging for segmenting fine-grained masks. A natural approach to enhancing segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even lead to performance regression due to inevitable noise issues and a tendency to focus excessively on object parts. In this paper, we present an innovative framework, Point PrompTing (PPT), incorporated with the proposed multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability but also generates negative point prompts to address the noise and excessive-focus issues inherently and effectively. In addition, we introduce a curriculum learning strategy with object-centric images to help PPT gradually learn from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques in mIoU by 11.34%, 14.14%, and 6.97% on RefCOCO, RefCOCO+, and G-Ref, respectively.
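The point-generator idea, positive prompts from the peaks of a CLIP text-image similarity map and negative prompts from its low-response regions, can be sketched as follows and fed straight into SAM's point interface. The actual generator in PPT is learned; this heuristic version is only an illustration.

```python
import numpy as np

def points_from_heatmap(sim: np.ndarray, n_neg: int = 3):
    """sim: (H, W) CLIP similarity map for the referring expression.
    Returns point_coords (x, y) and point_labels for SamPredictor.predict:
    one positive point at the peak, n_neg negatives at the weakest responses."""
    H, W = sim.shape
    flat = sim.ravel()
    pos = np.array(np.unravel_index(flat.argmax(), (H, W)))            # (y, x)
    neg = np.stack(np.unravel_index(np.argsort(flat)[:n_neg], (H, W)), axis=1)
    coords = np.vstack([pos[None, ::-1], neg[:, ::-1]])                # to (x, y) order
    labels = np.array([1] + [0] * n_neg)                               # 1 = fg, 0 = bg for SAM
    return coords.astype(float), labels
```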
https://arxiv.org/abs/2404.11998
Weakly Incremental Learning for Semantic Segmentation (WILSS) leverages a pre-trained segmentation model to segment new classes using cost-effective and readily available image-level labels. A prevailing way to solve WILSS is the generation of seed areas for each new class, serving as a form of pixel-level supervision. However, a scenario usually arises where a pixel is concurrently predicted as an old class by the pre-trained segmentation model and as a new class by the seed areas. Such a scenario becomes particularly problematic in WILSS, as the lack of pixel-level annotations on new classes makes it intractable to ascertain whether the pixel pertains to the new class or not. To surmount this issue, we propose an innovative, tendency-driven relationship of mutual exclusivity, meticulously tailored to govern the behavior of the seed areas and the predictions generated by the pre-trained segmentation model. This relationship stipulates that predictions for the new and old classes must not conflict whilst prioritizing the preservation of predictions for the old classes, which not only addresses the conflicting-prediction issue but also effectively mitigates the inherent challenge of incremental learning: catastrophic forgetting. Furthermore, under the auspices of this tendency-driven mutual exclusivity relationship, we generate pseudo masks for the new classes, which are produced concurrently with model parameter updates by solving a bi-level optimization problem. Extensive experiments substantiate the effectiveness of our framework, resulting in the establishment of new benchmarks and paving the way for further research in this field.
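One way to read the mutual exclusivity relationship is as a pixel-wise precedence rule when fusing old-class predictions and new-class seed areas. A heavily simplified numpy sketch of that reading; the paper resolves the conflict via a bi-level optimization, not a hard rule, and the threshold here is an assumption:

```python
import numpy as np

def fuse_predictions(old_probs: np.ndarray, seed_mask: np.ndarray,
                     new_class_id: int, tau: float = 0.7) -> np.ndarray:
    """old_probs: (C, H, W) softmax of the pre-trained model (class 0 = background);
    seed_mask: (H, W) boolean seed area for the new class.
    Old-class predictions take precedence; the new class is only assigned where
    the old model sees background (or is unconfident), so the two never conflict."""
    old_pred = old_probs.argmax(axis=0)
    old_conf = old_probs.max(axis=0)
    pseudo = old_pred.copy()
    background = (old_pred == 0) | (old_conf < tau)
    pseudo[background & seed_mask] = new_class_id
    return pseudo
```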
https://arxiv.org/abs/2404.11981
Foundation models, pre-trained on a large amount of data, have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, \textit{i.e.}, these foundation models fail to discern boundaries between individual objects. For the first time, we show that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior through the clustering results of a particular intermediate layer. Following this surprising observation, we propose $\textbf{Zip}$, which $\textbf{Z}$ips up CL$\textbf{ip}$ and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on the COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient fine-tuning. Furthermore, annotation-free Zip even achieves comparable performance to the best-performing open-vocabulary object detectors that use base annotations. Code is released at this https URL
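The boundary prior itself is simple to probe: cluster the patch tokens of one intermediate CLIP layer and look at cluster transitions. A minimal sketch, with the layer choice and the feature hook left as assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def boundary_prior(patch_tokens: np.ndarray, h: int, w: int, k: int = 6) -> np.ndarray:
    """patch_tokens: (h*w, D) tokens from an intermediate CLIP layer (obtained
    e.g. via a forward hook; which layer works best is an assumption here).
    Returns an (h, w) boolean map that is True where neighbouring patches fall
    into different clusters, i.e. a coarse instance-boundary prior."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(patch_tokens).reshape(h, w)
    boundary = np.zeros((h, w), dtype=bool)
    boundary[:, 1:] |= labels[:, 1:] != labels[:, :-1]   # horizontal transitions
    boundary[1:, :] |= labels[1:, :] != labels[:-1, :]   # vertical transitions
    return boundary
```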
https://arxiv.org/abs/2404.11957
X-ray images play a vital role in intraoperative processes due to their high resolution and fast imaging speed, and they greatly facilitate subsequent segmentation, registration and reconstruction. However, excessive X-ray doses pose potential risks to human health. Data-driven algorithms from volume scans to X-ray images are restricted by the scarcity of paired X-ray and volume data. Existing methods are mainly realized by modelling the whole X-ray imaging procedure. In this study, we propose a learning-based approach termed CT2X-GAN to synthesize X-ray images in an end-to-end manner using content and style disentanglement from three different image domains. Our method decouples the anatomical structure information from CT scans and the style information from unpaired real X-ray images / digitally reconstructed radiography (DRR) images via a series of decoupling encoders. Additionally, we introduce a novel consistency regularization term to improve the stylistic resemblance between synthesized X-ray images and real X-ray images. Meanwhile, we also impose a supervised process by computing the similarity of the computed real DRR and synthesized DRR images. We further develop a pose attention module to fully strengthen the comprehensive information in the decoupled content code from CT scans, facilitating high-quality multi-view image synthesis in the lower 2D space. Extensive experiments were conducted on the publicly available CTSpine1K dataset, achieving 97.8350, 0.0842 and 3.0938 in terms of FID, KID and our user-scored X-ray similarity, respectively. In comparison with 3D-aware methods ($\pi$-GAN, EG3D), CT2X-GAN is superior in synthesis quality and realism with respect to real X-ray images.
https://arxiv.org/abs/2404.11889
One-shot semantic segmentation aims to segment query images given only ONE annotated support image of the same class. This task is challenging because target objects in the support and query images can differ greatly in appearance and pose (i.e., intra-class variation). Prior works suggested that incorporating more annotated support images in few-shot settings boosts performance but increases cost due to additional manual labeling. In this paper, we propose a novel approach for ONE-shot semantic segmentation, called Group-On, which packs multiple query images into batches for the benefit of mutual knowledge support within the same category. Specifically, after coarse segmentation masks of the batch of queries are predicted, query-mask pairs act as pseudo support data to enhance mask predictions mutually, under the guidance of a simple Group-On Voting module. Comprehensive experiments on three standard benchmarks show that, in the ONE-shot setting, our Group-On approach significantly outperforms previous works by considerable margins. For example, on the COCO-20i dataset, we increase mIoU scores by 8.21% and 7.46% on the ASNet and HSNet baselines, respectively. With only one support image, Group-On can even be competitive with counterparts using 5 annotated support images.
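The batching trick can be sketched as two passes plus a vote: coarse masks are predicted for the whole batch, then every other query-mask pair is reused as pseudo support. A simplified sketch, assuming a hypothetical `predict_mask(query, support_img, support_mask)` one-shot segmenter; the paper's Group-On Voting module is learned, not the plain mean used here.

```python
import numpy as np

def group_on(queries, support_img, support_mask, predict_mask, thresh=0.5):
    """queries: list of images of the same class. Pass 1 uses the single
    annotated support; pass 2 lets queries support each other, then votes."""
    coarse = [predict_mask(q, support_img, support_mask) for q in queries]  # pass 1
    refined = []
    for i, q in enumerate(queries):                                         # pass 2
        votes = [predict_mask(q, queries[j], coarse[j])
                 for j in range(len(queries)) if j != i]
        refined.append(np.mean(votes, axis=0) > thresh)  # simple probability vote
    return refined
```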
https://arxiv.org/abs/2404.11871
Medical image segmentation typically demands extensive dense annotations for model training, which is both time-consuming and skill-intensive. To mitigate this burden, exemplar-based medical image segmentation methods have been introduced to achieve effective training with only one annotated image. In this paper, we introduce a novel Cross-model Mutual learning framework for Exemplar-based Medical image Segmentation (CMEMS), which leverages two models to mutually excavate implicit information from unlabeled data at multiple granularities. CMEMS can eliminate confirmation bias and enable collaborative training to learn complementary information by enforcing consistency at different granularities across models. Concretely, cross-model image-perturbation-based mutual learning is devised by using weakly perturbed images to generate high-confidence pseudo-labels that supervise predictions on strongly perturbed images across models. This approach enables the joint pursuit of prediction consistency at the image granularity. Moreover, cross-model multi-level feature-perturbation-based mutual learning is designed by letting pseudo-labels supervise predictions from perturbed multi-level features with different resolutions, which can broaden the perturbation space and enhance the robustness of our framework. CMEMS is jointly trained using exemplar data, synthetic data, and unlabeled data in an end-to-end manner. Experimental results on two medical image datasets indicate that the proposed CMEMS outperforms state-of-the-art segmentation methods with extremely limited supervision.
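The image-granularity objective follows the familiar weak-to-strong consistency pattern, but across two models. A minimal PyTorch sketch of one direction of the cross-supervision; the confidence threshold is an assumption, and the multi-level feature perturbations are omitted:

```python
import torch
import torch.nn.functional as F

def cross_model_consistency(model_a, model_b, weak_img, strong_img,
                            conf_thresh: float = 0.9) -> torch.Tensor:
    """Model A's confident predictions on the weak view become pseudo-labels
    for model B's predictions on the strong view of the same image."""
    with torch.no_grad():
        probs_a = torch.softmax(model_a(weak_img), dim=1)   # (B, C, H, W)
        conf, pseudo = probs_a.max(dim=1)                   # per-pixel pseudo-labels
    logits_b = model_b(strong_img)
    loss = F.cross_entropy(logits_b, pseudo, reduction="none")
    return (loss * (conf > conf_thresh)).mean()             # keep confident pixels only
```

The symmetric direction (B supervising A) is obtained by swapping the two models; CMEMS trains both jointly.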
https://arxiv.org/abs/2404.11812
Autonomous driving requires an accurate representation of the environment. A strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can further be improved by aggregating sensor information over time. This is especially important in monocular camera systems to account for the lack of explicit depth and velocity measurements. The effectiveness of developed BEV encoders thus crucially depends on the operators used to aggregate temporal information and on the latent representation spaces used. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in image or in BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We treat subsequent image frames as stereo through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation. The ablation study uncovers a strong synergy of joint temporal aggregation in the image and BEV latent spaces. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.
https://arxiv.org/abs/2404.11803
Internet of Things (IoT) devices generate heterogeneous data over time; relying solely on individual data points is inadequate for accurate analysis. Segmentation is a common preprocessing step in many IoT applications, including IoT-based activity recognition, aiming to address the limitations of individual events and streamline the process. However, this step introduces at least two families of uncontrollable biases. The first is caused by the changes the segmentation process makes to the initial problem space, such as dividing the input data into 60-second windows. The second category of biases results from the segmentation process itself, including the fixing of the segmentation method and its parameters in advance. To address these biases, we propose to redefine the segmentation problem as a special case of a decomposition problem, including three key components: a decomposer, resolutions, and a composer. The inclusion of the composer task in the segmentation process facilitates an assessment of the relationship between the original problem and the problem after segmentation. This leads to an improvement in the evaluation process and, consequently, in the selection of the appropriate segmentation method. We then formally introduce our novel meta-decomposition, or learning-to-decompose, approach. It reduces the segmentation biases by treating the segmentation as a hyperparameter to be optimized by the outer learning problem. Meta-decomposition therefore improves overall system performance by dynamically selecting the appropriate segmentation method without introducing the aforementioned biases. Extensive experiments on four real-world datasets demonstrate the effectiveness of our proposal.
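Treating segmentation as a hyperparameter of an outer learning problem can be sketched as a plain search over candidate segmenters. A toy illustration only: `train_and_eval` is a hypothetical stand-in for the inner activity-recognition task, and fixed windows are just one candidate decomposer.

```python
def sliding_windows(stream, window_size):
    """A candidate decomposer: fixed-length, non-overlapping windows."""
    return [stream[i:i + window_size]
            for i in range(0, len(stream) - window_size + 1, window_size)]

def meta_decompose(stream, labels, train_and_eval, candidate_sizes=(30, 60, 120)):
    """Outer loop: select the segmentation (here, just the window size) that
    maximizes downstream validation accuracy, instead of fixing it a priori."""
    return max(candidate_sizes,
               key=lambda w: train_and_eval(sliding_windows(stream, w), labels))
```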
https://arxiv.org/abs/2404.11742
Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information about the geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework by considering both spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, flip, rotation, and scene flow. For spatial augmentations, we find that, depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields the best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We demonstrate our pre-training method on 3D object detection, where it outperforms existing equivariant and invariant approaches in many settings.
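The basic equivariance objective asks features of a transformed input to match the transformed features of the original input. A minimal PyTorch sketch for 90-degree rotations on BEV feature maps, assuming a hypothetical encoder from point clouds to BEV features; the paper additionally handles translation, scaling, flip, and a scene-flow objective:

```python
import torch
import torch.nn.functional as F

def rotation_equivariance_loss(encoder, points: torch.Tensor,
                               rot: torch.Tensor, k: int) -> torch.Tensor:
    """encoder: maps a point cloud (B, N, 3) to a BEV feature map (B, C, H, W);
    rot: (3, 3) rotation by k*90 degrees about the vertical axis.
    Equivariance: encode(rotate(x)) should equal rotate(encode(x))."""
    feat = encoder(points)                               # features of the original scene
    feat_rot_input = encoder(points @ rot.T)             # features of the rotated scene
    feat_rot_output = torch.rot90(feat, k, dims=(2, 3))  # rotate the BEV grid itself
    return F.mse_loss(feat_rot_input, feat_rot_output)
```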
https://arxiv.org/abs/2404.11737