Weakly Incremental Learning for Semantic Segmentation (WILSS) leverages a pre-trained segmentation model to segment new classes using cost-effective and readily available image-level labels. A prevailing way to solve WILSS is the generation of seed areas for each new class, serving as a form of pixel-level supervision. However, a scenario usually arises where a pixel is concurrently predicted as an old class by the pre-trained segmentation model and as a new class by the seed areas. Such a scenario becomes particularly problematic in WILSS, as the lack of pixel-level annotations on new classes makes it intractable to ascertain whether the pixel pertains to the new class or not. To surmount this issue, we propose an innovative, tendency-driven relationship of mutual exclusivity, meticulously tailored to govern the behavior of the seed areas and the predictions generated by the pre-trained segmentation model. This relationship stipulates that predictions for the new and old classes must not conflict while prioritizing the preservation of predictions for the old classes, which not only addresses the conflicting-prediction issue but also effectively mitigates the inherent challenge of incremental learning: catastrophic forgetting. Furthermore, under this tendency-driven mutual exclusivity relationship, we generate pseudo masks for the new classes, which are optimized concurrently with the model parameters by solving a bi-level optimization problem. Extensive experiments substantiate the effectiveness of our framework, establishing new benchmarks and paving the way for further research in this field.
https://arxiv.org/abs/2404.11981
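To make the conflict rule concrete, here is a minimal sketch of one way a tendency-driven exclusivity check could be applied when building pseudo masks; the thresholds and function names are illustrative assumptions, not the paper's implementation.

```python
import torch

def resolve_conflicts(old_probs, new_seed_scores, old_thresh=0.7, seed_thresh=0.5):
    """Toy conflict rule for WILSS-style pseudo-mask generation (assumed, not the paper's exact rule).

    old_probs:       (C_old, H, W) softmax output of the frozen old-class model.
    new_seed_scores: (C_new, H, W) seed-area scores (e.g. from CAMs) for the new classes.
    Returns a pseudo mask of shape (H, W) with 0 = background/old and 1..C_new = new classes.
    """
    old_conf, _ = old_probs.max(dim=0)              # confidence of the old model per pixel
    new_conf, new_cls = new_seed_scores.max(dim=0)  # strongest new-class seed per pixel

    new_fires = new_conf > seed_thresh
    old_fires = old_conf > old_thresh

    # Mutual exclusivity with a tendency toward the old classes:
    # wherever the old model is confident, the new-class seed is vetoed.
    keep_new = new_fires & ~old_fires
    pseudo_mask = torch.zeros_like(new_cls)
    pseudo_mask[keep_new] = new_cls[keep_new] + 1   # 1-indexed new classes
    return pseudo_mask
```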
One-shot semantic segmentation aims to segment query images given only ONE annotated support image of the same class. This task is challenging because target objects in the support and query images can be largely different in appearance and pose (i.e., intra-class variation). Prior works suggested that incorporating more annotated support images in few-shot settings boosts performances but increases costs due to additional manual labeling. In this paper, we propose a novel approach for ONE-shot semantic segmentation, called Group-On, which packs multiple query images in batches for the benefit of mutual knowledge support within the same category. Specifically, after coarse segmentation masks of the batch of queries are predicted, query-mask pairs act as pseudo support data to enhance mask predictions mutually, under the guidance of a simple Group-On Voting module. Comprehensive experiments on three standard benchmarks show that, in the ONE-shot setting, our Group-On approach significantly outperforms previous works by considerable margins. For example, on the COCO-20i dataset, we increase mIoU scores by 8.21% and 7.46% on ASNet and HSNet baselines, respectively. With only one support image, Group-On can be even competitive with the counterparts using 5 annotated support images.
https://arxiv.org/abs/2404.11871
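A rough sketch of the voting idea, assuming each query in the batch has already produced a coarse foreground probability map under every other query acting as pseudo support; the tensor layout and the simple averaging rule are assumptions, not the actual Group-On Voting module.

```python
import torch

def group_on_vote(coarse_probs):
    """Toy voting step in the spirit of Group-On (assumed form, not the paper's module).

    coarse_probs: (B, B, H, W) where coarse_probs[i, j] is the foreground probability
    for query i when query j (with its coarse mask) acts as pseudo support.
    Returns (B, H, W) refined masks by averaging the votes over pseudo supports.
    """
    B = coarse_probs.shape[0]
    eye = torch.eye(B, dtype=torch.bool)
    votes = coarse_probs.clone()
    votes[eye] = 0.0                       # exclude self-support on the diagonal
    refined = votes.sum(dim=1) / (B - 1)   # average the remaining votes
    return (refined > 0.5).float()
```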
The emergence of attention-based transformer models has led to their extensive use in various tasks, due to their superior generalization and transfer properties. Recent research has demonstrated that such models, when prompted appropriately, are excellent for few-shot inference. However, such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work, we examine the effectiveness of prompting a transformer-decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task. Our goal is to achieve strong performance not only on novel categories with limited examples, but also to retain performance on base categories. We propose an approach to learn visual prompts with limited examples. These learned visual prompts are used to prompt a multiscale transformer decoder to facilitate accurate dense predictions. Additionally, we introduce a unidirectional causal attention mechanism between the novel prompts, learned with limited examples, and the base prompts, learned with abundant data. This mechanism enriches the novel prompts without deteriorating the base class performance. Overall, this form of prompting helps us achieve state-of-the-art performance for GFSS on two different benchmark datasets: COCO-$20^i$ and Pascal-$5^i$, without the need for test-time optimization (or transduction). Furthermore, test-time optimization leveraging unlabelled test data can be used to improve the prompts, which we refer to as transductive prompt tuning.
https://arxiv.org/abs/2404.11732
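The unidirectional attention between base and novel prompts can be pictured with a simple attention mask; the sketch below is an assumed realization (single head, no projections) rather than the paper's module.

```python
import torch
import torch.nn.functional as F

def prompt_attention(base, novel):
    """Sketch of one-way attention between prompt sets (assumed layout, not the exact GFSS module).

    base:  (Nb, d) prompts learned on abundant base-class data (kept untouched by novel prompts).
    novel: (Nn, d) prompts learned from a few examples; allowed to read from the base prompts.
    """
    prompts = torch.cat([base, novel], dim=0)
    Nb, N = base.shape[0], base.shape[0] + novel.shape[0]
    attn = prompts @ prompts.t() / prompts.shape[-1] ** 0.5

    # Unidirectional mask: base rows may only see base columns, novel rows may see
    # everything, so information flows base -> novel only and base prompts stay intact.
    mask = torch.zeros(N, N, dtype=torch.bool)
    mask[:Nb, Nb:] = True
    attn = attn.masked_fill(mask, float("-inf"))
    return F.softmax(attn, dim=-1) @ prompts
```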
Nowadays, the accurate geo-localization of ground-view images plays an important role across domains as diverse as journalism, forensic analysis, transportation, and Earth Observation. This work addresses the problem of matching a query ground-view image with the corresponding satellite image without GPS data. This is done by comparing the features of a ground-view image and a satellite one, innovatively leveraging the satellite image's corresponding segmentation mask through a three-stream Siamese-like network. The proposed method, Semantic Align Net (SAN), focuses on limited Field-of-View (FoV) and ground panorama images (images with a FoV of 360°). The novelty lies in the fusion of satellite images with their semantic segmentation masks, aimed at ensuring that the model can extract useful features and focus on the significant parts of the images. This work shows how SAN, through semantic analysis of images, improves performance on the unlabelled CVUSA dataset for all the tested FoVs.
https://arxiv.org/abs/2404.11302
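A heavily reduced sketch of a three-stream matching network of the kind described, with tiny stand-in encoders; the layer sizes and fusion choice are assumptions made only to show how the satellite image and its segmentation mask could be fused before comparison with the ground view.

```python
import torch
import torch.nn as nn

class ThreeStreamSAN(nn.Module):
    """Very reduced sketch of a three-stream matching network (assumed layout, not the published SAN)."""
    def __init__(self, dim=128):
        super().__init__()
        def backbone(c):
            return nn.Sequential(
                nn.Conv2d(c, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.ground_net = backbone(3)   # limited-FoV or panoramic ground view
        self.sat_net = backbone(3)      # satellite RGB
        self.mask_net = backbone(1)     # semantic segmentation mask of the satellite image
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, ground, sat, sat_mask):
        g = self.ground_net(ground)
        s = self.fuse(torch.cat([self.sat_net(sat), self.mask_net(sat_mask)], dim=1))
        # Higher cosine similarity means a better ground-to-satellite match.
        return nn.functional.cosine_similarity(g, s)
```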
Data from satellites or aerial vehicles are most of the time unlabelled. Annotating such data accurately is difficult, requires expertise, and is costly in terms of time. Even if Earth Observation (EO) data were correctly labelled, labels might change over time. Learning from unlabelled data within a semi-supervised learning framework for segmentation of aerial images is challenging. In this paper, we develop a new model for semantic segmentation of unlabelled images, the Non-annotated Earth Observation Semantic Segmentation (NEOS) model. NEOS performs domain adaptation, as the target domain does not have ground-truth semantic segmentation masks. The distribution inconsistencies between the target and source domains are due to differences in acquisition scenes, environment conditions, sensors, and times. Our model aligns the learned representations of the different domains to make them coincide. The evaluation results show that NEOS is successful and outperforms other models for semantic segmentation of unlabelled data.
https://arxiv.org/abs/2404.11299
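As a stand-in for the alignment step, the sketch below penalizes differences between source and target feature statistics; NEOS's actual alignment objective may be different, so treat this purely as an illustration of making two domains coincide in feature space.

```python
import torch

def moment_alignment_loss(src_feats, tgt_feats):
    """Toy distribution-alignment objective (a stand-in; NEOS's alignment may differ).

    src_feats, tgt_feats: (N, D) features from labelled source and unlabelled target images.
    Penalizes differences in the first two moments so the two domains coincide in feature space.
    """
    mean_gap = (src_feats.mean(dim=0) - tgt_feats.mean(dim=0)).pow(2).sum()
    var_gap = (src_feats.var(dim=0) - tgt_feats.var(dim=0)).pow(2).sum()
    return mean_gap + var_gap
```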
We propose a new tiling strategy, Flip-n-Slide, which has been developed for specific use with large Earth observation satellite images when the location of objects-of-interest (OoI) is unknown and spatial context can be necessary for class disambiguation. Flip-n-Slide is a concise and minimalistic approach that allows OoI to be represented at multiple tile positions and orientations. This strategy introduces multiple views of spatio-contextual information, without introducing redundancies into the training set. By maintaining distinct transformation permutations for each tile overlap, we enhance the generalizability of the training set without misrepresenting the true data distribution. Our experiments validate the effectiveness of Flip-n-Slide in the task of semantic segmentation, a necessary data product in geophysical studies. We find that Flip-n-Slide outperforms the previous state-of-the-art augmentation routines for tiled data in all evaluation metrics. For underrepresented classes, Flip-n-Slide increases precision by as much as 15.8%.
https://arxiv.org/abs/2404.10927
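A small sketch of overlapping tiling where each overlapping grid position receives a distinct flip or rotation, so the same object of interest is seen in several non-redundant views; the tile size, stride, and transform assignment are illustrative assumptions.

```python
import numpy as np

def flip_n_slide(image, tile=256, stride=128):
    """Sketch of overlapping tiling with a distinct transform per overlap (illustrative, not the exact recipe).

    Slides a window with 50% overlap so an object of interest appears in several tiles;
    each grid position gets a different flip/rotation so overlapping views are not redundant copies.
    """
    transforms = [
        lambda t: t,                             # identity
        lambda t: np.flip(t, axis=1),            # horizontal flip
        lambda t: np.flip(t, axis=0),            # vertical flip
        lambda t: np.rot90(t, 2, axes=(0, 1)),   # 180-degree rotation
    ]
    tiles = []
    H, W = image.shape[:2]
    for k, y in enumerate(range(0, H - tile + 1, stride)):
        for j, x in enumerate(range(0, W - tile + 1, stride)):
            t = image[y:y + tile, x:x + tile]
            tiles.append(transforms[(k + j) % len(transforms)](t).copy())
    return tiles
```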
Large vision-language models revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary. VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts the set of candidate categories from the most semantically similar captions in the database and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other more complex vision-language models, on classification and semantic segmentation benchmarks, while using much fewer parameters.
https://arxiv.org/abs/2404.10864
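A simplified, training-free retrieval-and-scoring loop in the spirit of CaSED; the helper functions for noun extraction and text embedding are assumed placeholders, and the real method's candidate scoring is more elaborate.

```python
import numpy as np

def cased_classify(image_emb, caption_embs, captions, extract_nouns, embed_text, k=10):
    """Training-free classification in the spirit of CaSED (simplified; helpers are assumed).

    image_emb:    (D,) vision-language embedding of the query image.
    caption_embs: (N, D) embeddings of an external caption database.
    extract_nouns(caption) -> list of candidate category names (assumed helper).
    embed_text(name) -> (D,) text embedding from the same vision-language model (assumed helper).
    """
    # 1) Retrieve the most semantically similar captions from the external database.
    sims = caption_embs @ image_emb
    top = np.argsort(-sims)[:k]

    # 2) Collect candidate category names from those captions.
    candidates = sorted({n for i in top for n in extract_nouns(captions[i])})

    # 3) Score each candidate against the image with the same model and keep the best match.
    scores = [float(embed_text(name) @ image_emb) for name in candidates]
    return candidates[int(np.argmax(scores))]
```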
Methane emissions from livestock, particularly cattle, significantly contribute to climate change. Effective methane emission mitigation strategies are crucial as the global population and demand for livestock products increase. We introduce Gasformer, a novel semantic segmentation architecture for detecting low-flow-rate methane emissions from livestock and from controlled-release experiments, using optical gas imaging. We present two unique datasets captured with a FLIR GF77 OGI camera. Gasformer leverages a Mix Vision Transformer encoder and a Light-Ham decoder to generate multi-scale features and refine segmentation maps. Gasformer outperforms other state-of-the-art models on both datasets, demonstrating its effectiveness in detecting and segmenting methane plumes in controlled and real-world scenarios. On the livestock dataset, Gasformer achieves an mIoU of 88.56%, surpassing other state-of-the-art models. Materials are available at: this http URL.
https://arxiv.org/abs/2404.10841
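The overall shape of such an encoder-decoder pipeline (multi-scale features fused by a light decoder head) can be sketched with stand-in modules; this does not reproduce the actual Mix Vision Transformer or Light-Ham components, and the single-channel input is an assumption about the OGI frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegmenter(nn.Module):
    """Generic multi-scale encoder plus light fusion head, only to illustrate the overall
    shape of a Gasformer-style pipeline (stand-in modules, not the published architecture)."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(1, 32, 3, 2, 1), nn.ReLU())   # 1-channel OGI frame (assumed)
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU())
        self.head = nn.Conv2d(32 + 64 + 128, num_classes, 1)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        size = f1.shape[-2:]
        fused = torch.cat([f1,
                           F.interpolate(f2, size=size, mode="bilinear", align_corners=False),
                           F.interpolate(f3, size=size, mode="bilinear", align_corners=False)], dim=1)
        return F.interpolate(self.head(fused), scale_factor=2, mode="bilinear", align_corners=False)
```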
Anomaly detection (AD) is often focused on detecting anomaly areas for industrial quality inspection and medical lesion examination. However, due to the specific scenario targets, the data scale for AD is relatively small, and evaluation metrics are still deficient compared to classic vision tasks, such as object detection and semantic segmentation. To fill these gaps, this work first constructs a large-scale and general-purpose COCO-AD dataset by extending COCO to the AD field. This enables fair evaluation and sustainable development for different methods on this challenging benchmark. Moreover, current metrics such as AU-ROC have nearly reached saturation on simple datasets, which prevents a comprehensive evaluation of different methods. Inspired by the metrics in the segmentation field, we further propose several more practical threshold-dependent AD-specific metrics, i.e., m$F_1^{.2}_{.8}$, mAcc$^{.2}_{.8}$, mIoU$^{.2}_{.8}$, and mIoU-max. Motivated by GAN inversion's high-quality reconstruction capability, we propose a simple but more powerful InvAD framework to achieve high-quality feature reconstruction. Our method improves the effectiveness of reconstruction-based methods on the popular MVTec AD, VisA, and our newly proposed COCO-AD datasets under a multi-class unsupervised setting, where only a single detection model is trained to detect anomalies from different classes. Extensive ablation experiments have demonstrated the effectiveness of each component of our InvAD. Full codes and models are available at this https URL.
https://arxiv.org/abs/2404.10760
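One plausible reading of the threshold-dependent metrics is an F1 averaged over binarization thresholds from 0.2 to 0.8; the sketch below computes such a score and is an assumption about the notation, not the benchmark's reference implementation.

```python
import numpy as np

def mF1_range(pred, gt, lo=0.2, hi=0.8, steps=7):
    """Sketch of a threshold-averaged F1 in the spirit of mF1^{.2}_{.8} (my reading of the notation:
    the anomaly map is binarized at thresholds from 0.2 to 0.8 and F1 is averaged over them).

    pred: (H, W) anomaly scores in [0, 1];  gt: (H, W) binary ground-truth anomaly mask.
    """
    gt = gt.astype(bool)
    f1s = []
    for t in np.linspace(lo, hi, steps):
        p = pred >= t
        tp = np.logical_and(p, gt).sum()
        precision = tp / max(p.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        f1s.append(0.0 if tp == 0 else 2 * precision * recall / (precision + recall))
    return float(np.mean(f1s))
```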
We introduce ECLAIR (Extended Classification of Lidar for AI Recognition), a new outdoor large-scale aerial LiDAR dataset designed specifically for advancing research in point cloud semantic segmentation. As the most extensive and diverse collection of its kind to date, the dataset covers a total area of 10 $km^2$ with close to 600 million points and features eleven distinct object categories. To guarantee the dataset's quality and utility, we have thoroughly curated the point labels through an internal team of experts, ensuring accuracy and consistency in semantic labeling. The dataset is engineered to move forward the fields of 3D urban modeling, scene understanding, and utility infrastructure management by presenting new challenges and potential applications. As a benchmark, we report qualitative and quantitative analysis of a voxel-based point cloud segmentation approach based on the Minkowski Engine.
https://arxiv.org/abs/2404.10699
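Voxel-based pipelines like the reported benchmark start by quantizing the point cloud; a minimal version of that step is sketched below (the actual benchmark relies on the Minkowski Engine's own sparse quantization).

```python
import numpy as np

def voxelize(points, labels, voxel_size=0.1):
    """Minimal voxelization step of the kind a voxel-based segmentation pipeline needs
    (illustrative only; not the Minkowski Engine's sparse quantization).

    points: (N, 3) LiDAR coordinates in metres;  labels: (N,) per-point class ids.
    Keeps one point per occupied voxel, which bounds memory for large aerial tiles.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    _, first_idx = np.unique(coords, axis=0, return_index=True)
    return points[first_idx], labels[first_idx], coords[first_idx]
```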
Despite great improvements in semantic segmentation, challenges persist because of the lack of local/global contexts and the relationship between them. In this paper, we propose Contextrast, a contrastive learning-based semantic segmentation method that allows to capture local/global contexts and comprehend their relationships. Our proposed method comprises two parts: a) contextual contrastive learning (CCL) and b) boundary-aware negative (BANE) sampling. Contextual contrastive learning obtains local/global context from multi-scale feature aggregation and inter/intra-relationship of features for better discrimination capabilities. Meanwhile, BANE sampling selects embedding features along the boundaries of incorrectly predicted regions to employ them as harder negative samples on our contrastive learning, resolving segmentation issues along the boundary region by exploiting fine-grained details. We demonstrate that our Contextrast substantially enhances the performance of semantic segmentation networks, outperforming state-of-the-art contrastive learning approaches on diverse public datasets, e.g. Cityscapes, CamVid, PASCAL-C, COCO-Stuff, and ADE20K, without an increase in computational cost during inference.
https://arxiv.org/abs/2404.10633
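A possible realization of boundary-aware negative sampling: take pixel embeddings that are both mispredicted and near a ground-truth boundary as hard negatives for the contrastive loss. The neighbourhood test and sampling rule below are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def bane_sample(embeddings, pred, gt, num_neg=128):
    """Sketch of boundary-aware negative sampling (assumed realization, not the paper's rule).

    embeddings: (D, H, W) pixel embeddings; pred, gt: (H, W) predicted / ground-truth labels.
    Picks embeddings that are both mispredicted and close to a label boundary as hard negatives.
    """
    wrong = pred != gt
    # A pixel is near a boundary if the ground-truth label changes within its 3x3 neighbourhood.
    g = gt.float()[None, None]
    local_max = F.max_pool2d(g, 3, stride=1, padding=1)
    local_min = -F.max_pool2d(-g, 3, stride=1, padding=1)
    boundary = (local_max != local_min)[0, 0]

    idx = torch.nonzero(wrong & boundary, as_tuple=False)
    idx = idx[torch.randperm(idx.shape[0])[:num_neg]]
    return embeddings[:, idx[:, 0], idx[:, 1]].t()   # (num_neg, D) hard negatives
```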
Whole brain parcellation requires inferring hundreds of segmentation labels in large image volumes and thus presents significant practical challenges for deep learning approaches. We introduce label merge-and-split, a method that first greatly reduces the effective number of labels required for learning-based whole brain parcellation and then recovers original labels. Using a greedy graph colouring algorithm, our method automatically groups and merges multiple spatially separate labels prior to model training and inference. The merged labels may be semantically unrelated. A deep learning model is trained to predict merged labels. At inference time, original labels are restored using atlas-based influence regions. In our experiments, the proposed approach reduces the number of labels by up to 68% while achieving segmentation accuracy comparable to the baseline method without label merging and splitting. Moreover, model training and inference times as well as GPU memory requirements were reduced significantly. The proposed method can be applied to all semantic segmentation tasks with a large number of spatially separate classes within an atlas-based prior.
https://arxiv.org/abs/2404.10572
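The merge step can be illustrated with a plain greedy graph colouring over the label adjacency graph, where labels that touch spatially must receive different colours; this is a generic sketch, not the authors' code, and the atlas-based split back to original labels is omitted.

```python
def merge_labels(adjacency, labels):
    """Greedy graph colouring over the label adjacency graph, sketching the merge step.

    adjacency: dict label -> set of labels it touches spatially; touching labels must not share
    a colour, so every merged group contains only spatially separate labels and can be told
    apart again at split time.  Returns dict label -> merged-label id (the colour).
    """
    colour = {}
    for lab in sorted(labels, key=lambda l: -len(adjacency.get(l, ()))):  # most-constrained first
        used = {colour[n] for n in adjacency.get(lab, ()) if n in colour}
        c = 0
        while c in used:
            c += 1
        colour[lab] = c
    return colour

# Four labels where 1-2 and 3-4 touch: labels 1 and 3 can merge, 2 and 4 can merge.
print(merge_labels({1: {2}, 2: {1}, 3: {4}, 4: {3}}, [1, 2, 3, 4]))  # {1: 0, 2: 1, 3: 0, 4: 1}
```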
Recent large vision models (e.g., SAM) enjoy great potential to facilitate intelligent perception with high accuracy. Yet, the resource constraints in the IoT environment tend to limit such large vision models to be locally deployed, incurring considerable inference latency thereby making it difficult to support real-time applications, such as autonomous driving and robotics. Edge-cloud collaboration with large-small model co-inference offers a promising approach to achieving high inference accuracy and low latency. However, existing edge-cloud collaboration methods are tightly coupled with the model architecture and cannot adapt to the dynamic data drifts in heterogeneous IoT environments. To address the issues, we propose LAECIPS, a new edge-cloud collaboration framework. In LAECIPS, both the large vision model on the cloud and the lightweight model on the edge are plug-and-play. We design an edge-cloud collaboration strategy based on hard input mining, optimized for both high accuracy and low latency. We propose to update the edge model and its collaboration strategy with the cloud under the supervision of the large vision model, so as to adapt to the dynamic IoT data streams. Theoretical analysis of LAECIPS proves its feasibility. Experiments conducted in a robotic semantic segmentation system using real-world datasets show that LAECIPS outperforms its state-of-the-art competitors in accuracy, latency, and communication overhead while having better adaptability to dynamic environments.
https://arxiv.org/abs/2404.10498
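A toy version of confidence-based hard-input routing between an edge model and a cloud model; the thresholding policy and fusion rule are assumptions used only to illustrate the collaboration pattern.

```python
import torch

def route_and_fuse(edge_logits, cloud_model=None, image=None, conf_thresh=0.8):
    """Toy hard-input routing for edge-cloud co-inference (assumed policy, not LAECIPS itself).

    edge_logits: (C, H, W) output of the lightweight edge model.
    Easy inputs are answered locally; low-confidence ("hard") inputs are also sent to the
    large cloud model, whose prediction is kept for the uncertain pixels.
    """
    probs = edge_logits.softmax(dim=0)
    conf, edge_pred = probs.max(dim=0)
    if conf.mean() >= conf_thresh or cloud_model is None:
        return edge_pred                              # cheap path: edge-only answer
    cloud_pred = cloud_model(image).argmax(dim=0)     # expensive path for hard inputs
    return torch.where(conf < conf_thresh, cloud_pred, edge_pred)
```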
Nowadays, deep learning models have reached incredible performance in the task of image generation. Plenty of literature addresses the task of face generation and editing, with human and automatic systems struggling to distinguish what is real from what is generated. While most systems reach excellent visual generation quality, they still face difficulties in preserving the identity of the starting input subject. Among all the explored techniques, Semantic Image Synthesis (SIS) methods, whose goal is to generate an image conditioned on a semantic segmentation mask, are the most promising, even though preserving the perceived identity of the input subject is not their main concern. Therefore, in this paper, we investigate the problem of identity preservation in face image generation and present an SIS architecture that exploits a cross-attention mechanism to merge identity, style, and semantic features to generate faces whose identities are as similar as possible to the input ones. Experimental results reveal that the proposed method is not only suitable for preserving the identity but is also effective in face recognition adversarial attacks, i.e., hiding a second identity in the generated faces.
https://arxiv.org/abs/2404.10408
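A reduced sketch of a cross-attention block in which semantic-mask tokens query identity and style tokens; the dimensions and normalization are assumed, and the published generator is considerably more involved.

```python
import torch
import torch.nn as nn

class IdentityStyleCrossAttention(nn.Module):
    """Reduced sketch of cross-attention that injects identity/style cues into semantic features
    (layout assumed; the published SIS generator is more involved)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, semantic_tokens, identity_tokens, style_tokens):
        # Queries come from the semantic-mask branch; keys/values carry identity and style,
        # so each spatial location pulls in the identity/style information it needs.
        context = torch.cat([identity_tokens, style_tokens], dim=1)
        fused, _ = self.attn(query=semantic_tokens, key=context, value=context)
        return self.norm(semantic_tokens + fused)
```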
Few-shot semantic segmentation (FSS) has achieved great success on segmenting objects of novel classes, supported by only a few annotated samples. However, existing FSS methods often underperform in the presence of domain shifts, especially when encountering new domain styles that are unseen during training. It is suboptimal to directly adapt or generalize the entire model to new domains in the few-shot scenario. Instead, our key idea is to adapt a small adapter for rectifying diverse target domain styles to the source domain. Consequently, the rectified target domain features can fittingly benefit from the well-optimized source domain segmentation model, which is intently trained on sufficient source domain data. Training domain-rectifying adapter requires sufficiently diverse target domains. We thus propose a novel local-global style perturbation method to simulate diverse potential target domains by perturbating the feature channel statistics of the individual images and collective statistics of the entire source domain, respectively. Additionally, we propose a cyclic domain alignment module to facilitate the adapter effectively rectifying domains using a reverse domain rectification supervision. The adapter is trained to rectify the image features from diverse synthesized target domains to align with the source domain. During testing on target domains, we start by rectifying the image features and then conduct few-shot segmentation on the domain-rectified features. Extensive experiments demonstrate the effectiveness of our method, achieving promising results on cross-domain few-shot semantic segmentation tasks. Our code is available at this https URL.
https://arxiv.org/abs/2404.10322
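The local-global style perturbation can be pictured as re-mixing per-image channel statistics with domain-level statistics plus noise; the mixing coefficients below are assumptions meant only to convey the idea.

```python
import torch

def perturb_style(feat, global_mean, global_std, alpha=0.5, noise=0.1):
    """Sketch of local-global style perturbation (an assumed form of the idea, not the exact recipe).

    feat: (B, C, H, W) source-domain features; global_mean/global_std: (C,) statistics of the
    whole source domain.  Per-image channel statistics are mixed with the global ones and
    jittered, simulating unseen target-domain styles while keeping content intact.
    """
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-6
    normalized = (feat - mu) / sigma

    new_mu = alpha * mu + (1 - alpha) * global_mean.view(1, -1, 1, 1)
    new_sigma = alpha * sigma + (1 - alpha) * global_std.view(1, -1, 1, 1)
    new_mu = new_mu * (1 + noise * torch.randn_like(new_mu))
    new_sigma = new_sigma * (1 + noise * torch.randn_like(new_sigma))
    return normalized * new_sigma + new_mu
```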
Few-shot segmentation is a task to segment objects or regions of novel classes within an image given only a few annotated examples. In the generalized setting, the task extends to segment both the base and the novel classes. The main challenge is how to train the model such that the addition of novel classes does not hurt the base classes performance, also known as catastrophic forgetting. To mitigate this issue, we use SegGPT as our base model and train it on the base classes. Then, we use separate learnable prompts to handle predictions for each novel class. To handle various object sizes which typically present in remote sensing domain, we perform patch-based prediction. To address the discontinuities along patch boundaries, we propose a patch-and-stitch technique by re-framing the problem as an image inpainting task. During inference, we also utilize image similarity search over image embeddings for prompt selection and novel class filtering to reduce false positive predictions. Based on our experiments, our proposed method boosts the weighted mIoU of a simple fine-tuned SegGPT from 15.96 to 35.08 on the validation set of few-shot OpenEarthMap dataset given in the challenge.
https://arxiv.org/abs/2404.10307
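A bare-bones sketch of overlapping patch prediction with averaged stitching, which is the part of the pipeline that smooths seams along patch boundaries; the inpainting-style re-framing, prompt selection, and novel-class filtering are not reproduced.

```python
import torch

def patch_and_stitch(image, predict, patch=256, stride=192):
    """Sketch of overlapping patch prediction with averaged stitching (illustrative only).

    image: (C, H, W) with H, W covered by the sliding grid; predict: callable mapping a
    (C, patch, patch) crop to (K, patch, patch) logits.  Overlaps are averaged so seams
    along patch boundaries are smoothed out.
    """
    C, H, W = image.shape
    out = None
    weight = torch.zeros(1, H, W)
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            logits = predict(image[:, y:y + patch, x:x + patch])
            if out is None:
                out = torch.zeros(logits.shape[0], H, W)
            out[:, y:y + patch, x:x + patch] += logits
            weight[:, y:y + patch, x:x + patch] += 1
    return out / weight.clamp(min=1)
```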
Embodied visual tracking is the task of following a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFM) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as ``Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online agent-environment interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust tracker within an hour on a consumer-level GPU, e.g., an Nvidia RTX 3090. Such efficiency is unprecedented for RL-based visual tracking methods. We evaluate our tracker in several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned tracker from the virtual world to real-world scenarios.
https://arxiv.org/abs/2404.09857
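One plausible interface between the foundation-model mask and the policy is a compact observation derived from the mask; the encoding below (normalized centroid and area) is an assumption, not the paper's representation.

```python
import numpy as np

def mask_to_observation(mask, image_size):
    """Toy conversion of a target segmentation mask into a compact tracking observation
    (one plausible interface between the VFM and the policy; the paper's encoding may differ).

    mask: (H, W) binary mask from a text-prompted segmenter; returns normalized (cx, cy, area),
    which a recurrent policy could consume to keep the target centred at a desired distance.
    """
    H, W = image_size
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:                      # target lost: signal with an out-of-range observation
        return np.array([-1.0, -1.0, 0.0], dtype=np.float32)
    cx, cy = xs.mean() / W, ys.mean() / H
    area = len(xs) / (H * W)
    return np.array([cx, cy, area], dtype=np.float32)
```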
We propose In-Context Translation (ICT), a general learning framework to unify visual recognition (e.g., semantic segmentation), low-level image processing (e.g., denoising), and conditional image generation (e.g., edge-to-image synthesis). Thanks to unification, ICT significantly reduces the inherent inductive bias that comes with designing models for specific tasks, and it maximizes mutual enhancement across similar tasks. However, the unification across a large number of tasks is non-trivial due to various data formats and training pipelines. To this end, ICT introduces two designs. Firstly, it standardizes input-output data of different tasks into RGB image pairs, e.g., semantic segmentation data pairs an RGB image with its segmentation mask in the same RGB format. This turns different tasks into a general translation task between two RGB images. Secondly, it standardizes the training of different tasks into a general in-context learning, where "in-context" means the input comprises an example input-output pair of the target task and a query image. The learning objective is to generate the "missing" data paired with the query. The implicit translation process is thus between the query and the generated image. In experiments, ICT unifies ten vision tasks and showcases impressive performance on their respective benchmarks. Notably, compared to its competitors, e.g., Painter and PromptDiffusion, ICT trained on only 4 RTX 3090 GPUs is shown to be more efficient and less costly in training.
https://arxiv.org/abs/2404.09633
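The in-context formatting can be sketched as tiling the example input-output pair and the query onto one canvas whose last quadrant is left for the model to generate; the exact canvas layout used by ICT is assumed here.

```python
import torch

def make_in_context_input(example_in, example_out, query):
    """Sketch of the in-context layout described above (the exact canvas arrangement is assumed).

    All tensors are (3, H, W) RGB images: the task example's input/output pair plus the query.
    They are tiled into one 2x2 canvas whose bottom-right quadrant is left blank; the model is
    trained to fill in that missing quadrant, i.e. the query's output in the same RGB format.
    """
    blank = torch.zeros_like(query)
    top = torch.cat([example_in, example_out], dim=2)   # example pair side by side
    bottom = torch.cat([query, blank], dim=2)           # query next to the slot to be generated
    return torch.cat([top, bottom], dim=1)              # (3, 2H, 2W) canvas
```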
Recent advancements in image segmentation have focused on enhancing the efficiency of the models to meet the demands of real-time applications, especially on edge devices. However, existing research has primarily concentrated on single-task settings, especially on semantic segmentation, leading to redundant efforts and specialized architectures for different tasks. To address this limitation, we propose a novel architecture for efficient multi-task image segmentation, capable of handling various segmentation tasks without sacrificing efficiency or accuracy. We introduce BiSeNetFormer, which leverages the efficiency of two-stream semantic segmentation architectures and extends them into a mask classification framework. Our approach maintains the efficient spatial and context paths to capture detailed and semantic information, respectively, while leveraging an efficient transformer-based segmentation head that computes the binary masks and class probabilities. By seamlessly supporting multiple tasks, namely semantic and panoptic segmentation, BiSeNetFormer offers a versatile solution for multi-task segmentation. We evaluate our approach on popular datasets, Cityscapes and ADE20K, demonstrating impressive inference speeds while maintaining competitive accuracy compared to state-of-the-art architectures. Our results indicate that BiSeNetFormer represents a significant advancement towards fast, efficient, and multi-task segmentation networks, bridging the gap between model efficiency and task adaptability.
https://arxiv.org/abs/2404.09570
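A minimal mask-classification head of the kind the abstract refers to, in a generic Mask2Former-style form: learned queries yield per-query class logits and binary mask logits via a dot product with pixel features. This is a sketch, not BiSeNetFormer's actual head.

```python
import torch
import torch.nn as nn

class MaskClassificationHead(nn.Module):
    """Minimal mask-classification head (generic formulation, not BiSeNetFormer's exact head)."""
    def __init__(self, dim=128, num_queries=100, num_classes=19):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        self.class_head = nn.Linear(dim, num_classes + 1)   # +1 for "no object"

    def forward(self, pixel_features):
        # pixel_features: (B, dim, H, W) from the spatial/context two-stream backbone.
        q = self.queries.weight                                    # (Q, dim)
        class_logits = self.class_head(q)                          # per-query class probabilities
        masks = torch.einsum("qd,bdhw->bqhw", q, pixel_features)   # per-query binary mask logits
        return class_logits, masks
```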
Broad-scale marine surveys performed by underwater vehicles significantly increase the availability of coral reef imagery; however, it is costly and time-consuming for domain experts to label images. Point label propagation is an approach used to leverage existing image data labeled with sparse point labels. The resulting augmented ground truth is then used to train a semantic segmentation model. Here, we first demonstrate that recent advances in foundation models enable generation of multi-species coral augmented ground truth masks using denoised DINOv2 features and K-Nearest Neighbors (KNN), without the need for any pre-training or custom-designed algorithms. For extremely sparsely labeled images, we propose a labeling regime based on human-in-the-loop principles, resulting in significant improvement in annotation efficiency: if only 5 point labels per image are available, our proposed human-in-the-loop approach improves on the state of the art by 17.3% for pixel accuracy and 22.6% for mIoU, and by 10.6% and 19.1% when 10 point labels per image are available. Even if the human-in-the-loop labeling regime is not used, the denoised DINOv2 features with a KNN outperform the prior state of the art by 3.5% for pixel accuracy and 5.7% for mIoU (5 grid points). We also provide a detailed analysis of how point labeling style and the quantity of points per image affect the point label propagation quality and provide general recommendations on maximizing point label efficiency.
https://arxiv.org/abs/2404.09406
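A small sketch of feature-space KNN label propagation from sparse point labels to every pixel, assuming per-pixel features (e.g., upsampled DINOv2 features) are already available; the denoising step and the human-in-the-loop regime are not shown.

```python
import numpy as np

def propagate_point_labels(features, point_rc, point_labels, k=5):
    """Sketch of point-label propagation with feature-space KNN (per-pixel features assumed
    precomputed and upsampled to pixel resolution; denoising not shown).

    features: (H, W, D) per-pixel features;  point_rc: (P, 2) row/col of sparse point labels;
    point_labels: (P,) class ids.  Every pixel takes the majority label of its k nearest
    labelled points in feature space, giving an augmented ground-truth mask.
    """
    H, W, D = features.shape
    anchors = features[point_rc[:, 0], point_rc[:, 1]]            # (P, D) labelled features
    flat = features.reshape(-1, D)
    d2 = ((flat[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # (H*W, P) squared distances
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    votes = point_labels[nn_idx]                                  # (H*W, k) neighbour labels
    out = np.array([np.bincount(v).argmax() for v in votes])
    return out.reshape(H, W)
```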