Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
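As a rough illustration of the multi-view consistency idea (a sketch, not FeatUp's exact objective — `backbone`, `upsampler`, and the `jitters` list below are hypothetical stand-ins):

```python
import torch.nn.functional as F

def multiview_consistency_loss(image, backbone, upsampler, jitters, lowres_hw):
    """Sketch: high-res features, once jittered and re-downsampled, should
    reproduce the backbone's low-res features of the jittered image.
    (The paper's full loss also involves a learned downsampler.)"""
    hires = upsampler(image, backbone(image))     # (B, C, H, W) high-res guess
    loss = 0.0
    for jitter in jitters:                        # e.g. small crops/flips/pads
        target = backbone(jitter(image))          # low-res view, (B, C, h, w)
        pred = F.adaptive_avg_pool2d(jitter(hires), lowres_hw)
        loss = loss + F.mse_loss(pred, target)
    return loss / len(jitters)
```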
https://arxiv.org/abs/2403.10516
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require many trials by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution-representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly memory overhead of maintaining high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features, but also finds the proper locations for placing multi-head self-attention modules. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of lightweight convolution layers and memory-efficient self-attention layers between branches at different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at \url{this https URL}.
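As a toy illustration of the multi-objective selection step (the supernet training and search are far more involved), the sketch below extracts the Pareto frontier from hypothetical per-architecture (latency, mIoU) measurements:

```python
def pareto_front(candidates):
    """Keep architectures not dominated in (latency: lower is better,
    mIoU: higher is better). `candidates` is a list of dicts with
    'latency' and 'miou' keys (illustrative format)."""
    front = []
    for a in candidates:
        dominated = any(
            b["latency"] <= a["latency"] and b["miou"] >= a["miou"]
            and (b["latency"] < a["latency"] or b["miou"] > a["miou"])
            for b in candidates
        )
        if not dominated:
            front.append(a)
    return sorted(front, key=lambda c: c["latency"])
```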
https://arxiv.org/abs/2403.10413
In this study, we address the intricate challenge of multi-task dense prediction, encompassing tasks such as semantic segmentation, depth estimation, and surface normal estimation, particularly when dealing with partially annotated data (MTPSL). The complexity arises from the absence of complete task labels for each training image. Given the inter-related nature of these pixel-wise dense tasks, our focus is on mining and capturing cross-task relationships. Existing solutions typically rely on learning global image representations for global cross-task image matching, imposing constraints that, unfortunately, sacrifice the finer structures within the images. Attempting local matching as a remedy faces hurdles due to the lack of precise region supervision, making local alignment a challenging endeavor. The introduction of the Segment Anything Model (SAM) sheds light on addressing local alignment challenges by providing free and high-quality solutions for region detection. Leveraging SAM-detected regions, the subsequent challenge lies in aligning the representations within these regions. Diverging from conventional methods that directly learn a monolithic image representation, our proposal involves modeling region-wise representations using Gaussian distributions. Aligning these distributions between corresponding regions from different tasks imparts higher flexibility and capacity to capture intra-region structures, accommodating a broader range of tasks. This approach significantly enhances our ability to effectively capture cross-task relationships, resulting in improved overall performance in partially supervised multi-task dense prediction scenarios. Extensive experiments conducted on two widely used benchmarks underscore the superior effectiveness of our proposed method, showcasing state-of-the-art performance even when compared to fully supervised methods.
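A minimal sketch of the region-as-Gaussian idea, assuming SAM masks are given and using diagonal covariances with a closed-form 2-Wasserstein distance (the paper's exact parameterization and distance may differ):

```python
import torch

def region_gaussian(feats, mask):
    """feats: (C, H, W) task features; mask: (H, W) boolean SAM region
    (assumed to contain more than one pixel)."""
    x = feats[:, mask]                          # (C, N) features in the region
    return x.mean(dim=1), x.std(dim=1) + 1e-6   # diagonal Gaussian parameters

def w2_region_alignment(feats_a, feats_b, mask):
    """Closed-form squared 2-Wasserstein distance between the diagonal
    Gaussians of the same region in two task-specific feature maps."""
    mu_a, sig_a = region_gaussian(feats_a, mask)
    mu_b, sig_b = region_gaussian(feats_b, mask)
    return ((mu_a - mu_b) ** 2).sum() + ((sig_a - sig_b) ** 2).sum()
```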
https://arxiv.org/abs/2403.10252
Surgical instrument segmentation in laparoscopy is essential for computer-assisted surgical systems. Despite the progress of deep learning in recent years, the dynamic setting of laparoscopic surgery still presents challenges for precise segmentation. The nnU-Net framework has excelled at semantic segmentation, analyzing single frames without temporal information. The framework's ease of use, including its ability to be automatically configured, and its low expertise requirements have made it a popular base framework for comparisons. Optical flow (OF) is a tool commonly used in video tasks to estimate motion and represent it in a single frame, containing temporal information. This work employs OF maps as an additional input to the nnU-Net architecture to improve its performance on the surgical instrument segmentation task, taking advantage of the fact that instruments are the main moving objects in the surgical field. With this new input, the temporal component is added indirectly without modifying the architecture. Using the CholecSeg8k dataset, three different representations of movement were estimated, used as new inputs, and compared against a baseline model. Results showed that the use of OF maps improves the detection of classes with high movement, even when these are scarce in the dataset. To further improve performance, future work may focus on implementing other OF-preserving augmentations.
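The input-level change is simple to sketch (a hedged illustration; the paper evaluates three OF representations, which may differ from the encoding below):

```python
import torch

def add_flow_input(rgb, flow):
    """Append optical-flow channels to the RGB frame so the network sees
    motion cues without architecture changes. rgb: (B, 3, H, W);
    flow: (B, 2, H, W) horizontal/vertical displacements."""
    mag = flow.norm(dim=1, keepdim=True)        # one possible OF encoding
    return torch.cat([rgb, flow, mag], dim=1)   # (B, 6, H, W); update the
                                                # configured input channels
```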
https://arxiv.org/abs/2403.10216
Landslides are one of the most destructive natural disasters in the world, posing a serious threat to human life and safety. The development of foundation models has provided a new research paradigm for large-scale landslide detection. The Segment Anything Model (SAM) has garnered widespread attention in the field of image segmentation. However, our experiments found that SAM performed poorly on the task of landslide segmentation. We propose TransLandSeg, a transfer learning approach for landslide semantic segmentation based on a vision foundation model (VFM). TransLandSeg outperforms traditional semantic segmentation models on both the Landslide4Sense dataset and the Bijie landslide dataset. Our proposed adaptive transfer learning (ATL) architecture transfers the powerful segmentation capability of SAM to landslide detection by training only 1.3% of SAM's parameters, which greatly improves training efficiency. Finally, we conducted ablation experiments on models with different ATL structures and concluded that the deployment location and residual connection of ATL play an important role in TransLandSeg's accuracy improvement.
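A minimal sketch of the adapter-with-residual pattern the abstract describes (bottleneck size and placement are assumptions; the paper ablates both):

```python
import torch.nn as nn

class ATLAdapter(nn.Module):
    """Small bottleneck with a residual connection, inserted into a frozen
    SAM encoder so only adapter weights (~1.3% of parameters) are trained."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps SAM features

def freeze_backbone(sam_model):
    for p in sam_model.parameters():   # train only the inserted adapters
        p.requires_grad = False
```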
https://arxiv.org/abs/2403.10127
Unsupervised domain adaptation (UDA) is vital for alleviating the workload of labeling 3D point cloud data and mitigating the absence of labels when facing a newly defined domain. Various methods that utilize images to enhance the performance of cross-domain 3D segmentation have recently emerged. However, pseudo labels, which are generated from models trained on the source domain and provide additional supervised signals for the unseen domain, are inadequate for 3D segmentation due to their inherent noisiness, and consequently restrict the accuracy of neural networks. With the advent of 2D visual foundation models (VFMs) and their abundant prior knowledge, we propose a novel pipeline, VFMSeg, that leverages these models to further enhance the cross-modal unsupervised domain adaptation framework. In this work, we study how to harness the knowledge priors learned by VFMs to produce more accurate labels for unlabeled target domains and improve overall performance. We first utilize a multi-modal VFM, pre-trained on large-scale image-text pairs, to provide supervised labels (VFM-PL) for images and point clouds from the target domain. Then, another VFM trained on fine-grained 2D masks is adopted to guide the generation of semantically augmented images and point clouds, mixing data from the source and target domains in the form of view frustums (FrustumMixing) to enhance the performance of neural networks. Finally, we merge class-wise predictions across modalities to produce more accurate annotations for unlabeled target domains. Our method is evaluated on various autonomous driving datasets, and the results demonstrate a significant improvement on the 3D segmentation task.
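A rough sketch of frustum-style mixing, assuming an angular wedge defines the frustum (the paper's exact mixing geometry may differ):

```python
import numpy as np

def frustum_mix(pts_src, lbl_src, pts_tgt, lbl_tgt, wedge=(-0.5, 0.5)):
    """Points whose azimuth falls inside the wedge (radians) come from the
    target scan; the rest come from the source scan. pts_*: (N, 3)."""
    def in_wedge(pts):
        az = np.arctan2(pts[:, 1], pts[:, 0])
        return (az >= wedge[0]) & (az < wedge[1])
    keep_src, keep_tgt = ~in_wedge(pts_src), in_wedge(pts_tgt)
    pts = np.concatenate([pts_src[keep_src], pts_tgt[keep_tgt]])
    lbl = np.concatenate([lbl_src[keep_src], lbl_tgt[keep_tgt]])
    return pts, lbl
```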
https://arxiv.org/abs/2403.10001
Weakly supervised surgical instrument segmentation with only instrument presence labels has rarely been explored in the surgical domain. To mitigate the highly under-constrained challenges, we extend a two-stage weakly supervised segmentation paradigm with temporal attributes from two perspectives. From a temporal equivariance perspective, we propose a prototype-based temporal equivariance regulation loss to enhance pixel-wise consistency between adjacent features. From a semantic continuity perspective, we propose a class-aware temporal semantic continuity loss to constrain the semantic consistency between a global view of the target frame and local non-discriminative regions of an adjacent reference frame. To the best of our knowledge, WeakSurg is the first instrument-presence-only weakly supervised segmentation architecture to take temporal information into account for surgical scenarios. Extensive experiments are conducted on Cholec80, an open benchmark for phase and instrument recognition, for which we annotate instance-wise instrument labels at fixed time-steps, double-checked by a clinician with 3 years of experience. Our results show that WeakSurg compares favorably with state-of-the-art methods not only on semantic segmentation metrics but also on instance segmentation metrics.
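One way to picture a prototype-based temporal consistency term (a loose sketch under strong simplifications — the paper's losses also handle frame alignment and class-aware region selection):

```python
import torch.nn.functional as F

def temporal_prototype_consistency(feat_t, feat_ref, prototypes):
    """Pixel-to-prototype soft assignments of adjacent frames should agree.
    feat_*: (C, H, W); prototypes: (K, C). Inter-frame warping is omitted."""
    def assign(f):
        f = F.normalize(f.flatten(1).t(), dim=1)   # (HW, C)
        p = F.normalize(prototypes, dim=1)         # (K, C)
        return (f @ p.t()).softmax(dim=1)          # (HW, K) soft assignments
    return F.l1_loss(assign(feat_t), assign(feat_ref))
```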
https://arxiv.org/abs/2403.09551
Semantic segmentation is one of the most challenging vision tasks, usually requiring large amounts of training data with expensive pixel-level annotations. With the success of foundation models, and especially vision-language models, recent works attempt to achieve zero-shot semantic segmentation while requiring either large-scale training or additional image/pixel-level annotations. In this work, we build a lightweight module on top of a self-supervised pretrained vision encoder to align patch features with a pre-trained text encoder. Importantly, we generate free annotations for any semantic segmentation dataset using existing foundation models and train our alignment module at no annotation cost. We use CLIP to detect objects and SAM to generate high-quality object masks. Our approach can bring language-based semantics to any pre-trained vision encoder with minimal training. Our module is lightweight, uses foundation models as its sole source of supervision, and shows impressive generalization capability from little training data with no annotation.
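A minimal sketch of such an alignment module, assuming frozen encoders and illustrative names throughout:

```python
import torch.nn as nn
import torch.nn.functional as F

class PatchTextAligner(nn.Module):
    """Linear head mapping frozen vision-encoder patch features into the
    text-embedding space; logits are scaled cosine similarities to class
    text embeddings (e.g. from CLIP)."""
    def __init__(self, vis_dim, txt_dim):
        super().__init__()
        self.proj = nn.Linear(vis_dim, txt_dim)

    def forward(self, patch_feats, text_embeds, tau=0.07):
        v = F.normalize(self.proj(patch_feats), dim=-1)  # (B, N, D)
        t = F.normalize(text_embeds, dim=-1)             # (K, D)
        return v @ t.t() / tau                           # per-patch class logits
```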
https://arxiv.org/abs/2403.09307
Despite recent advancements in semantic segmentation, where and which pixels are hard to segment remains largely unexplored. Existing research only separates an image into easy and hard regions and empirically observes that the latter are associated with object boundaries. In this paper, we conduct a comprehensive analysis of hard pixel errors, categorizing them into three types: false responses, merging mistakes, and displacements. Our findings reveal a quantitative association between hard pixels and aliasing, the distortion caused by the overlapping of frequency components in the Fourier domain during downsampling. To identify the frequencies responsible for aliasing, we propose using the equivalent sampling rate to calculate the Nyquist frequency, which marks the threshold for aliasing. We then introduce the aliasing score as a metric to quantify the extent of aliasing. While all three types of hard pixels are positively correlated with the proposed aliasing score, they exhibit different patterns. Accordingly, we propose two novel modules, a de-aliasing filter (DAF) and frequency mixing (FreqMix), to alleviate aliasing degradation by accurately removing or adjusting frequencies higher than the Nyquist frequency. The DAF precisely removes the frequencies responsible for aliasing before downsampling, while FreqMix dynamically selects high-frequency components within the encoder block. Experimental results demonstrate consistent improvements on semantic segmentation and low-light instance segmentation tasks. The code is available at: \url{this https URL}.
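To make the Nyquist argument concrete, here is a hedged sketch of FFT-based de-aliasing before downsampling, with an energy-based aliasing score (the paper's DAF and scoring are defined differently in detail):

```python
import torch

def dealias_downsample(x, stride=2):
    """Zero out frequencies above the Nyquist limit of the downsampled grid
    before subsampling. x: (B, C, H, W). Returns the subsampled map and the
    fraction of spectral energy that would otherwise have aliased."""
    B, C, H, W = x.shape
    Xf = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    fy = torch.fft.fftshift(torch.fft.fftfreq(H)).view(1, 1, H, 1)
    fx = torch.fft.fftshift(torch.fft.fftfreq(W)).view(1, 1, 1, W)
    nyquist = 0.5 / stride                   # from the equivalent sampling rate
    keep = (fy.abs() <= nyquist) & (fx.abs() <= nyquist)
    x_lp = torch.fft.ifft2(torch.fft.ifftshift(Xf * keep, dim=(-2, -1))).real
    score = 1.0 - (Xf * keep).abs().pow(2).sum() / Xf.abs().pow(2).sum()
    return x_lp[..., ::stride, ::stride], score
```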
https://arxiv.org/abs/2403.09065
We present the first publicly available RGB-thermal dataset designed for aerial robotics operating in natural environments. Our dataset captures a variety of terrains across the continental United States, including rivers, lakes, coastlines, deserts, and forests, and consists of synchronized RGB, long-wave thermal, global positioning, and inertial data. Furthermore, we provide semantic segmentation annotations for 10 classes commonly encountered in natural settings in order to facilitate the development of perception algorithms robust to adverse weather and nighttime conditions. Using this dataset, we propose new and challenging benchmarks for thermal and RGB-thermal semantic segmentation, RGB-to-thermal image translation, and visual-inertial odometry. We present extensive results using state-of-the-art methods and highlight the challenges posed by temporal and geographical domain shifts in our data. Dataset and accompanying code will be provided at this https URL
https://arxiv.org/abs/2403.08997
Knee osteoarthritis is a degenerative joint disease that induces chronic pain and disability. Bone morphological analysis is a promising tool to understand the mechanical aspect of this disorder. This study proposes a 2D bone morphological analysis using manually segmented bones to explore morphological features related to distinct pain conditions. Furthermore, six semantic segmentation algorithms are assessed for extracting femur and tibia bones from X-ray images. Our analysis reveals that the morphology of the femur undergoes significant changes in instances where pain worsens. Conversely, improvements in pain may not manifest pronounced alterations in bone shape. The few-shot-learning-based algorithm, UniverSeg, demonstrated superior segmentation results with Dice scores of 99.69% for femur and 99.60% for tibia. Regarding pain condition classification, the zero-shot-learning-based algorithm, CP-SAM, achieved the highest accuracy at 66% among all models. UniverSeg is recommended for automatic knee bone segmentation, while SAM models show potential with prompt encoder modifications for optimized outcomes. These findings highlight the effectiveness of few-shot learning for semantic segmentation and the potential of zero-shot learning in enhancing classification models for knee osteoarthritis diagnosis.
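For reference, the Dice score reported above is the standard overlap metric between binary masks:

```python
import numpy as np

def dice_score(pred, gt):
    """Dice coefficient 2|A∩B| / (|A| + |B|) between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)
```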
https://arxiv.org/abs/2403.08761
In autonomous vehicles, understanding the surrounding 3D environment of the ego vehicle in real-time is essential. A compact way to represent scenes while encoding geometric distances and semantic object information is via 3D semantic occupancy maps. State-of-the-art 3D mapping methods leverage transformers with cross-attention mechanisms to elevate 2D vision-centric camera features into the 3D domain. However, these methods encounter significant challenges in real-time applications due to their high computational demands during inference. This limitation is particularly problematic in autonomous vehicles, where GPU resources must be shared with other tasks such as localization and planning. In this paper, we introduce an approach that extracts features from front-view 2D camera images and LiDAR scans, then employs a sparse convolution network (Minkowski Engine) for 3D semantic occupancy prediction. Given that outdoor scenes in autonomous driving scenarios are inherently sparse, the utilization of sparse convolution is particularly apt. By jointly solving the problems of 3D scene completion of sparse scenes and 3D semantic segmentation, we provide a more efficient learning framework suitable for real-time applications in autonomous vehicles. We also demonstrate competitive accuracy on the nuScenes dataset.
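A minimal usage sketch of sparse convolution on voxelized points, assuming the MinkowskiEngine API (`ME.SparseTensor`, `ME.MinkowskiConvolution`); shapes and channel counts are illustrative:

```python
import torch
import MinkowskiEngine as ME

# Hypothetical voxelized LiDAR scan: integer coordinates + per-point features.
coords = ME.utils.batched_coordinates([torch.randint(0, 100, (5000, 3))])
feats = torch.randn(5000, 16)

x = ME.SparseTensor(features=feats, coordinates=coords)
conv = ME.MinkowskiConvolution(16, 32, kernel_size=3, stride=2, dimension=3)
y = conv(x)  # computation touches only occupied voxels, hence the efficiency
print(y.F.shape, y.C.shape)  # output features and coordinates
```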
https://arxiv.org/abs/2403.08748
The pre-trained vision-language model, exemplified by CLIP, advances zero-shot semantic segmentation by aligning visual features with class embeddings through a transformer decoder to generate semantic masks. Despite its effectiveness, prevailing methods within this paradigm encounter challenges, including overfitting on seen classes and small fragmentation in masks. To mitigate these issues, we propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information. Specifically, we leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings. Moreover, to circumvent noisy alignments from the vision side due to its redundant nature, we introduce route attention into self-attention to find visual consensus, thereby enhancing semantic consistency within the same object. Equipped with a vision-language prompting strategy, our approach significantly boosts the generalization capacity of segmentation models for unseen classes. Experimental results underscore the effectiveness of our approach, showcasing mIoU gains of 4.5 on PASCAL VOC 2012 and 3.6 on COCO-Stuff 164k for unseen classes compared with state-of-the-art methods.
https://arxiv.org/abs/2403.08426
Despite the impressive performance achieved by data-fusion networks with duplex encoders for visual semantic segmentation, they become ineffective when spatial geometric data are not available. Implicitly infusing the spatial geometric prior knowledge acquired by a duplex-encoder teacher model into a single-encoder student model is a practical, albeit less explored research avenue. This paper delves into this topic and resorts to knowledge distillation approaches to address this problem. We introduce the Learning to Infuse "X" (LIX) framework, with novel contributions in both logit distillation and feature distillation aspects. We present a mathematical proof that underscores the limitation of using a single fixed weight in decoupled knowledge distillation and introduce a logit-wise dynamic weight controller as a solution to this issue. Furthermore, we develop an adaptively-recalibrated feature distillation algorithm, including two technical novelties: feature recalibration via kernel regression and in-depth feature consistency quantification via centered kernel alignment. Extensive experiments conducted with intermediate-fusion and late-fusion networks across various public datasets provide both quantitative and qualitative evaluations, demonstrating the superior performance of our LIX framework when compared to other state-of-the-art approaches.
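Of the two technical novelties, centered kernel alignment has a compact standard (linear-kernel) form, sketched below as a feature-consistency measure between teacher and student features (how LIX weights and applies it is specific to the paper):

```python
import torch

def linear_cka(X, Y):
    """Linear centered kernel alignment between feature matrices
    X: (n, d1) and Y: (n, d2); returns a similarity in [0, 1]."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.t() @ X).norm(p="fro") ** 2
    return hsic / ((X.t() @ X).norm(p="fro") * (Y.t() @ Y).norm(p="fro"))
```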
https://arxiv.org/abs/2403.08215
Deep learning and Convolutional Neural Networks (CNNs) have driven major transformations in diverse research areas. However, their limitations in handling low-frequency information present obstacles in certain tasks like interpreting global structures or managing smooth transition images. Despite the promising performance of transformer structures in numerous tasks, their intricate optimization complexities highlight the persistent need for refined CNN enhancements using limited resources. Responding to these complexities, we introduce a novel framework, the Multiscale Low-Frequency Memory (MLFM) Network, with the goal of harnessing the full potential of CNNs while keeping their complexity unchanged. The MLFM efficiently preserves low-frequency information, enhancing performance in targeted computer vision tasks. Central to our MLFM is the Low-Frequency Memory Unit (LFMU), which stores various low-frequency data and forms a parallel channel to the core network. A key advantage of MLFM is its seamless compatibility with various prevalent networks, requiring no alterations to their original core structure. Testing on ImageNet demonstrated substantial accuracy improvements in multiple 2D CNNs, including ResNet, MobileNet, EfficientNet, and ConvNeXt. Furthermore, we showcase MLFM's versatility beyond traditional image classification by successfully integrating it into image-to-image translation tasks, specifically in semantic segmentation networks like FCN and U-Net. In conclusion, our work signifies a pivotal stride toward optimizing the efficacy and efficiency of CNNs with limited resources. This research builds upon the existing CNN foundations and paves the way for future advancements in computer vision. Our code is available at this https URL.
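A conceptual sketch of a low-frequency parallel channel (scales and the fusion rule are illustrative assumptions, not the paper's LFMU definition):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowFreqMemory(nn.Module):
    """Cache blurred, multiscale copies of the input and re-inject them
    alongside the backbone features via a 1x1 fusion convolution."""
    def __init__(self, channels, scales=(2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.fuse = nn.Conv2d(channels * (len(scales) + 1), channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        mem = [F.interpolate(F.avg_pool2d(x, s), size=(h, w),
                             mode="bilinear", align_corners=False)
               for s in self.scales]           # low-pass copies of the input
        return self.fuse(torch.cat([x, *mem], dim=1))
```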
https://arxiv.org/abs/2403.08157
Facial attribute editing using generative models can impair automated face recognition. This degradation persists even with recent identity-preserving models such as InstantID. To mitigate this issue, we propose two techniques that perform local and global attribute editing. Local editing operates on finer details via a regularization-free method based on ControlNet conditioned on depth maps and auxiliary semantic segmentation masks. Global editing operates on coarser details via a regularization-based method guided by a custom loss and regularization set. In this work, we empirically ablate twenty-six facial semantic, demographic, and expression-based attributes altered using state-of-the-art generative models and evaluate them using the ArcFace and AdaFace matchers on the CelebA, CelebAMaskHQ, and LFW datasets. Finally, we use LLaVA, a vision-language framework, for attribute prediction to validate our editing techniques. Our methods outperform SoTA (BLIP, InstantID) at facial editing while retaining identity.
https://arxiv.org/abs/2403.08092
Recent weakly supervised semantic segmentation (WSSS) methods strive to incorporate contextual knowledge to improve the completeness of class activation maps (CAM). In this work, we argue that the knowledge bias between instances and contexts affects the capability of the prototype to sufficiently understand instance semantics. Inspired by prototype learning theory, we propose leveraging prototype awareness to capture diverse and fine-grained feature attributes of instances. The hypothesis is that contextual prototypes might erroneously activate similar and frequently co-occurring object categories due to this knowledge bias. Therefore, we propose to enhance the prototype representation ability by mitigating the bias to better capture spatial coverage in semantic object regions. With this goal, we present a Context Prototype-Aware Learning (CPAL) strategy, which leverages semantic context to enrich instance comprehension. The core of this method is to accurately capture intra-class variations in object features through context-aware prototypes, facilitating the adaptation to the semantic attributes of various instances. We design feature distribution alignment to optimize prototype awareness, aligning instance feature distributions with dense features. In addition, a unified training framework is proposed to combine label-guided classification supervision and prototypes-guided self-supervision. Experimental results on PASCAL VOC 2012 and MS COCO 2014 show that CPAL significantly improves off-the-shelf methods and achieves state-of-the-art performance. The project is available at this https URL.
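As a rough sketch of how class prototypes can be formed from CAMs (the thresholding and pooling choices here are assumptions):

```python
import torch
import torch.nn.functional as F

def class_prototypes(feats, cam, num_classes, thresh=0.5):
    """Masked average pooling of features under thresholded CAM responses.
    feats: (C, H, W); cam: (K, H, W) class activation maps."""
    protos = []
    for k in range(num_classes):
        mask = (cam[k] > thresh).float()
        denom = mask.sum().clamp(min=1.0)
        protos.append((feats * mask).flatten(1).sum(dim=1) / denom)
    return F.normalize(torch.stack(protos), dim=1)   # (K, C) unit prototypes
```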
https://arxiv.org/abs/2403.07630
Recently, some large-kernel convnets strike back with appealing performance and efficiency. However, given the square complexity of convolution, scaling up kernels can bring about an enormous number of parameters, and the proliferated parameters can induce severe optimization problems. Due to these issues, current CNNs compromise by scaling up to 51x51 in the form of stripe convolution (i.e., 51x5 + 5x51) and start to saturate as the kernel size continues growing. In this paper, we delve into addressing these vital issues and explore whether we can continue scaling up kernels for more performance gains. Inspired by human vision, we propose a human-like peripheral convolution that efficiently reduces over 90% of the parameter count of dense grid convolution through parameter sharing, and manages to scale up the kernel size to extremely large. Our peripheral convolution behaves highly similarly to human vision, reducing the complexity of convolution from O(K^2) to O(logK) without hurting performance. Built on this, we propose the Parameter-efficient Large Kernel Network (PeLK). Our PeLK outperforms modern vision transformers and ConvNet architectures such as Swin, ConvNeXt, RepLKNet and SLaK on various vision tasks including ImageNet classification, semantic segmentation on ADE20K, and object detection on MS COCO. For the first time, we successfully scale up the kernel size of CNNs to an unprecedented 101x101 and demonstrate consistent improvements.
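A simplified sketch of the parameter-sharing idea: weights are shared within exponentially growing peripheral zones of the kernel, so a depthwise KxK filter needs only O(log^2 K) parameters per channel in this toy version (the paper's design differs in detail and reports O(logK) complexity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def exponential_zones(k):
    """Map the k offsets of one kernel axis to O(log k) shared zones that
    grow outward from the center (1, 1, 2, 4, ... cells per zone)."""
    c = k // 2
    return torch.tensor([0 if i == c else abs(i - c).bit_length()
                         for i in range(k)], dtype=torch.long)

class PeripheralConv2d(nn.Module):
    """Depthwise KxK convolution whose weights are shared within
    exponentially growing peripheral zones."""
    def __init__(self, channels, k=51):
        super().__init__()
        z = exponential_zones(k)
        self.register_buffer("index", z[:, None] * (int(z.max()) + 1) + z[None, :])
        self.weight = nn.Parameter(0.01 * torch.randn(channels, int(self.index.max()) + 1))
        self.k = k

    def forward(self, x):
        # Expand shared zone weights into a full dense KxK depthwise kernel.
        w = self.weight[:, self.index.flatten()].view(-1, 1, self.k, self.k)
        return F.conv2d(x, w, padding=self.k // 2, groups=x.shape[1])
```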
https://arxiv.org/abs/2403.07589
Interpreting camera data is key for autonomously acting systems, such as autonomous vehicles. Vision systems that operate in real-world environments must be able to understand their surroundings and need the ability to deal with novel situations. This paper tackles open-world semantic segmentation, i.e., the variant of interpreting image data in which objects occur that have not been seen during training. We propose a novel approach that performs accurate closed-world semantic segmentation and, at the same time, can identify new categories without requiring any additional training data. Our approach additionally provides a similarity measure for every newly discovered class in an image to a known category, which can be useful information in downstream tasks such as planning or mapping. Through extensive experiments, we show that our model achieves state-of-the-art results on classes known from training data as well as for anomaly segmentation and can distinguish between different unknown classes.
https://arxiv.org/abs/2403.07532
Deep neural networks for medical image segmentation often produce overconfident results misaligned with empirical observations. Such miscalibration challenges their clinical translation. We propose to use the marginal L1 average calibration error (mL1-ACE) as a novel auxiliary loss function to improve pixel-wise calibration without compromising segmentation quality. We show that this loss, despite using hard binning, is directly differentiable, bypassing the need for approximate but differentiable surrogate or soft binning approaches. Our work also introduces the concept of dataset reliability histograms, which generalises standard reliability diagrams for refined visual assessment of calibration in semantic segmentation, aggregated at the dataset level. Using mL1-ACE, we reduce average and maximum calibration error by 45% and 55% respectively, while maintaining a Dice score of 87% on the BraTS 2021 dataset. We share our code here: this https URL
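A sketch of the hard-binned loss idea (bin count and aggregation are assumptions; see the paper's code for the exact definition). Because bin membership is piecewise constant in the probabilities, the per-bin means remain differentiable almost everywhere, which is the abstract's point:

```python
import torch

def ml1_ace(probs, target_onehot, n_bins=10):
    """Marginal L1 average calibration error, sketched. probs and
    target_onehot: (N, K) float tensors flattened over pixels."""
    edges = torch.linspace(0, 1, n_bins + 1, device=probs.device)
    edges[-1] += 1e-6                       # include p == 1.0 in the top bin
    total = 0.0
    for k in range(probs.shape[1]):         # marginal: one ACE per class
        p, y = probs[:, k], target_onehot[:, k]
        err, used = 0.0, 0
        for b in range(n_bins):
            in_bin = (p >= edges[b]) & (p < edges[b + 1])
            if in_bin.any():                # |mean confidence - accuracy|
                err = err + (p[in_bin].mean() - y[in_bin].mean()).abs()
                used += 1
        total = total + err / max(used, 1)
    return total / probs.shape[1]
```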
https://arxiv.org/abs/2403.06759