Text-to-Image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images based on open-domain text descriptions. However, despite their success, text descriptions often struggle to adequately convey detailed controls, even when composed of long and complex texts. Moreover, recent studies have also shown that these models face challenges in understanding such complex texts and generating the corresponding images. Therefore, there is a growing need to enable more control modes beyond text description. In this paper, we introduce Uni-ControlNet, a novel approach that allows for the simultaneous utilization of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within one model. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to some dedicated adapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning costs and model size, making it more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality, and composability. Code is available at \url{this https URL}.
https://arxiv.org/abs/2305.16322
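As a rough illustration of the adapter idea described above (a sketch under assumptions, not the released Uni-ControlNet code), the snippet below stacks several spatial conditions into one tensor, encodes them with a small convolutional adapter, and injects the result into a frozen-UNet feature through a zero-initialized convolution; the class name, channel sizes, and injection point are hypothetical.

```python
import torch
import torch.nn as nn

class LocalControlAdapter(nn.Module):
    """Fuses stacked spatial conditions (edges, depth, masks) into one residual."""
    def __init__(self, num_conditions=3, hidden=64, unet_channels=320):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(num_conditions, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Zero-initialized projection, so training starts from the unmodified frozen model.
        self.zero_conv = nn.Conv2d(hidden, unet_channels, 1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, conditions, unet_feature):
        return unet_feature + self.zero_conv(self.encoder(conditions))

# Usage: three 512x512 condition maps produce a residual for a 256x256 UNet feature.
adapter = LocalControlAdapter()
conditions = torch.randn(1, 3, 512, 512)
unet_feature = torch.randn(1, 320, 256, 256)
out = adapter(conditions, unet_feature)
```

Because all local conditions share one channel-stacked input, the number of adapters stays constant no matter how many controls are combined.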
Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. Exploring the semantic alignment within modalities and the visual correspondence across frames is challenging. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. Firstly, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Secondly, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets with respective text and audio references, MUTR achieves +4.2% and +4.2% J&F improvements over state-of-the-art methods, demonstrating the significance of unified multi-modal VOS. Code is released at this https URL.
https://arxiv.org/abs/2305.16318
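The low-level temporal aggregation could look roughly like the cross-attention sketch below (shapes, names, and the pooling of frame features are assumptions, not the MUTR implementation): reference tokens attend to visual features gathered from several consecutive frames before entering the DETR-style decoder.

```python
import torch
import torch.nn as nn

class ReferenceTemporalAggregation(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_tokens, frame_features):
        """ref_tokens: (B, L, D) text/audio tokens; frame_features: (B, T*HW, D)
        visual features flattened from T consecutive frames."""
        attended, _ = self.cross_attn(ref_tokens, frame_features, frame_features)
        return self.norm(ref_tokens + attended)  # references now carry temporal context
```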
For computer vision tasks, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs remain sensitive to small shifts in the input image. To address this, we introduce novel designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve truly shift-equivariant ViTs on four well-established models, namely, Swin, SwinV2, MViTv2, and CvT, both in theory and practice. Empirically, we tested these models on image classification and semantic segmentation, achieving competitive performance across three different datasets while maintaining 100% shift consistency.
https://arxiv.org/abs/2305.16316
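Shift consistency, the metric reported above, can be measured with a check along the following lines (a minimal sketch assuming circular shifts and a standard classification model):

```python
import torch

def shift_consistency(model, images, shift=(8, 8)):
    """Fraction of samples whose top-1 prediction is unchanged by a circular shift."""
    model.eval()
    with torch.no_grad():
        pred = model(images).argmax(dim=1)
        shifted = torch.roll(images, shifts=shift, dims=(-2, -1))
        pred_shifted = model(shifted).argmax(dim=1)
    return (pred == pred_shifted).float().mean().item()
```

A classifier that is truly shift-equivariant by construction scores 1.0 under such circular shifts.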
Equivariance has gained strong interest as a desirable network property that inherently ensures robust generalization. However, when dealing with complex systems such as articulated objects or multi-object scenes, effectively capturing inter-part transformations poses a challenge, as it becomes entangled with the overall structure and local transformations. The interdependence of part assignment and per-part group action necessitates a novel equivariance formulation that allows for their co-evolution. In this paper, we present Banana, a Banach fixed-point network for equivariant segmentation with inter-part equivariance by construction. Our key insight is to iteratively solve a fixed-point problem, where point-part assignment labels and per-part SE(3)-equivariance co-evolve simultaneously. We provide theoretical derivations of both per-step equivariance and global convergence, which induces an equivariant final convergent state. Our formulation naturally provides a strict definition of inter-part equivariance that generalizes to unseen inter-part configurations. Through experiments conducted on both articulated objects and multi-object scans, we demonstrate the efficacy of our approach in achieving strong generalization under inter-part transformations, even when confronted with substantial changes in pointcloud geometry and topology.
https://arxiv.org/abs/2305.16314
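The fixed-point iteration at the core of the method can be pictured with the loop below; `assign_net` and `pose_net` are hypothetical callables standing in for the paper's equivariant assignment and per-part SE(3) estimation modules.

```python
import torch

def fixed_point_segmentation(points, assign_net, pose_net, num_parts, num_iters=10, tol=1e-4):
    """points: (N, 3). Soft part labels and per-part poses are updated in turn
    until the assignment stops changing (the fixed point)."""
    labels = torch.full((points.shape[0], num_parts), 1.0 / num_parts)
    for _ in range(num_iters):
        poses = pose_net(points, labels)        # per-part SE(3) estimates
        new_labels = assign_net(points, poses)  # re-assign points given the poses
        if (new_labels - labels).abs().max() < tol:
            labels = new_labels
            break
        labels = new_labels
    return labels, poses
```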
Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method. Project page is available at: this https URL
https://arxiv.org/abs/2305.16311
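The masked diffusion loss can be sketched as the standard noise-prediction MSE restricted to the provided concept masks (shapes and names below are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(noise_pred, noise_target, mask):
    """noise_pred / noise_target: (B, C, H, W); mask: (B, 1, H, W) with 1 inside
    the union of concept masks, so only masked pixels supervise the handles."""
    per_pixel = F.mse_loss(noise_pred, noise_target, reduction="none")
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)
```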
This paper investigates the potential of enhancing Neural Radiance Fields (NeRF) with semantics to expand their applications. Although NeRF has been proven useful in real-world applications like VR and digital creation, the lack of semantics hinders interaction with objects in complex scenes. We propose to imitate the backbone feature of off-the-shelf perception models to achieve zero-shot semantic segmentation with NeRF. Our framework reformulates the segmentation process by directly rendering semantic features and only applying the decoder from perception models. This eliminates the need for expensive backbones and benefits 3D consistency. Furthermore, we can project the learned semantics onto extracted mesh surfaces for real-time interaction. With the state-of-the-art Segment Anything Model (SAM), our framework accelerates segmentation by 16 times with comparable mask quality. The experimental results demonstrate the efficacy and computational advantages of our approach. Project page: \url{https://me.kiui.moe/san/}.
https://arxiv.org/abs/2305.16233
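Rendering semantic features instead of colors reuses the ordinary volume-rendering weights; a minimal sketch (assumed tensor layout) is:

```python
import torch

def render_features(features, densities, deltas):
    """features: (R, S, D) per-sample feature vectors along R rays with S samples;
    densities: (R, S); deltas: (R, S) sample spacings. Returns (R, D) ray features
    that a frozen perception decoder (e.g., SAM's mask decoder) can consume."""
    alphas = 1.0 - torch.exp(-densities * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas + 1e-10], dim=1), dim=1
    )[:, :-1]
    weights = alphas * trans                               # volume-rendering weights
    return (weights.unsqueeze(-1) * features).sum(dim=1)   # composited per-ray feature
```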
Semi-supervised medical image segmentation offers a promising solution for large-scale medical image analysis by significantly reducing the annotation burden while achieving comparable performance. Employing this method exhibits a high degree of potential for optimizing the segmentation process and increasing its feasibility in clinical settings during translational investigations. Recently, cross-supervised training based on different co-training sub-networks has become a standard paradigm for this task. Still, the critical issues of sub-network disagreement and label-noise suppression require further attention and progress in cross-supervised training. This paper proposes a cross-supervised learning framework based on dual classifiers (DC-Net), including an evidential classifier and a vanilla classifier. The two classifiers exhibit complementary characteristics, enabling them to handle disagreement effectively and generate more robust and accurate pseudo-labels for unlabeled data. We also incorporate the uncertainty estimation from the evidential classifier into cross-supervised training to alleviate the negative effect of the error supervision signal. Extensive experiments on the LA and Pancreas-CT datasets illustrate that DC-Net outperforms other state-of-the-art methods for semi-supervised segmentation. The code will be released soon.
https://arxiv.org/abs/2305.16216
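One way to picture the uncertainty-aware cross supervision (a sketch with assumed shapes, not the released DC-Net code): the evidential branch outputs non-negative evidence, its Dirichlet-based uncertainty is computed per voxel, and that uncertainty down-weights the cross-entropy against the pseudo-labels it provides.

```python
import torch
import torch.nn.functional as F

def evidential_uncertainty(evidence):
    """evidence: (B, K, H, W) non-negative; returns per-voxel uncertainty in (0, 1]."""
    alpha = evidence + 1.0
    return alpha.shape[1] / alpha.sum(dim=1, keepdim=True)

def weighted_cross_supervision(student_logits, pseudo_probs, evidence):
    u = evidential_uncertainty(evidence)                     # (B, 1, H, W)
    pseudo = pseudo_probs.argmax(dim=1)                      # pseudo-labels from the other branch
    ce = F.cross_entropy(student_logits, pseudo, reduction="none").unsqueeze(1)
    return ((1.0 - u) * ce).mean()                           # confident voxels contribute more
```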
Consistency learning plays a crucial role in semi-supervised medical image segmentation as it enables the effective utilization of limited annotated data while leveraging the abundance of unannotated data. The effectiveness and efficiency of consistency learning are challenged by prediction diversity and training stability, which are often overlooked by existing studies. Meanwhile, the limited quantity of labeled data for training often proves inadequate for formulating intra-class compactness and inter-class discrepancy of pseudo labels. To address these issues, we propose a self-aware and cross-sample prototypical learning method (SCP-Net) to enhance the diversity of prediction in consistency learning by utilizing a broader range of semantic information derived from multiple inputs. Furthermore, we introduce a self-aware consistency learning method that exploits unlabeled data to improve the compactness of pseudo labels within each class. Moreover, a dual loss re-weighting method is integrated into the cross-sample prototypical consistency learning method to improve the reliability and stability of our model. Extensive experiments on the ACDC and PROMISE12 datasets validate that SCP-Net outperforms other state-of-the-art semi-supervised segmentation methods and achieves significant performance gains compared to the limited supervised training. Our code will be released soon.
https://arxiv.org/abs/2305.16214
Semantic occupancy prediction aims to infer dense geometry and semantics of surroundings for an autonomous agent to operate safely in the 3D environment. Existing occupancy prediction methods are almost entirely trained on human-annotated volumetric data. Although of high quality, the generation of such 3D annotations is laborious and costly, restricting them to a few specific object categories in the training dataset. To address this limitation, this paper proposes Open Vocabulary Occupancy (OVO), a novel approach that allows semantic occupancy prediction of arbitrary classes but without the need for 3D annotations during training. Keys to our approach are (1) knowledge distillation from a pre-trained 2D open-vocabulary segmentation model to the 3D occupancy network, and (2) pixel-voxel filtering for high-quality training data generation. The resulting framework is simple, compact, and compatible with most state-of-the-art semantic occupancy prediction models. On NYUv2 and SemanticKITTI datasets, OVO achieves competitive performance compared to supervised semantic occupancy prediction approaches. Furthermore, we conduct extensive analyses and ablation studies to offer insights into the design of the proposed framework.
https://arxiv.org/abs/2305.16133
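The 2D-to-3D distillation step can be sketched as pulling each voxel embedding toward the frozen open-vocabulary pixel embedding it projects to, restricted to the voxels kept by the pixel-voxel filtering (the correspondence format below is an assumption):

```python
import torch
import torch.nn.functional as F

def pixel_voxel_distillation(voxel_embed, pixel_embed, voxel_to_pixel, valid):
    """voxel_embed: (V, D) from the 3D occupancy network; pixel_embed: (P, D) frozen
    2D open-vocabulary features; voxel_to_pixel: (V,) pixel index per voxel;
    valid: (V,) boolean mask from pixel-voxel filtering."""
    target = pixel_embed[voxel_to_pixel].detach()
    cos = F.cosine_similarity(voxel_embed, target, dim=1)
    return (1.0 - cos)[valid].mean()
```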
Autonomous vehicles rely on LiDAR sensors to perceive the environment. Adverse weather conditions like rain, snow, and fog negatively affect these sensors, reducing their reliability by introducing unwanted noise in the measurements. In this work, we tackle this problem by proposing a novel approach for detecting adverse weather effects in LiDAR data. We reformulate this problem as an outlier detection task and use an energy-based framework to detect outliers in point clouds. More specifically, our method learns to associate low energy scores with inlier points and high energy scores with outliers allowing for robust detection of adverse weather effects. In extensive experiments, we show that our method performs better in adverse weather detection and has higher robustness to unseen weather effects than previous state-of-the-art methods. Furthermore, we show how our method can be used to perform simultaneous outlier detection and semantic segmentation. Finally, to help expand the research field of LiDAR perception in adverse weather, we release the SemanticSpray dataset, which contains labeled vehicle spray data in highway-like scenarios.
https://arxiv.org/abs/2305.16129
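The energy-based formulation can be illustrated as follows (margins and the exact penalty are assumptions): per-point energies are derived from the class logits, and hinge terms push labeled inlier points toward low energy and adverse-weather points toward high energy.

```python
import torch

def point_energy(logits):
    """logits: (N, C) per-point class logits; lower energy = more inlier-like."""
    return -torch.logsumexp(logits, dim=1)

def energy_margin_loss(logits, is_outlier, m_in=-5.0, m_out=5.0):
    """is_outlier: (N,) boolean labels for adverse-weather points (spray, rain, ...)."""
    e = point_energy(logits)
    loss_in = torch.relu(e[~is_outlier] - m_in).pow(2).mean() if (~is_outlier).any() else 0.0
    loss_out = torch.relu(m_out - e[is_outlier]).pow(2).mean() if is_outlier.any() else 0.0
    return loss_in + loss_out
```

At inference, thresholding the per-point energy yields the outlier mask, while the same logits can still serve semantic segmentation.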
End-to-end simultaneous speech translation (SimulST) outputs translation while receiving the streaming speech inputs (a.k.a. streaming speech translation), and hence needs to segment the speech inputs and then translate based on the current received speech. However, segmenting the speech inputs at unfavorable moments can disrupt the acoustic integrity and adversely affect the performance of the translation model. Therefore, learning to segment the speech inputs at those moments that are beneficial for the translation model to produce high-quality translation is the key to SimulST. Existing SimulST methods, either using the fixed-length segmentation or external segmentation model, always separate segmentation from the underlying translation model, where the gap results in segmentation outcomes that are not necessarily beneficial for the translation process. In this paper, we propose Differentiable Segmentation (DiSeg) for SimulST to directly learn segmentation from the underlying translation model. DiSeg turns hard segmentation into differentiable through the proposed expectation training, enabling it to be jointly trained with the translation model and thereby learn translation-beneficial segmentation. Experimental results demonstrate that DiSeg achieves state-of-the-art performance and exhibits superior segmentation capability.
https://arxiv.org/abs/2305.16093
Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited due to the domain gap between depth maps from 3D projection and training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using stable diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2\% for zero-shot classification on OBJ\_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6\% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance.
https://arxiv.org/abs/2305.15957
Convolutional neural networks (CNN) and Transformer variants have emerged as the leading medical image segmentation backbones. Nonetheless, due to their limitations in either preserving global image context or efficiently processing irregular shapes in visual objects, these backbones struggle to effectively integrate information from diverse anatomical regions and reduce inter-individual variability, particularly for the vasculature. Motivated by the successful breakthroughs of graph neural networks (GNN) in capturing topological properties and non-Euclidean relationships across various fields, we propose NexToU, a novel hybrid architecture for medical image segmentation. NexToU comprises improved Pool GNN and Swin GNN modules from Vision GNN (ViG) for learning both global and local topological representations while minimizing computational costs. To address the containment and exclusion relationships among various anatomical structures, we reformulate the topological interaction (TI) module based on the nature of binary trees, rapidly encoding the topological constraints into NexToU. Extensive experiments conducted on three datasets (including distinct imaging dimensions, disease types, and imaging modalities) demonstrate that our method consistently outperforms other state-of-the-art (SOTA) architectures. All the code is publicly available at this https URL.
https://arxiv.org/abs/2305.15911
Pseudo-labels are widely employed in weakly supervised 3D segmentation tasks where only sparse ground-truth labels are available for learning. Existing methods often rely on empirical label selection strategies, such as confidence thresholding, to generate beneficial pseudo-labels for model training. This approach may, however, hinder the comprehensive exploitation of unlabeled data points. We hypothesize that this selective usage arises from the noise in pseudo-labels generated on unlabeled data. The noise in pseudo-labels may result in significant discrepancies between pseudo-labels and model predictions, thus confusing and affecting the model training greatly. To address this issue, we propose a novel learning strategy to regularize the generated pseudo-labels and effectively narrow the gaps between pseudo-labels and model predictions. More specifically, our method introduces an Entropy Regularization loss and a Distribution Alignment loss for weakly supervised learning in 3D segmentation tasks, resulting in an ERDA learning strategy. Interestingly, by using KL distance to formulate the distribution alignment loss, it reduces to a deceptively simple cross-entropy-based loss which optimizes both the pseudo-label generation network and the 3D segmentation network simultaneously. Despite the simplicity, our method promisingly improves the performance. We validate the effectiveness through extensive experiments on various baselines and large-scale datasets. Results show that ERDA effectively enables the effective usage of all unlabeled data points for learning and achieves state-of-the-art performance under different settings. Remarkably, our method can outperform fully-supervised baselines using only 1% of true annotations. Code and model will be made publicly available.
https://arxiv.org/abs/2305.15832
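The ERDA objective is easy to write down; the sketch below (assumed reduction over points) shows the entropy term on the pseudo-labels plus the KL alignment, and why their sum collapses to a cross-entropy between pseudo-labels and predictions:

```python
import torch

def erda_loss(pseudo_logits, model_logits, eps=1e-8):
    p = torch.softmax(pseudo_logits, dim=1)   # pseudo-label distribution (ER target)
    q = torch.softmax(model_logits, dim=1)    # 3D segmentation network prediction
    entropy = -(p * (p + eps).log()).sum(dim=1).mean()                # ER term
    kl = (p * ((p + eps).log() - (q + eps).log())).sum(dim=1).mean()  # DA term
    # entropy + kl == -(p * log q).sum(...), i.e. a cross-entropy that back-propagates
    # into both the pseudo-label generator and the segmentation network.
    return entropy + kl
```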
Medical image data are often limited due to the expensive acquisition and annotation process. Hence, training a deep-learning model with only raw data can easily lead to overfitting. One solution to this problem is to augment the raw data with various transformations, improving the model's ability to generalize to new data. However, manually configuring a generic augmentation combination and parameters for different datasets is non-trivial due to inconsistent acquisition approaches and data distributions. Therefore, automatic data augmentation has been proposed to learn favorable augmentation strategies for different datasets, but it incurs large GPU overhead. To this end, we present a novel method, called Dynamic Data Augmentation (DDAug), which is efficient and has negligible computation cost. Our DDAug develops a hierarchical tree structure to represent various augmentations and utilizes an efficient Monte-Carlo tree searching algorithm to update, prune, and sample the tree. As a result, the augmentation pipeline can be optimized for each dataset automatically. Experiments on multiple Prostate MRI datasets show that our method outperforms the current state-of-the-art data augmentation strategies.
https://arxiv.org/abs/2305.15777
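A heavily simplified picture of the tree search (not the DDAug implementation; `ops` and the `evaluate` callback are assumptions) is given below: each tree level chooses one augmentation op, a root-to-leaf path is a pipeline, and the validation score from a short training run is backed up along the path.

```python
import math

class Node:
    def __init__(self, op=None, parent=None):
        self.op, self.parent = op, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts_augmentation_search(ops, evaluate, iterations=50, depth=3):
    root = Node()
    for _ in range(iterations):
        node, path = root, []
        for _ in range(depth):                      # selection + expansion
            if not node.children:
                node.children = [Node(op, parent=node) for op in ops]
            node = max(node.children, key=Node.ucb)
            path.append(node)
        score = evaluate([n.op for n in path])      # e.g. Dice after a short training run
        for n in [root] + path:                     # back-propagate the reward
            n.visits += 1
            n.value += score
    pipeline, node = [], root                       # read off the most-visited path
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
        pipeline.append(node.op)
    return pipeline
```

Here `ops` could be simple callables or names like "rotate" and "flip", and `evaluate` would train briefly with the candidate pipeline and return a validation metric.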
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific. In this paper, we state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models. Our approach is based on the observation that student features typically contain more noise than teacher features due to the smaller capacity of the student model. To address this, we propose to denoise student features using a diffusion model trained by teacher features. This allows us to perform better distillation between the refined clean feature and teacher feature. Additionally, we introduce a light-weight diffusion model with a linear autoencoder to reduce the computation cost and an adaptive noise matching module to improve the denoising performance. Extensive experiments demonstrate that DiffKD is effective across various types of features and achieves state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks. Code will be available at this https URL.
https://arxiv.org/abs/2305.15712
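Conceptually, the distillation step refines the student feature and then matches it to the teacher; the sketch below replaces the diffusion model trained on teacher features with a plain residual denoiser purely for illustration, so it shows the denoise-then-match structure rather than the actual DiffKD model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightDenoiser(nn.Module):
    """Stand-in for the lightweight diffusion model; predicts a residual correction."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.net(x)

def denoise_then_match_loss(student_feat, teacher_feat, denoiser):
    refined = denoiser(student_feat)                   # remove student-side noise
    return F.mse_loss(refined, teacher_feat.detach())  # distill against the teacher
```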
In light of the significant progress made in the development and application of semantic segmentation tasks, there has been increasing attention towards improving the robustness of segmentation models against natural degradation factors (e.g., rain streaks) or artificial attack factors (e.g., adversarial attacks). However, most existing methods are designed to address a single degradation factor and are tailored to specific application scenarios. In this work, we present the first attempt to improve the robustness of semantic segmentation tasks by simultaneously handling different types of degradation factors. Specifically, we introduce the Preprocessing Enhanced Adversarial Robust Learning (PEARL) framework based on the analysis of our proposed Naive Adversarial Training (NAT) framework. Our approach effectively handles both rain streaks and adversarial perturbation by transferring the robustness of the segmentation model to the image derain model. Furthermore, as opposed to the commonly used Negative Adversarial Attack (NAA), we design the Auxiliary Mirror Attack (AMA) to introduce positive information prior to the training of the PEARL framework, which improves defense capability and segmentation performance. Our extensive experiments and ablation studies based on different derain methods and segmentation models have demonstrated the significant performance improvement of PEARL with AMA in defense against various adversarial attacks and rain streaks while maintaining high generalization performance across different datasets.
https://arxiv.org/abs/2305.15709
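The contrast between a standard attack and the auxiliary mirror attack can be illustrated with a single FGSM-style step (an assumption for illustration; the paper's exact AMA formulation may differ): the usual attack ascends the task loss, while the mirror attack descends it to inject positive information.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, epsilon=2.0 / 255, mirror=False):
    """Single FGSM-style step; mirror=True descends the loss (the beneficial
    'mirror' direction) instead of ascending it (the usual adversarial one)."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad = torch.autograd.grad(loss, images)[0]
    direction = -grad.sign() if mirror else grad.sign()
    return (images + epsilon * direction).clamp(0.0, 1.0).detach()
```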
Continual semantic segmentation aims to learn new classes while maintaining the information from the previous classes. Although prior studies have shown impressive progress in recent years, the fairness concern in continual semantic segmentation needs to be better addressed. Meanwhile, fairness is one of the most vital factors in deploying the deep learning model, especially in human-related or safety applications. In this paper, we present a novel Fairness Continual Learning approach to the semantic segmentation problem. In particular, under the fairness objective, a new fairness continual learning framework is proposed based on class distributions. Then, a novel Prototypical Contrastive Clustering loss is proposed to address the significant challenges in continual learning, i.e., catastrophic forgetting and background shift. Our proposed loss has also been proven to be a novel, generalized learning paradigm of knowledge distillation commonly used in continual learning. Moreover, the proposed Conditional Structural Consistency loss further regularizes the structural constraint of the predicted segmentation. Our proposed approach has achieved state-of-the-art performance on three standard scene understanding benchmarks, i.e., ADE20K, Cityscapes, and Pascal VOC, and promoted the fairness of the segmentation model.
https://arxiv.org/abs/2305.15700
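A common way to write a prototypical contrastive clustering objective, given here as an assumed formulation rather than the paper's exact loss, treats class prototypes as a softmax classifier over normalized pixel embeddings:

```python
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(features, labels, prototypes, temperature=0.1):
    """features: (N, D) pixel embeddings; labels: (N,) class ids; prototypes: (K, D)."""
    features = F.normalize(features, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    logits = features @ prototypes.t() / temperature   # (N, K) similarities
    return F.cross_entropy(logits, labels)             # attract to own prototype, repel others
```

Prototypes would typically be per-class feature means accumulated during training, covering both old and new classes.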
As the adoption of AI systems within clinical settings grows, limitations in bandwidth could create communication bottlenecks when streaming imaging data, leading to delays in patient diagnosis and treatment. As such, healthcare providers and AI vendors will require greater computational infrastructure, thereby dramatically increasing costs. To that end, we developed intelligent streaming, a state-of-the-art framework to enable accelerated, cost-effective, bandwidth-optimized, and computationally efficient AI inference for clinical decision making at scale. For classification, intelligent streaming reduced the data transmission by 99.01% and decoding time by 98.58%, while increasing throughput by 27.43x. For segmentation, our framework reduced data transmission by 90.32% and decoding time by 90.26%, while increasing throughput by 4.20x. Our work demonstrates that intelligent streaming results in faster turnaround times and a reduced overall cost of data and transmission, without negatively impacting clinical decision making using AI systems.
https://arxiv.org/abs/2305.15617
Semantic segmentation is a critical task in computer vision that aims to identify and classify individual pixels in an image, with numerous applications such as autonomous driving and medical image analysis. However, semantic segmentation can be particularly challenging due to the need for large amounts of annotated data. Annotating images is a time-consuming and costly process, often requiring expert knowledge and significant effort. In this paper, we propose a novel approach for semantic segmentation that eliminates the need for ground-truth segmentation maps. Instead, our approach requires only the rough information of individual semantic class proportions, shortened as semantic proportions. It greatly simplifies the data annotation process and thus significantly reduces the annotation time and cost, making it more feasible for large-scale applications. Moreover, it opens up new possibilities for semantic segmentation tasks where obtaining the full ground-truth segmentation maps may not be feasible or practical. Extensive experimental results demonstrate that our approach can achieve comparable and sometimes even better performance than the benchmark method that relies on the ground-truth segmentation maps. Utilising the semantic proportions suggested in this work offers a promising direction for future research in the field of semantic segmentation.
https://arxiv.org/abs/2305.15608
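Supervision from semantic proportions alone can be sketched as matching image-level class frequencies (the exact loss used in the paper is not specified here, so the cross-entropy form below is an assumption):

```python
import torch
import torch.nn.functional as F

def proportion_loss(logits, target_proportions, eps=1e-8):
    """logits: (B, K, H, W) per-pixel class scores; target_proportions: (B, K),
    each row summing to 1 (the annotated rough class proportions)."""
    probs = torch.softmax(logits, dim=1)
    pred_proportions = probs.mean(dim=(2, 3))   # predicted image-level class frequencies
    return -(target_proportions * (pred_proportions + eps).log()).sum(dim=1).mean()
```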