The progress in deep learning solutions for disease diagnosis and prognosis based on cardiac magnetic resonance imaging is hindered by highly imbalanced and biased training data. To address this issue, we propose a method to alleviate imbalances inherent in datasets through the generation of synthetic data based on sensitive attributes such as sex, age, body mass index (BMI), and health condition. We adopt ControlNet based on a denoising diffusion probabilistic model to condition on text assembled from patient metadata and on cardiac geometry derived from segmentation masks, using data from a large-cohort study, specifically the UK Biobank. We assess our method by evaluating the realism of the generated images using established quantitative metrics. Furthermore, we conduct a downstream classification task aimed at debiasing a classifier by rectifying imbalances within underrepresented groups through synthetically generated samples. Our experiments demonstrate the effectiveness of the proposed approach in mitigating dataset imbalances, such as the scarcity of younger patients or individuals with a normal BMI suffering from heart failure. This work represents a major step towards the adoption of synthetic data for the development of fair and generalizable models for medical classification tasks. Notably, we conduct all our experiments using a single, consumer-level GPU to highlight the feasibility of our approach within resource-constrained environments. Our code is available at this https URL.
https://arxiv.org/abs/2403.19508
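As a rough illustration of the text conditioning described above, a prompt can be assembled from patient metadata as in the following Python sketch; the field names and template are illustrative assumptions, since the abstract does not specify the exact format:

    def metadata_prompt(sex: str, age: int, bmi: float, condition: str) -> str:
        # Hypothetical template: the paper conditions on sex, age, BMI, and
        # health condition, but the exact prompt wording is not given.
        return (f"cardiac MRI of a {age}-year-old {sex} patient, "
                f"BMI {bmi:.1f}, condition: {condition}")

    # e.g. metadata_prompt("female", 42, 23.5, "heart failure")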
Generating human motion from text has been dominated by denoising motion models, either through diffusion or a generative masking process. However, these models face significant usability limitations by requiring prior knowledge of the motion length. Conversely, autoregressive motion models address this limitation by adaptively predicting motion endpoints, at the cost of degraded generation quality and editing capabilities. To address these challenges, we propose the Bidirectional Autoregressive Motion Model (BAMM), a novel text-to-motion generation framework. BAMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and (2) a masked self-attention transformer that autoregressively predicts randomly masked tokens via a hybrid attention masking strategy. By unifying generative masked modeling and autoregressive modeling, BAMM captures rich and bidirectional dependencies among motion tokens, while learning the probabilistic mapping from textual inputs to motion outputs with dynamically adjusted motion sequence length. This feature enables BAMM to simultaneously achieve high-quality motion generation with enhanced usability and built-in motion editability. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that BAMM surpasses current state-of-the-art methods in both qualitative and quantitative measures.
https://arxiv.org/abs/2403.19435
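One plausible reading of BAMM's hybrid attention masking, sketched below under our own assumptions (the paper's exact rule may differ): a causal mask supplies the autoregressive direction, while unmasked context tokens stay visible to every position, giving bidirectional access to known context.

    import torch

    def hybrid_attention_mask(is_masked: torch.Tensor) -> torch.Tensor:
        # is_masked: (T,) bool, True where a motion token was randomly masked.
        # Returns a (T, T) bool matrix where True means attention is allowed.
        T = is_masked.shape[0]
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool))  # autoregressive part
        context = ~is_masked.unsqueeze(0).expand(T, T)           # unmasked tokens visible to all
        return causal | context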
Treatment planning, which is a critical component of the radiotherapy workflow, is typically carried out by a medical physicist in a time-consuming trial-and-error manner. Previous studies have proposed knowledge-based or deep-learning-based methods for predicting dose distribution maps to assist medical physicists in improving the efficiency of treatment planning. However, these dose prediction methods usually fail to effectively utilize distance information between surrounding tissues and targets or organs-at-risk (OARs). Moreover, they are poor at maintaining the distribution characteristics of ray paths in the predicted dose distribution maps, resulting in a loss of valuable information. In this paper, we propose a distance-aware diffusion model (DoseDiff) for precise prediction of dose distribution. We define dose prediction as a sequence of denoising steps, wherein the predicted dose distribution map is generated conditioned on the computed tomography (CT) image and signed distance maps (SDMs). The SDMs are obtained by applying a distance transform to the masks of targets or OARs, and provide the distance from each pixel in the image to the outline of the targets or OARs. We further propose a multi-encoder and multi-scale fusion network (MMFNet) that incorporates multi-scale and transformer-based fusion modules to enhance information fusion between the CT image and SDMs at the feature level. We evaluate our model on two in-house datasets and a public dataset. The results demonstrate that our DoseDiff method outperforms state-of-the-art dose prediction methods in terms of both quantitative performance and visual quality.
https://arxiv.org/abs/2306.16324
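The signed distance maps used by DoseDiff can be computed from binary masks with a standard Euclidean distance transform; a minimal sketch (the sign convention, positive outside and negative inside, is our assumption):

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def signed_distance_map(mask: np.ndarray) -> np.ndarray:
        # mask: binary array, 1 inside the target/OAR, 0 outside.
        inside = distance_transform_edt(mask)       # distance to the outline, inside pixels
        outside = distance_transform_edt(1 - mask)  # distance to the outline, outside pixels
        return outside - inside                     # signed distance from each pixel to the outline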
While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centered images, novel challenges arise with the nuanced task of "identity fine editing": precisely modifying specific features of a subject while maintaining its inherent identity and context. Existing personalization methods either require time-consuming optimization or learn additional encoders, and are adept at "identity re-contextualization". However, they often struggle with detailed and sensitive tasks like human face editing. To address these challenges, we introduce DreamSalon, a noise-guided, staged-editing framework that uniquely focuses on detailed image manipulation and identity-context preservation. By discerning editing and boosting stages via the frequency and gradient of predicted noises, DreamSalon first performs detailed manipulations of specific features in the editing stage, guided by high-frequency information, and then employs stochastic denoising in the boosting stage to improve image quality. For more precise editing, DreamSalon semantically mixes source and target textual prompts, guided by differences in their embedding covariances, to direct the model's focus to specific manipulation areas. Our experiments demonstrate DreamSalon's ability to efficiently and faithfully edit fine details on human faces, outperforming existing methods both qualitatively and quantitatively.
https://arxiv.org/abs/2403.19235
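One way to operationalize DreamSalon's stage discernment "via the frequency of predicted noises" is to measure how much of the predicted noise's spectral energy lies above a radial cutoff; this sketch is our own interpretation, not the paper's exact criterion:

    import torch

    def high_freq_ratio(eps: torch.Tensor, cutoff: float = 0.5) -> float:
        # eps: predicted noise (C, H, W). Returns the fraction of spectral energy
        # above a radial cutoff; a high ratio would suggest the detail-editing
        # stage, a low one the stochastic-denoising boosting stage.
        C, H, W = eps.shape
        spectrum = torch.fft.fftshift(torch.fft.fft2(eps), dim=(-2, -1)).abs()
        yy, xx = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        radius = (xx ** 2 + yy ** 2).sqrt()
        return (spectrum[:, radius > cutoff].sum() / spectrum.sum()).item()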
Diffusion models have revolutionized image synthesis, setting new benchmarks in quality and creativity. However, their widespread adoption is hindered by the intensive computation required during the iterative denoising process. Post-training quantization (PTQ) presents a solution to accelerate sampling, albeit at the expense of sample quality, especially in low-bit settings. Addressing this, our study introduces a unified Quantization Noise Correction Scheme (QNCD), aimed at minimizing quantization noise throughout the sampling process. We identify two primary quantization challenges: intra and inter quantization noise. Intra quantization noise, mainly exacerbated by embeddings in the resblock module, extends activation quantization ranges, increasing disturbances in each single denoising step. Inter quantization noise, in turn, stems from cumulative quantization deviations across the entire denoising process, altering data distributions step by step. QNCD combats these through embedding-derived feature smoothing, which eliminates intra quantization noise, and an effective runtime noise estimation module, which dynamically filters inter quantization noise. Extensive experiments demonstrate that our method outperforms previous quantization methods for diffusion models, achieving lossless results in W4A8 and W8A8 quantization settings on ImageNet (LDM-4). Code is available at: this https URL
https://arxiv.org/abs/2403.19140
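For reference, the quantization noise QNCD targets is the rounding error introduced by post-training quantization; a generic symmetric uniform quantize-dequantize sketch (not QNCD itself) makes the mechanism concrete:

    import torch

    def fake_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
        # x - fake_quantize(x) is the quantization noise. Wide activation ranges
        # (e.g. inflated by resblock embeddings) enlarge the scale and the noise.
        qmax = 2 ** (n_bits - 1) - 1
        scale = x.abs().max() / qmax
        return (x / scale).round().clamp(-qmax - 1, qmax) * scale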
Our work addresses limitations seen in previous approaches for object-centric editing problems, such as unrealistic results due to shape discrepancies and limited control in object replacement or insertion. To this end, we introduce FlexEdit, a flexible and controllable editing framework for objects where we iteratively adjust latents at each denoising step using our FlexEdit block. Initially, we optimize latents at test time to align with specified object constraints. Then, our framework employs an adaptive mask, automatically extracted during denoising, to protect the background while seamlessly blending new content into the target image. We demonstrate the versatility of FlexEdit in various object editing tasks and curate an evaluation test suite with samples from both real and synthetic images, along with novel evaluation metrics designed for object-centric editing. We conduct extensive experiments on different editing scenarios, demonstrating the superiority of our editing framework over recent advanced text-guided image editing methods. Our project page is published at this https URL.
https://arxiv.org/abs/2403.18605
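FlexEdit's background protection reduces to a per-step convex blend of latents under the adaptive mask; a minimal sketch (tensor shapes and names are our assumptions):

    import torch

    def blend_latents(edited: torch.Tensor, source: torch.Tensor,
                      mask: torch.Tensor) -> torch.Tensor:
        # mask: (1, 1, H, W) in [0, 1], 1 inside the edited object region.
        # Keeps the background from the source latent while blending in new content.
        return mask * edited + (1 - mask) * source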
Voice assistants are now widely available, and a keyword spotting (KWS) algorithm is used to activate them. Modern KWS systems are mainly trained using supervised learning methods and require a large amount of labelled data to achieve good performance. Leveraging unlabelled data through self-supervised learning (SSL) has been shown to increase accuracy in clean conditions. This paper explores how SSL pretraining, such as Data2Vec, can be used to enhance the robustness of KWS models in noisy conditions, an area that remains under-explored. Models of three different sizes are pretrained using different pretraining approaches and then fine-tuned for KWS. These models are then tested and compared to models trained using two baseline supervised learning methods, one being standard training on clean data and the other being multi-style training (MTR). The results show that pretraining and fine-tuning on clean data is superior to supervised learning on clean data across all testing conditions, and superior to supervised MTR for testing conditions with SNR above 5 dB. This indicates that pretraining alone can increase the model's robustness. Finally, it is found that using noisy data for pretraining, especially with the Data2Vec-denoising approach, significantly enhances the robustness of KWS models in noisy conditions.
https://arxiv.org/abs/2403.18560
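The multi-style training baseline boils down to mixing each clean utterance with noise at a sampled signal-to-noise ratio; a minimal sketch:

    import numpy as np

    def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        # Scale the noise so the mixture has the requested SNR.
        noise = noise[:len(speech)]
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + gain * noise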
Image style transfer aims to imbue digital imagery with the distinctive attributes of style targets, such as colors, brushstrokes, and shapes, whilst concurrently preserving the semantic integrity of the content. Despite the advancements in arbitrary style transfer methods, a prevalent challenge remains the delicate equilibrium between content semantics and style attributes. Recent developments in large-scale text-to-image diffusion models have heralded unprecedented synthesis capabilities, albeit at the expense of relying on extensive and often imprecise textual descriptions to delineate artistic styles. Addressing these limitations, this paper introduces DiffStyler, a novel approach that facilitates efficient and precise arbitrary image style transfer. At the core of DiffStyler lies the utilization of a LoRA, based on a text-to-image Stable Diffusion model, to encapsulate the essence of style targets. This approach, coupled with strategic cross-LoRA feature and attention injection, guides the style transfer process. The foundation of our methodology is rooted in the observation that LoRA maintains the spatial feature consistency of the UNet, a discovery that further inspired the development of a mask-wise style transfer technique. This technique employs masks extracted through a pre-trained FastSAM model, utilizing mask prompts to facilitate feature fusion during the denoising process, thereby enabling localized style transfer that preserves the original image's unaffected regions. Moreover, our approach accommodates multiple style targets through the use of corresponding masks. Through extensive experimentation, we demonstrate that DiffStyler surpasses previous methods in achieving a more harmonious balance between content preservation and style integration.
https://arxiv.org/abs/2403.18461
There are five types of trajectory prediction tasks: deterministic, stochastic, domain adaptation, momentary observation, and few-shot. These associated tasks are defined by various factors, such as the length of input paths, data splits, and pre-processing methods. Interestingly, even though they commonly take sequential coordinates of observations as input and infer future paths in the same coordinates as output, designing specialized architectures for each task is still necessary; applying a model designed for one task to another typically leads to sub-optimal performance due to generality issues. In this paper, we propose SingularTrajectory, a diffusion-based universal trajectory prediction framework to reduce the performance gap across the five tasks. The core of SingularTrajectory is to unify a variety of human dynamics representations across the associated tasks. To do this, we first build a Singular space to project all types of motion patterns from each task into one embedding space. We next propose an adaptive anchor working in the Singular space. Unlike traditional fixed anchor methods that sometimes yield unacceptable paths, our adaptive anchor corrects anchors that are placed in wrong locations, based on a traversability map. Finally, we adopt a diffusion-based predictor to further enhance the prototype paths using a cascaded denoising process. Our unified framework ensures generality across various benchmark settings such as input modality and trajectory length. Extensive experiments on five public benchmarks demonstrate that SingularTrajectory substantially outperforms existing models, highlighting its effectiveness in estimating the general dynamics of human movements. Code is publicly available at this https URL.
https://arxiv.org/abs/2403.18452
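A minimal sketch of projecting heterogeneous motion patterns into one shared embedding space; we assume the Singular space is spanned by the top-k right singular vectors of the stacked, centered trajectories, which is our reading rather than the paper's exact construction:

    import numpy as np

    def build_singular_space(trajs: np.ndarray, k: int = 4) -> np.ndarray:
        # trajs: (N, T*2) flattened 2D trajectories pooled across the five tasks.
        _, _, Vt = np.linalg.svd(trajs - trajs.mean(0), full_matrices=False)
        return Vt[:k]                  # (k, T*2) shared basis

    def project(traj: np.ndarray, basis: np.ndarray) -> np.ndarray:
        # traj assumed already centered; returns k coefficients in the Singular space.
        return basis @ traj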
Diffusion models have demonstrated remarkable performance in text-to-image synthesis, producing realistic and high-resolution images that faithfully adhere to the corresponding text prompts. Despite their great success, they still fall behind in sketch-to-image synthesis tasks, where, in addition to text prompts, the spatial layout of the generated images has to closely follow the outlines of certain reference sketches. Employing an MLP latent edge predictor to guide the spatial layout of the synthesized image by predicting edge maps at each denoising step has recently been proposed. Despite yielding promising results, the pixel-wise operation of the MLP does not take the spatial layout into account as a whole, and it demands numerous denoising iterations to produce satisfactory images, leading to time inefficiency. To this end, we introduce U-Sketch, a framework featuring a U-Net type latent edge predictor, which is capable of efficiently capturing both local and global features, as well as spatial correlations between pixels. Moreover, we propose the addition of a sketch simplification network that offers the user the choice of preprocessing and simplifying input sketches for enhanced outputs. The experimental results, corroborated by user feedback, demonstrate that our proposed U-Net latent edge predictor leads to more realistic results that are better aligned with the spatial outlines of the reference sketches, while drastically reducing the number of required denoising steps and, consequently, the overall execution time.
https://arxiv.org/abs/2403.18425
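Guiding the layout with a latent edge predictor can be pictured as a gradient nudge on the latent toward the reference sketch at each denoising step; this is a generic guidance sketch under our assumptions, not U-Sketch's exact update rule:

    import torch
    import torch.nn.functional as F

    def edge_guidance_step(latent, edge_predictor, ref_edges, step_size=0.1):
        # Move the latent so the predicted edge map better matches the
        # (possibly simplified) reference sketch.
        latent = latent.detach().requires_grad_(True)
        loss = F.mse_loss(edge_predictor(latent), ref_edges)
        loss.backward()
        return (latent - step_size * latent.grad).detach()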
Conditional text-to-image diffusion models have garnered significant attention in recent years. However, the precision of these models is often compromised, mainly for two reasons: ambiguous condition inputs and inadequate condition guidance from a single denoising loss. To address these challenges, we introduce two innovative solutions. Firstly, we propose a Spatial Guidance Injector (SGI) which enhances conditional detail by encoding text inputs with precise annotation information. This method directly tackles the issue of ambiguous control inputs by providing clear, annotated guidance to the model. Secondly, to overcome the issue of limited conditional supervision, we introduce a Diffusion Consistency Loss (DCL), which applies supervision to the denoised latent code at any given timestep. This encourages consistency between the latent code at each timestep and the input signal, thereby enhancing the robustness and accuracy of the output. The combination of SGI and DCL results in our Effective Controllable Network (ECNet), which offers a more accurate and controllable end-to-end text-to-image generation framework with more precise conditioning inputs and stronger controllable supervision. We validate our approach through extensive experiments on generation under various conditions, such as human body skeletons, facial landmarks, and sketches of general objects. The results consistently demonstrate that our method significantly enhances the controllability and robustness of the generated images, outperforming existing state-of-the-art controllable text-to-image models.
https://arxiv.org/abs/2403.18417
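Supervising the denoised latent at an arbitrary timestep follows from the standard DDPM identity x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps; a minimal sketch of our reading of the Diffusion Consistency Loss:

    import torch
    import torch.nn.functional as F

    def predicted_x0(x_t, eps_pred, alpha_bar_t):
        # Invert the forward process to recover the denoised latent code.
        return (x_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)

    def diffusion_consistency_loss(x_t, eps_pred, alpha_bar_t, x0_target):
        # Supervision on the denoised latent code at any given timestep.
        return F.mse_loss(predicted_x0(x_t, eps_pred, alpha_bar_t), x0_target)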
In recent years, remarkable advancements have been achieved in the field of image generation, primarily driven by the escalating demand for high-quality outcomes across various image generation subtasks, such as inpainting, denoising, and super-resolution. A major effort is devoted to exploring the application of super-resolution techniques to enhance the quality of low-resolution images. In this context, our method explores in depth the problem of ship image super-resolution, which is crucial for coastal and port surveillance. We investigate the opportunity given by the growing interest in text-to-image diffusion models, taking advantage of the prior knowledge that such foundation models have already learned. In particular, we present a diffusion-model-based architecture that leverages text conditioning during training while being class-aware, to best preserve the crucial details of the ships during the generation of the super-resolved image. Given the specificity of this task and the scarce availability of off-the-shelf data, we also introduce a large labeled ship dataset scraped from online ship images, mostly from the ShipSpotting website (this http URL). Our method achieves more robust results than other deep learning models previously employed for super-resolution, as demonstrated by multiple experiments. Moreover, we investigate how this model can benefit downstream tasks, such as classification and object detection, thus emphasizing practical implementation in a real-world scenario. Experimental results show the flexibility, reliability, and impressive performance of the proposed framework over state-of-the-art methods for different tasks. The code is available at: this https URL.
https://arxiv.org/abs/2403.18370
The quality of images captured outdoors is often affected by the weather. One factor that interferes with sight is rain, which can obstruct the view of observers and of computer vision applications that rely on those images. This work aims to recover rain-degraded images by removing rain streaks via Self-supervised Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain-streak pixels in the input rain image via dictionary learning and use pixel-wise RL agents to take multiple inpainting actions to remove the rain progressively. To our knowledge, this work is the first attempt to apply self-supervised RL to image deraining. Experimental results on several benchmark image-deraining datasets show that the proposed SRL-Derain performs favorably against state-of-the-art few-shot and self-supervised deraining and denoising methods.
https://arxiv.org/abs/2403.18270
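The pixel-wise agents can be pictured as each choosing one inpainting action per step; the action set below (keep, median, mean) is illustrative only, not the paper's toolbox:

    import numpy as np
    from scipy.ndimage import median_filter, uniform_filter

    def apply_pixel_actions(img: np.ndarray, actions: np.ndarray) -> np.ndarray:
        # img: (H, W); actions: (H, W) ints in {0, 1, 2} chosen by the RL agents.
        candidates = np.stack([img, median_filter(img, 3), uniform_filter(img, 3)])
        rows = np.arange(img.shape[0])[:, None]
        cols = np.arange(img.shape[1])[None, :]
        return candidates[actions, rows, cols]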
Finding a suitable layout represents a crucial task for diverse applications in graphic design. Motivated by simpler and smoother sampling trajectories, we explore the use of Flow Matching as an alternative to current diffusion-based layout generation models. Specifically, we propose LayoutFlow, an efficient flow-based model capable of generating high-quality layouts. Instead of progressively denoising the elements of a noisy layout, our method learns to gradually move, or flow, the elements of an initial sample until it reaches its final prediction. In addition, we employ a conditioning scheme that allows us to handle various generation tasks with varying degrees of conditioning with a single model. Empirically, LayoutFlow performs on par with state-of-the-art models while being significantly faster.
https://arxiv.org/abs/2403.18187
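The flow-matching objective LayoutFlow builds on is compact enough to state directly: sample a time, interpolate between noise and the data layout, and regress the velocity; a minimal conditional-flow-matching sketch, assuming a linear path:

    import torch
    import torch.nn.functional as F

    def flow_matching_loss(model, x1, cond):
        # x1: (B, N, D) target layouts (N elements with D geometric attributes).
        x0 = torch.randn_like(x1)            # initial sample
        t = torch.rand(x1.shape[0], 1, 1)    # one time per layout
        xt = (1 - t) * x0 + t * x1           # point on the linear path
        v_target = x1 - x0                   # its constant velocity
        return F.mse_loss(model(xt, t, cond), v_target)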
Three-dimensional data registration is an established yet challenging problem that is key in many different applications, such as mapping the environment for autonomous vehicles and modeling objects and people for avatar creation, among many others. Registration refers to the process of mapping multiple data into the same coordinate system by means of matching correspondences and transformation estimation. Novel proposals exploit the benefits of deep learning architectures for this purpose, as they learn the best features for the data, providing better matches and hence better results. However, the state of the art is usually focused on cases of relatively small transformations, although in certain applications and in real, practical environments, large transformations are very common. In this paper, we present ReLaTo (Registration for Large Transformations), an architecture that handles cases where large transformations occur while maintaining good performance for local transformations. This proposal uses a novel Softmax pooling layer to find correspondences in a bilateral consensus manner between two point sets, sampling the most confident matches. These matches are used to estimate a coarse, global registration using weighted Singular Value Decomposition (SVD). A target-guided denoising step is then applied to both the obtained matches and the latent features, estimating the final fine registration while considering the local geometry. All these steps are carried out in an end-to-end manner, and the approach has been shown to outperform 10 state-of-the-art registration methods on two datasets commonly used for this task (ModelNet40 and KITTI), especially in the case of large transformations.
https://arxiv.org/abs/2403.18040
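The coarse registration step, weighted SVD over the sampled correspondences, is the classic weighted Kabsch solution; a minimal sketch:

    import torch

    def weighted_svd_registration(P, Q, w):
        # P, Q: (N, 3) matched points; w: (N,) match confidences (e.g. from the
        # Softmax pooling layer). Returns R, t such that Q ~ R @ P + t.
        w = w / w.sum()
        mu_p = (w[:, None] * P).sum(0)
        mu_q = (w[:, None] * Q).sum(0)
        H = (P - mu_p).T @ (w[:, None] * (Q - mu_q))  # weighted cross-covariance
        U, _, Vt = torch.linalg.svd(H)
        d = torch.det(Vt.T @ U.T)                     # guard against reflections
        D = torch.diag(torch.tensor([1.0, 1.0, float(torch.sign(d))]))
        R = Vt.T @ D @ U.T
        t = mu_q - R @ mu_p
        return R, t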
Diffusion models (DMs) are capable of generating remarkably high-quality samples by iteratively denoising a random vector, a process that corresponds to moving along the probability flow ordinary differential equation (PF ODE). Interestingly, DMs can also invert an input image to noise by moving backward along the PF ODE, a key operation for downstream tasks such as interpolation and image editing. However, the iterative nature of this process restricts its speed, hindering its broader application. Recently, Consistency Models (CMs) have emerged to address this challenge by approximating the integral of the PF ODE, thereby bypassing the need to iterate. Yet, the absence of an explicit ODE solver complicates the inversion process. To resolve this, we introduce the Bidirectional Consistency Model (BCM), which learns a single neural network that enables both forward and backward traversal along the PF ODE, efficiently unifying generation and inversion tasks within one framework. Notably, our proposed method enables one-step generation and inversion while also allowing the use of additional steps to enhance generation quality or reduce reconstruction error. Furthermore, by leveraging our model's bidirectional consistency, we introduce a sampling strategy that can enhance FID while preserving the generated image content. We further showcase our model's capabilities in several downstream tasks, such as interpolation and inpainting, and present demonstrations of potential applications, including blind restoration of compressed images and defending against black-box adversarial attacks.
https://arxiv.org/abs/2403.18035
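BCM's single network can be thought of as a map f(x, t, u) that jumps from time t to time u along the PF ODE, so generation and inversion are the same call with the endpoints swapped; a usage sketch under our assumptions (the paper's exact parameterization may differ):

    import torch

    def traverse(f, x, times):
        # Chain jumps through a list of times with one network f(x, t, u).
        # Decreasing times -> generation; increasing times -> inversion.
        # A single pair, e.g. times = [T, 0.0], gives one-step generation.
        for t, u in zip(times[:-1], times[1:]):
            x = f(x, torch.tensor(t), torch.tensor(u))
        return x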
We propose a new approach for non-Cartesian magnetic resonance image reconstruction. While unrolled architectures provide robustness via data-consistency layers, embedding measurement operators in a Deep Neural Network (DNN) can become impractical at large scale. Alternative Plug-and-Play (PnP) approaches, where the denoising DNNs are blind to the measurement setting, are not affected by this limitation and have also proven effective, but their highly iterative nature affects scalability as well. To address this scalability challenge, we leverage the "Residual-to-Residual DNN series for high-Dynamic range imaging (R2D2)" approach recently introduced in astronomical imaging. R2D2's reconstruction is formed as a series of residual images, iteratively estimated as outputs of DNNs taking the previous iteration's image estimate and associated data residual as inputs. The method can be interpreted as a learned version of the Matching Pursuit algorithm. We demonstrate R2D2 in simulation, considering radial k-space sampling acquisition sequences. Our preliminary results suggest that R2D2 achieves: (i) suboptimal performance compared to its unrolled incarnation R2D2-Net, which is however non-scalable due to the necessary embedding of NUFFT-based data-consistency layers; (ii) superior reconstruction quality to a scalable version of R2D2-Net embedding an FFT-based approximation for data consistency; and (iii) superior reconstruction quality to PnP, while requiring only a few iterations.
https://arxiv.org/abs/2403.17905
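The R2D2 series reads as a short loop: each network refines the current estimate from the back-projected data residual; a sketch under our assumptions (operator and network interfaces are placeholders):

    import torch

    def r2d2_reconstruct(networks, y, forward_op, adjoint_op):
        # y: measured k-space data; forward_op / adjoint_op: the (NU)FFT-based
        # measurement operator and its adjoint; networks: the trained DNN series.
        x = adjoint_op(y)                          # dirty image as initialization
        for net in networks:
            r = adjoint_op(y - forward_op(x))      # data residual in image space
            x = x + net(torch.cat([x, r], dim=1))  # next residual image
        return x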
Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all prior samples. Instead of simply applying a moving average to the denoised samples at different timesteps, we first map the denoised samples to data space and then perform the moving average, to avoid distribution shift across timesteps. Given that diffusion models evolve the recovery from low-frequency components to high-frequency details, we further decompose the samples into different frequency components and execute the moving average separately on each component. We name the complete approach "Moving Average Sampling in Frequency domain (MASF)". MASF can be seamlessly integrated into mainstream pre-trained diffusion models and sampling schedules. Extensive experiments on both unconditional and conditional diffusion models demonstrate that our MASF leads to superior performance compared to the baselines, with an almost negligible additional complexity cost.
https://arxiv.org/abs/2403.17870
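Concretely, the frequency-wise moving average can be sketched with two bands: FFT the denoised sample mapped to data space, EMA each band separately, and invert; the band count, cutoff, and decay below are our assumptions:

    import torch

    def masf_update(ema_bands, x0_pred, beta=0.9, cutoff=0.25):
        # x0_pred: (H, W) denoised sample mapped to data space at this timestep.
        spec = torch.fft.fftshift(torch.fft.fft2(x0_pred), dim=(-2, -1))
        H, W = x0_pred.shape[-2:]
        yy, xx = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        low = (xx ** 2 + yy ** 2).sqrt() <= cutoff
        bands = [spec * low, spec * ~low]          # low- and high-frequency parts
        if ema_bands is None:
            ema_bands = bands
        ema_bands = [beta * e + (1 - beta) * b for e, b in zip(ema_bands, bands)]
        merged = torch.fft.ifftshift(sum(ema_bands), dim=(-2, -1))
        return ema_bands, torch.fft.ifft2(merged).real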
The segmentation and tracking of living cells play a vital role within the biomedical domain, particularly in cancer research, drug development, and developmental biology. These are usually tedious and time-consuming tasks that are traditionally performed by biomedical experts. Recently, deep-learning-based segmentation and tracking methods have been proposed to automate these processes. These methods require large-scale datasets, and their full potential is constrained by the scarcity of annotated data in the biomedical imaging domain. To address this limitation, we propose the Biomedical Video Diffusion Model (BVDM), capable of generating realistic-looking synthetic microscopy videos. Trained on only a single real video, BVDM can generate videos of arbitrary length with pixel-level annotations that can be used for training data-hungry models. It is composed of a denoising diffusion probabilistic model (DDPM) generating high-fidelity synthetic cell microscopy images and a flow prediction model (FPM) predicting the non-rigid transformation between consecutive video frames. During inference, the DDPM first imposes realistic cell textures on synthetic cell masks that are generated based on real data statistics. The flow prediction model then predicts the flow field between consecutive masks and applies it to the DDPM output from the previous time frame to create the next one while maintaining temporal consistency. BVDM outperforms state-of-the-art synthetic live-cell microscopy video generation models. Furthermore, we demonstrate that a sufficiently large synthetic dataset enhances the performance of cell segmentation and tracking models compared to using a limited amount of available real data.
https://arxiv.org/abs/2403.17808
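BVDM's temporal consistency hinges on warping the previous frame's DDPM output with the predicted flow; a minimal backward-warping sketch (shape conventions are our assumptions):

    import torch
    import torch.nn.functional as F

    def warp_with_flow(prev_frame, flow):
        # prev_frame: (1, C, H, W); flow: (1, 2, H, W) pixel offsets (x, y).
        _, _, H, W = prev_frame.shape
        yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        grid = torch.stack((xx, yy), dim=-1).float()[None] + flow.permute(0, 2, 3, 1)
        grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1  # normalize x to [-1, 1]
        grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1  # normalize y to [-1, 1]
        return F.grid_sample(prev_frame, grid, align_corners=True)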
We present GenesisTex, a novel method for synthesizing textures for 3D geometries from text descriptions. GenesisTex adapts the pretrained image diffusion model to texture space by texture space sampling. Specifically, we maintain a latent texture map for each viewpoint, which is updated with predicted noise on the rendering of the corresponding viewpoint. The sampled latent texture maps are then decoded into a final texture map. During the sampling process, we focus on both global and local consistency across multiple viewpoints: global consistency is achieved through the integration of style consistency mechanisms within the noise prediction network, and low-level consistency is achieved by dynamically aligning latent textures. Finally, we apply reference-based inpainting and img2img on denser views for texture refinement. Our approach overcomes the limitations of slow optimization in distillation-based methods and instability in inpainting-based methods. Experiments on meshes from various sources demonstrate that our method surpasses the baseline methods quantitatively and qualitatively.
https://arxiv.org/abs/2403.17782