3D Gaussian Splatting (GS) has achieved considerable improvement over Neural Radiance Fields in terms of 3D fitting fidelity and rendering speed. However, this unstructured representation with scattered Gaussians poses a significant challenge for generative modeling. To address this problem, we introduce GaussianCube, a structured GS representation that is both powerful and efficient for generative modeling. We achieve this by first proposing a modified, densification-constrained GS fitting algorithm that yields high-quality fitting results using a fixed number of free Gaussians, and then re-arranging the Gaussians into a predefined voxel grid via Optimal Transport. The structured grid representation allows us to use a standard 3D U-Net as the backbone in diffusion generative modeling without elaborate designs. Extensive experiments conducted on ShapeNet and OmniObject3D show that our model achieves state-of-the-art generation results both qualitatively and quantitatively, underscoring the potential of GaussianCube as a powerful and versatile 3D representation.
https://arxiv.org/abs/2403.19655
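The abstract describes the Optimal Transport re-arrangement only at a high level. Below is a minimal sketch of the idea, assuming one Gaussian per voxel cell, a grid spanning [-1, 1]^3, and a squared-distance cost solved with the Hungarian algorithm as a stand-in for a full OT solver; none of these choices are confirmed by the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def arrange_gaussians_into_grid(centers, grid_res=8):
    """Assign N = grid_res**3 Gaussian centers to voxel-grid cells.

    A stand-in for the paper's Optimal Transport step: solve a balanced
    assignment minimizing total squared distance between Gaussian centers
    and voxel centers.
    """
    n = grid_res ** 3
    assert centers.shape == (n, 3), "expects one Gaussian per voxel cell"
    # Voxel centers of a grid_res^3 lattice spanning [-1, 1]^3 (assumption).
    lin = (np.arange(grid_res) + 0.5) / grid_res * 2.0 - 1.0
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), -1).reshape(n, 3)
    # Pairwise squared-distance cost and balanced one-to-one assignment.
    cost = ((centers[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    gauss_idx, voxel_idx = linear_sum_assignment(cost)
    order = np.empty(n, dtype=int)
    order[voxel_idx] = gauss_idx  # which Gaussian lands in each voxel cell
    return order, grid

# Example: 512 random Gaussian centers mapped onto an 8^3 grid.
order, grid = arrange_gaussians_into_grid(np.random.uniform(-1, 1, (512, 3)))
```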
Modern text-to-image (T2I) diffusion models can generate images with remarkable realism and creativity. These advancements have sparked research in fake image detection and attribution, yet prior studies have not fully explored the practical and scientific dimensions of this task. In addition to attributing images to 12 state-of-the-art T2I generators, we provide extensive analyses of which inference-stage hyperparameters and image modifications are discernible. Our experiments reveal that initialization seeds are highly detectable, and that other subtle variations in the image generation process are detectable to some extent. We further investigate which visual traces are leveraged in image attribution by perturbing high-frequency details and employing mid-level representations of image style and structure. Notably, altering high-frequency information causes only slight reductions in accuracy, and training an attributor on style representations outperforms training on RGB images. Our analyses underscore that fake images are detectable and attributable at more levels of visual granularity than previously explored.
https://arxiv.org/abs/2403.19653
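The finding that style representations outperform raw RGB inputs suggests an easy-to-reproduce baseline. The sketch below assumes a VGG Gram-matrix as the style representation and a linear head over 12 generator classes; the paper's actual style features and attributor architecture are not specified here.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# Hypothetical attributor: Gram-matrix "style" features feeding a linear
# classifier over 12 generator classes. Load ImageNet weights in practice
# for meaningful style statistics.
backbone = vgg16(weights=None).features[:16].eval()  # up to relu3_3 (256 ch.)

def gram_features(images):                # images: (B, 3, H, W) in [0, 1]
    with torch.no_grad():
        f = backbone(images)              # (B, C, h, w)
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    gram = f @ f.transpose(1, 2) / (c * h * w)  # (B, C, C) style statistics
    return gram.reshape(b, -1)

attributor = nn.Linear(256 * 256, 12)
logits = attributor(gram_features(torch.rand(4, 3, 224, 224)))
```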
Text-conditioned human motion generation has experienced significant advancements with diffusion models trained on extensive motion capture data and corresponding textual annotations. However, extending such success to 3D dynamic human-object interaction (HOI) generation faces notable challenges, primarily due to the lack of large-scale interaction data and comprehensive descriptions that align with these interactions. This paper takes the initiative and showcases the potential of generating human-object interactions without direct training on text-interaction pair data. Our key insight in achieving this is that interaction semantics and dynamics can be decoupled. Being unable to learn interaction semantics through supervised training, we instead leverage pre-trained large models, synergizing knowledge from a large language model and a text-to-motion model. While such knowledge offers high-level control over interaction semantics, it cannot grasp the intricacies of low-level interaction dynamics. To overcome this issue, we further introduce a world model designed to comprehend simple physics, modeling how human actions influence object motion. By integrating these components, our novel framework, InterDreamer, is able to generate text-aligned 3D HOI sequences in a zero-shot manner. We apply InterDreamer to the BEHAVE and CHAIRS datasets, and our comprehensive experimental analysis demonstrates its capability to generate realistic and coherent interaction sequences that seamlessly align with the text directives.
https://arxiv.org/abs/2403.19652
The rapid advancement in image generation models has predominantly been driven by diffusion models, which have demonstrated unparalleled success in generating high-fidelity, diverse images from textual prompts. Despite their success, diffusion models encounter substantial challenges in the domain of image editing, particularly in executing disentangled edits: changes that target specific attributes of an image while leaving irrelevant parts untouched. In contrast, Generative Adversarial Networks (GANs) have been recognized for their success in disentangled edits through their interpretable latent spaces. We introduce GANTASTIC, a novel framework that takes existing directions from pre-trained GAN models, each representative of a specific, controllable attribute, and transfers these directions into diffusion-based models. This novel approach not only maintains the generative quality and diversity that diffusion models are known for but also significantly enhances their capability to perform precise, targeted image edits, thereby leveraging the best of both worlds.
https://arxiv.org/abs/2403.19645
Text-to-image (T2I) generative models have recently emerged as a powerful tool, enabling the creation of photo-realistic images and giving rise to a multitude of applications. However, the effective integration of T2I models into fundamental image classification tasks remains an open question. A prevalent strategy to bolster image classification performance is to augment the training set with synthetic images generated by T2I models. In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques. Our analysis reveals that these methods struggle to produce images that are both faithful (in terms of foreground objects) and diverse (in terms of background contexts) for domain-specific concepts. To tackle this challenge, we introduce an innovative inter-class data augmentation method known as Diff-Mix (this https URL), which enriches the dataset by performing image translations between classes. Our empirical results demonstrate that Diff-Mix achieves a better balance between faithfulness and diversity, leading to a marked improvement in performance across diverse image classification scenarios, including few-shot, conventional, and long-tail classification on domain-specific datasets.
https://arxiv.org/abs/2403.19600
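Diff-Mix's inter-class translation is described only at a high level. The sketch below shows what translating a training image toward another class could look like with an off-the-shelf image-to-image diffusion pipeline; the checkpoint name, prompt template, and strength value are assumptions, and the paper fine-tunes its model on the target dataset rather than using a stock checkpoint.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Hypothetical sketch of inter-class image translation for augmentation.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def translate(image: Image.Image, target_class: str, strength: float = 0.5):
    # Moderate strength keeps the source layout (diversity) while the prompt
    # pulls the foreground toward the target class (faithfulness).
    prompt = f"a photo of a {target_class}"
    return pipe(prompt=prompt, image=image, strength=strength).images[0]

augmented = translate(Image.open("cardinal.jpg").convert("RGB"), "blue jay")
```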
Building on the momentum of image generation diffusion models, there is an increasing interest in video-based diffusion models. However, video generation poses greater challenges due to its higher-dimensional nature, the scarcity of training data, and the complex spatiotemporal relationships involved. Image generation models, due to their extensive data requirements, have already strained computational resources to their limits. There have been instances of these models reproducing elements from the training samples, leading to concerns and even legal disputes over sample replication. Video diffusion models, which operate with even more constrained datasets and are tasked with generating both spatial and temporal content, may be more prone to replicating samples from their training sets. Compounding the issue, these models are often evaluated using metrics that inadvertently reward replication. In our paper, we present a systematic investigation into the phenomenon of sample replication in video diffusion models. We scrutinize various recent diffusion models for video synthesis, assessing their tendency to replicate spatial and temporal content in both unconditional and conditional generation scenarios. Our study identifies strategies that are less likely to lead to replication. Furthermore, we propose new evaluation strategies that take replication into account, offering a more accurate measure of a model's ability to generate original content.
https://arxiv.org/abs/2403.19593
We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). Despite being trained only on language, we show that these Transformers excel at translating tokenised visual keypoint observations into action trajectories, performing on par with or better than state-of-the-art imitation learning (diffusion policies) in the low-data regime on a suite of real-world, everyday tasks. Rather than operating in the language domain as is typical, KAT leverages text-based Transformers to operate in the vision and action domains to learn general patterns in demonstration data for highly efficient imitation learning, indicating promising new avenues for repurposing natural language models for embodied tasks. Videos are available at this https URL.
https://arxiv.org/abs/2403.19578
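The KAT idea, serializing keypoints and action trajectories as text so a language-pretrained model can do in-context imitation, can be sketched as prompt construction. The token format below and the `complete` callable (standing in for any text-completion API such as GPT-4 Turbo) are assumptions, not the paper's exact tokenisation.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]  # (x, y, z)

def to_tokens(points: List[Point], decimals: int = 3) -> str:
    # Serialize 3D keypoints / trajectory waypoints as plain text tokens.
    return " ".join(",".join(f"{v:.{decimals}f}" for v in p) for p in points)

def build_prompt(demos: List[tuple], query_keypoints: List[Point]) -> str:
    # Few-shot prompt: each demo maps observed keypoints -> action trajectory.
    lines = ["Map visual keypoints to an action trajectory."]
    for obs, act in demos:
        lines.append(f"keypoints: {to_tokens(obs)}\nactions: {to_tokens(act)}")
    lines.append(f"keypoints: {to_tokens(query_keypoints)}\nactions:")
    return "\n\n".join(lines)

def predict_actions(complete: Callable[[str], str], demos, query) -> List[Point]:
    # Assumes the completion returns tokens in the same "x,y,z x,y,z ..." format.
    text = complete(build_prompt(demos, query))
    return [tuple(float(v) for v in tok.split(",")) for tok in text.split()]
```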
The progress in deep learning solutions for disease diagnosis and prognosis based on cardiac magnetic resonance imaging is hindered by highly imbalanced and biased training data. To address this issue, we propose a method to alleviate imbalances inherent in datasets through the generation of synthetic data based on sensitive attributes such as sex, age, body mass index, and health condition. We adopt ControlNet based on a denoising diffusion probabilistic model to condition on text assembled from patient metadata and cardiac geometry derived from segmentation masks using a large-cohort study, specifically, the UK Biobank. We assess our method by evaluating the realism of the generated images using established quantitative metrics. Furthermore, we conduct a downstream classification task aimed at debiasing a classifier by rectifying imbalances within underrepresented groups through synthetically generated samples. Our experiments demonstrate the effectiveness of the proposed approach in mitigating dataset imbalances, such as the scarcity of younger patients or of individuals with a normal BMI who suffer from heart failure. This work represents a major step towards the adoption of synthetic data for the development of fair and generalizable models for medical classification tasks. Notably, we conduct all our experiments using a single, consumer-level GPU to highlight the feasibility of our approach within resource-constrained environments. Our code is available at this https URL.
https://arxiv.org/abs/2403.19508
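One concrete piece of the described pipeline is assembling the conditioning text from patient metadata. A small sketch follows; the field names and phrasing are illustrative, not the paper's exact prompt template.

```python
def metadata_to_prompt(sex: str, age: int, bmi: float, condition: str) -> str:
    """Assemble a text condition from sensitive attributes (illustrative template)."""
    return (f"Cardiac MRI of a {age}-year-old {sex} subject, "
            f"BMI {bmi:.1f}, health condition: {condition}.")

# Example: a synthetic sample targeting an underrepresented group.
prompt = metadata_to_prompt("female", 32, 22.4, "heart failure")
```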
Generating human motion from text has been dominated by denoising motion models either through diffusion or generative masking process. However, these models face great limitations in usability by requiring prior knowledge of the motion length. Conversely, autoregressive motion models address this limitation by adaptively predicting motion endpoints, at the cost of degraded generation quality and editing capabilities. To address these challenges, we propose Bidirectional Autoregressive Motion Model (BAMM), a novel text-to-motion generation framework. BAMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and (2) a masked self-attention transformer that autoregressively predicts randomly masked tokens via a hybrid attention masking strategy. By unifying generative masked modeling and autoregressive modeling, BAMM captures rich and bidirectional dependencies among motion tokens, while learning the probabilistic mapping from textual inputs to motion outputs with dynamically-adjusted motion sequence length. This feature enables BAMM to simultaneously achieve high-quality motion generation with enhanced usability and built-in motion editability. Extensive experiments on HumanML3D and KIT-ML datasets demonstrate that BAMM surpasses current state-of-the-art methods in both qualitative and quantitative measures.
https://arxiv.org/abs/2403.19435
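A minimal sketch of the random masking applied to discrete motion tokens during masked generative training is shown below; the codebook size, reserved mask id, and masking ratio are assumptions, and the hybrid attention mask itself is not reproduced here.

```python
import torch

MASK_ID = 1024  # assumed id reserved for [MASK], given a 1024-entry codebook

def random_mask(tokens: torch.Tensor, mask_ratio: float):
    """Randomly replace motion tokens with [MASK] for masked generative training.

    tokens: (B, T) long tensor of codebook indices from the motion tokenizer.
    Returns the masked sequence and the boolean positions to be predicted.
    """
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    masked = tokens.clone()
    masked[mask] = MASK_ID
    return masked, mask

tokens = torch.randint(0, 1024, (2, 196))
masked, positions = random_mask(tokens, mask_ratio=0.5)
```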
While burst low-resolution (LR) images are useful for improving super-resolution (SR) image quality compared with a single LR image, prior SR networks accepting burst LR images are trained in a deterministic manner, which is known to produce blurry SR images. In addition, it is difficult to perfectly align the burst LR images, making the SR image even blurrier. Since such blurry images are perceptually degraded, we aim to reconstruct sharp, high-fidelity boundaries. Such high-fidelity images can be reconstructed by diffusion models. However, prior SR methods using diffusion models are not properly optimized for the burst SR task. Specifically, the reverse process starting from a random sample is not optimal for image enhancement and restoration tasks, including burst SR. In our proposed method, on the other hand, burst LR features are used to reconstruct an initial burst SR image that is fed into an intermediate step of the diffusion model. This reverse process from the intermediate step 1) skips diffusion steps for reconstructing the global structure of the image and 2) focuses on steps for refining detailed textures. Our experimental results demonstrate that our method can improve the scores of perceptual quality metrics. Code: this https URL
https://arxiv.org/abs/2403.19428
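The key idea, starting the reverse process from an intermediate step rather than pure noise, can be written down directly with the standard DDPM forward kernel: noise the initial burst SR estimate to timestep t_start and run only the remaining denoising steps. The schedule values and choice of t_start below are illustrative, and `denoise_step` stands in for one reverse step of a trained model.

```python
import torch

def start_from_intermediate(x0_est, denoise_step, alphas_cumprod, t_start):
    """Diffuse an initial SR estimate to t_start, then denoise back to step 0.

    x0_est: (B, C, H, W) initial burst-SR reconstruction.
    denoise_step(x_t, t) -> x_{t-1}: one reverse step of a trained model.
    alphas_cumprod: (T,) cumulative alpha-bar schedule.
    """
    a_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(x0_est)
    # Forward kernel q(x_t | x_0): sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps.
    x_t = a_bar.sqrt() * x0_est + (1.0 - a_bar).sqrt() * noise
    for t in range(t_start, 0, -1):  # steps t_start+1..T are skipped entirely
        x_t = denoise_step(x_t, t)
    return x_t
```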
Diffusion-weighted MRI (DWI) is essential for stroke diagnosis, treatment decisions, and prognosis. However, image and disease variability hinder the development of generalizable AI algorithms with clinical value. We address this gap by presenting a novel ensemble algorithm derived from the 2022 Ischemic Stroke Lesion Segmentation (ISLES) challenge. ISLES'22 provided 400 patient scans with ischemic stroke from various medical centers, facilitating the development of a wide range of cutting-edge segmentation algorithms by the research community. Through collaboration with leading teams, we combined top-performing algorithms into an ensemble model that overcomes the limitations of individual solutions. Our ensemble model achieved superior ischemic lesion detection and segmentation accuracy on our internal test set compared to individual algorithms. This accuracy generalized well across diverse image and disease variables. Furthermore, the model excelled in extracting clinical biomarkers. Notably, in a Turing-like test, neuroradiologists consistently preferred the algorithm's segmentations over manual expert efforts, highlighting increased comprehensiveness and precision. Validation using a real-world external dataset (N=1686) confirmed the model's generalizability. The algorithm's outputs also demonstrated strong correlations with clinical scores (admission NIHSS and 90-day mRS) on par with or exceeding expert-derived results, underlining its clinical relevance. This study offers two key findings. First, we present an ensemble algorithm (this https URL) that detects and segments ischemic stroke lesions on DWI across diverse scenarios on par with expert (neuro)radiologists. Second, we show the potential for biomedical challenge outputs to extend beyond the challenge's initial objectives, demonstrating their real-world clinical applicability.
https://arxiv.org/abs/2403.19425
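The ensembling itself, combining several models' lesion probability maps into one segmentation, can be illustrated simply. Plain probability averaging with a 0.5 threshold is an assumption here; the challenge ensemble is more sophisticated than this.

```python
import numpy as np

def ensemble_segmentation(prob_maps, threshold: float = 0.5) -> np.ndarray:
    """Average per-model lesion probability maps and threshold to a binary mask.

    prob_maps: list of (D, H, W) arrays in [0, 1], one per member algorithm.
    """
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (mean_prob >= threshold).astype(np.uint8)

# Example: three member algorithms voting on one DWI volume.
members = [np.random.rand(32, 128, 128) for _ in range(3)]
lesion_mask = ensemble_segmentation(members)
```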
Treatment planning, which is a critical component of the radiotherapy workflow, is typically carried out by a medical physicist in a time-consuming trial-and-error manner. Previous studies have proposed knowledge-based or deep-learning-based methods for predicting dose distribution maps to assist medical physicists in improving the efficiency of treatment planning. However, these dose prediction methods usually fail to effectively utilize distance information between surrounding tissues and targets or organs-at-risk (OARs). Moreover, they are poor at maintaining the distribution characteristics of ray paths in the predicted dose distribution maps, resulting in a loss of valuable information. In this paper, we propose a distance-aware diffusion model (DoseDiff) for precise prediction of dose distribution. We define dose prediction as a sequence of denoising steps, wherein the predicted dose distribution map is generated with the conditions of the computed tomography (CT) image and signed distance maps (SDMs). The SDMs are obtained by distance transformation from the masks of targets or OARs, which provide the distance from each pixel in the image to the outline of the targets or OARs. We further propose a multi-encoder and multi-scale fusion network (MMFNet) that incorporates multi-scale and transformer-based fusion modules to enhance information fusion between the CT image and SDMs at the feature level. We evaluate our model on two in-house datasets and a public dataset, respectively. The results demonstrate that our DoseDiff method outperforms state-of-the-art dose prediction methods in terms of both quantitative performance and visual quality.
https://arxiv.org/abs/2306.16324
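The signed distance maps are straightforward to compute from the target/OAR masks. A small sketch with scipy follows; the sign convention (negative inside the structure) and 2D slice-wise computation are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask: np.ndarray, spacing=(1.0, 1.0)) -> np.ndarray:
    """Signed Euclidean distance to the outline of a binary target/OAR mask.

    Positive outside the structure, negative inside (assumed convention).
    """
    mask = mask.astype(bool)
    outside = distance_transform_edt(~mask, sampling=spacing)
    inside = distance_transform_edt(mask, sampling=spacing)
    return outside - inside

# Example: SDM for a square "target" on a 128x128 slice.
mask = np.zeros((128, 128), dtype=np.uint8)
mask[40:80, 40:80] = 1
sdm = signed_distance_map(mask)
```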
Recent progress in diffusion models has profoundly enhanced the fidelity of image generation. However, this has raised concerns about copyright infringement. While prior methods have introduced adversarial perturbations to prevent style imitation, most are accompanied by degradation of the artwork's visual quality. Recognizing the importance of maintaining visual quality, we develop a visually improved protection method that preserves its protection capability. To this end, we create a perceptual map to identify the areas most sensitive to human eyes. We then adjust the protection intensity guided by an instance-aware refinement. We also integrate a bank of perceptual constraints to further improve imperceptibility. Results show that our method substantially elevates the quality of the protected image without compromising protection efficacy.
https://arxiv.org/abs/2403.19254
While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centered images, novel challenges arise with the nuanced task of "identity fine editing": precisely modifying specific features of a subject while maintaining its inherent identity and context. Existing personalization methods either require time-consuming optimization or learn additional encoders, and are adept at "identity re-contextualization". However, they often struggle with detailed and sensitive tasks like human face editing. To address these challenges, we introduce DreamSalon, a noise-guided, staged-editing framework that uniquely focuses on detailed image manipulation and identity-context preservation. By discerning editing and boosting stages via the frequency and gradient of predicted noises, DreamSalon first performs detailed manipulations on specific features in the editing stage, guided by high-frequency information, and then employs stochastic denoising in the boosting stage to improve image quality. For more precise editing, DreamSalon semantically mixes source and target textual prompts, guided by differences in their embedding covariances, to direct the model's focus to specific manipulation areas. Our experiments demonstrate DreamSalon's ability to efficiently and faithfully edit fine details on human faces, outperforming existing methods both qualitatively and quantitatively.
https://arxiv.org/abs/2403.19235
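The stage split is decided from the frequency content and gradient of the predicted noise. Below is a sketch of one such signal, the high-frequency energy ratio of a predicted-noise map; the radial cutoff and the exact statistic are assumptions, not the paper's criterion.

```python
import torch

def high_freq_ratio(noise_pred: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Fraction of spectral energy above a radial cutoff, per image.

    noise_pred: (B, C, H, W) noise predicted by the diffusion model at one step.
    One possible signal for separating editing and boosting stages.
    """
    spec = torch.fft.fftshift(torch.fft.fft2(noise_pred), dim=(-2, -1)).abs() ** 2
    b, c, h, w = spec.shape
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, h),
                            torch.linspace(-0.5, 0.5, w), indexing="ij")
    high = (yy ** 2 + xx ** 2).sqrt() > cutoff          # (H, W) boolean mask
    return spec[..., high].sum(dim=(-1, -2)) / spec.sum(dim=(-1, -2, -3))

ratios = high_freq_ratio(torch.randn(2, 4, 64, 64))
```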
Image stitching from different captures often results in non-rectangular boundaries, which is often considered unappealing. To solve non-rectangular boundaries, current solutions involve cropping, which discards image content; inpainting, which can introduce unrelated content; or warping, which can distort non-linear features and introduce artifacts. To overcome these issues, we introduce a novel diffusion-based learning framework, RecDiffusion, for image stitching rectangling. This framework combines Motion Diffusion Models (MDM), which generate motion fields that effectively transition from the stitched image's irregular borders to a geometrically corrected intermediary, with Content Diffusion Models (CDM) for image detail refinement. Notably, our sampling process utilizes a weighted map to identify regions needing correction during each iteration of CDM. Our RecDiffusion ensures geometric accuracy and overall visual appeal, surpassing all previous methods in both quantitative and qualitative measures when evaluated on public benchmarks. Code is released at this https URL.
https://arxiv.org/abs/2403.19164
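Applying a predicted motion field to the stitched image, the step that moves its irregular borders toward a rectangular intermediary, is a standard dense warp. A sketch with torch.nn.functional.grid_sample follows; the flow is assumed to be given in pixel units, which may differ from the paper's parameterization.

```python
import torch
import torch.nn.functional as F

def warp_with_motion_field(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp an image with a dense motion field.

    image: (B, C, H, W); flow: (B, 2, H, W) pixel displacements (dx, dy).
    """
    b, _, h, w = image.shape
    yy, xx = torch.meshgrid(torch.arange(h, dtype=image.dtype),
                            torch.arange(w, dtype=image.dtype), indexing="ij")
    base = torch.stack((xx, yy), dim=0).unsqueeze(0)   # (1, 2, H, W) pixel grid
    coords = base + flow                               # absolute sampling positions
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)   # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

out = warp_with_motion_field(torch.rand(1, 3, 256, 384), torch.zeros(1, 2, 256, 384))
```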
Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models aimed to address these limitations and improve fidelity. However, they still face challenges, including extensive sampling times and difficulties in maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, dubbed MoDiTalker. We introduce two modules: audio-to-motion (AToM), designed to generate synchronized lip motion from audio, and motion-to-video (MToV), designed to produce high-quality head video following the generated motion. AToM excels in capturing subtle lip movements by leveraging an audio attention mechanism. In addition, MToV enhances temporal consistency by leveraging an efficient tri-plane representation. Our experiments conducted on standard benchmarks demonstrate that our model achieves superior performance compared to existing models. We also provide comprehensive ablation studies and user study results.
https://arxiv.org/abs/2403.19144
Diffusion models have revolutionized image synthesis, setting new benchmarks in quality and creativity. However, their widespread adoption is hindered by the intensive computation required during the iterative denoising process. Post-training quantization (PTQ) presents a solution to accelerate sampling, albeit at the expense of sample quality, especially in low-bit settings. Addressing this, our study introduces a unified Quantization Noise Correction Scheme (QNCD), aimed at minimizing quantization noise throughout the sampling process. We identify two primary quantization challenges: intra and inter quantization noise. Intra quantization noise, mainly exacerbated by embeddings in the resblock module, extends activation quantization ranges, increasing disturbances in each single denoising step. Inter quantization noise, in turn, stems from cumulative quantization deviations across the entire denoising process, altering data distributions step by step. QNCD combats these through embedding-derived feature smoothing, which eliminates intra quantization noise, and an effective runtime noise estimation module, which dynamically filters inter quantization noise. Extensive experiments demonstrate that our method outperforms previous quantization methods for diffusion models, achieving lossless results in W4A8 and W8A8 quantization settings on ImageNet (LDM-4). Code is available at: this https URL
https://arxiv.org/abs/2403.19140
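The quantization noise being corrected can be made concrete with a simple fake-quantization sketch: quantize the weights and activations of one layer to 8 bits and measure the deviation this injects into its output. Symmetric per-tensor quantization is an illustrative choice here, not the paper's scheme.

```python
import torch

def fake_quant(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    # Symmetric per-tensor uniform quantization (quantize-dequantize).
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def quant_noise(layer: torch.nn.Linear, x: torch.Tensor, n_bits: int = 8):
    # Deviation a W8A8 layer introduces relative to the full-precision output.
    y_fp = layer(x)
    y_q = torch.nn.functional.linear(fake_quant(x, n_bits),
                                     fake_quant(layer.weight, n_bits),
                                     layer.bias)
    return (y_q - y_fp).pow(2).mean().sqrt()  # RMS quantization noise

layer = torch.nn.Linear(320, 320)
print(quant_noise(layer, torch.randn(16, 320)))
```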
Prompt engineering is effective for controlling the output of text-to-image (T2I) generative models, but it is also laborious due to the need for manually crafted prompts. This challenge has spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, and produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically identifies human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompts distribution for given reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles and images across multiple T2I models, including Stable Diffusion, DALL-E, and Midjourney.
https://arxiv.org/abs/2403.19103
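The described loop, propose a prompt, generate with the black-box T2I model, score against the reference images, and let the LLM revise, can be sketched with placeholder callables. Here `llm`, `t2i`, and `clip_similarity` are hypothetical stand-ins, and the revision instruction is illustrative rather than the paper's actual prompting strategy.

```python
from typing import Callable

def prism_style_search(initial_hint: str,
                       llm: Callable[[str], str],
                       t2i: Callable[[str], object],
                       clip_similarity: Callable[[object], float],
                       n_iters: int = 5) -> str:
    """Iteratively refine a human-readable prompt for a black-box T2I model."""
    best_prompt, best_score = initial_hint, float("-inf")
    prompt = initial_hint
    for _ in range(n_iters):
        image = t2i(prompt)                 # black-box generation
        score = clip_similarity(image)      # similarity to the reference images
        if score > best_score:
            best_prompt, best_score = prompt, score
        # In-context revision: show the LLM the current prompt and its score.
        prompt = llm(f"The prompt '{prompt}' scored {score:.3f} against the "
                     f"reference images. Propose an improved, concise prompt.")
    return best_prompt
```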
Wearable collaborative robots stand to assist human wearers who need fall prevention assistance or wear exoskeletons. Such a robot needs to be able to predict the ego motion of the wearer based on egocentric vision and the surrounding scene. In this work, we leveraged body-mounted cameras and sensors to anticipate the trajectory of human wearers through complex surroundings. To facilitate research in ego-motion prediction, we have collected a comprehensive walking scene navigation dataset centered on the user's perspective. We present a method to predict human motion conditioned on the surrounding static scene. Our method leverages a diffusion model to produce a distribution of potential future trajectories, taking into account the user's observation of the environment. We introduce a compact representation to encode the user's visual memory of the surroundings, as well as an efficient sample-generating technique to speed up real-time inference of a diffusion model. We ablate our model and compare it to baselines, and results show that our model outperforms existing methods on key metrics of collision avoidance and trajectory mode coverage.
https://arxiv.org/abs/2403.19026
Shape plays an important role in computer graphics, offering informative features to convey an object's morphology and functionality. Shape analysis in brain imaging can help interpret structural and functional correlations of the human brain. In this work, we investigate the shape of the brain's 3D white matter connections and its potential predictive relationship to human cognitive function. We reconstruct brain connections as sequences of 3D points using diffusion magnetic resonance imaging (dMRI) tractography. To describe each connection, we extract 12 shape descriptors in addition to traditional dMRI connectivity and tissue microstructure features. We introduce a novel framework, the Shape-fused Fiber Cluster Transformer (SFFormer), that leverages a multi-head cross-attention feature fusion module to predict subject-specific language performance based on dMRI tractography. We assess the performance of the method on a large dataset of 1065 healthy young adults. The results demonstrate that both the transformer-based SFFormer model and its inter/intra-feature fusion of shape, microstructure, and connectivity features are informative, and together they improve the prediction of subject-specific language performance scores. Overall, our results indicate that the shape of the brain's connections is predictive of human language function.
https://arxiv.org/abs/2403.19001
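Two representative kinds of shape descriptors, streamline length and a simple curvature statistic, are easy to compute from a connection represented as an ordered 3D point sequence. The formulas below are illustrative; the paper's 12 descriptors are not enumerated here.

```python
import numpy as np

def streamline_length(points: np.ndarray) -> float:
    # points: (N, 3) ordered 3D coordinates along one white-matter connection.
    return float(np.linalg.norm(np.diff(points, axis=0), axis=1).sum())

def mean_turning_angle(points: np.ndarray) -> float:
    # Mean angle between consecutive segments: a simple curvature proxy.
    seg = np.diff(points, axis=0)
    seg = seg / np.linalg.norm(seg, axis=1, keepdims=True)
    cos = np.clip((seg[:-1] * seg[1:]).sum(axis=1), -1.0, 1.0)
    return float(np.arccos(cos).mean())

pts = np.cumsum(np.random.randn(100, 3) * 0.5, axis=0)   # toy streamline
features = [streamline_length(pts), mean_turning_angle(pts)]
```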