Large-scale text-to-image diffusion models can generate high-fidelity images with powerful compositional ability. However, these models are typically trained on an enormous amount of Internet data, often containing copyrighted material, licensed images, and personal photos. Furthermore, they have been found to replicate the style of various living artists or memorize exact training samples. How can we remove such copyrighted concepts or images without retraining the model from scratch? To achieve this goal, we propose an efficient method of ablating concepts in the pretrained model, i.e., preventing the generation of a target concept. Our algorithm learns to match the image distribution for a target style, instance, or text prompt we wish to ablate to the distribution corresponding to an anchor concept. This prevents the model from generating target concepts given its text condition. Extensive experiments show that our method can successfully prevent the generation of the ablated concept while preserving closely related concepts in the model.
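As a rough illustration of the kind of objective described above, the sketch below matches a trainable noise predictor's output on the target-concept prompt to a frozen copy's output on the anchor prompt. The models, the noising schedule, and all names are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ablation_loss(eps_model, frozen_model, x0, t, c_target, c_anchor, noise=None):
    """Match the trainable model's prediction under the target prompt to the
    frozen model's prediction under the anchor prompt (concept-ablation sketch)."""
    if noise is None:
        noise = torch.randn_like(x0)
    alpha = 1.0 - t.view(-1, 1, 1, 1)             # toy noising schedule, not a real one
    x_t = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
    with torch.no_grad():
        eps_anchor = frozen_model(x_t, t, c_anchor)   # supervision from the anchor concept
    eps_pred = eps_model(x_t, t, c_target)            # prediction under the target concept
    return F.mse_loss(eps_pred, eps_anchor)

dummy = lambda x, t, c: torch.zeros_like(x)           # stand-in noise predictors
print(ablation_loss(dummy, dummy, torch.randn(2, 4, 8, 8), torch.rand(2), None, None))
```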
https://arxiv.org/abs/2303.13516
Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior" - real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties of the relation prompt: 1) The relation prompt should capture the interaction between objects, enforced by the preposition prior. 2) The relation prompt should be disentangled away from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations.
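A toy sketch of the optimization this suggests: a single learnable relation embedding trained with a (placeholder) denoising term plus a contrastive term that steers it toward a basis of preposition embeddings and away from other words. Dimensions, temperatures, and weights are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 768                                                   # text-embedding width (assumed)
relation = torch.randn(d, requires_grad=True)             # the learnable relation token
prepositions = F.normalize(torch.randn(12, d), dim=-1)    # basis prepositional words
other_words = F.normalize(torch.randn(100, d), dim=-1)    # e.g., appearance words

opt = torch.optim.Adam([relation], lr=1e-3)
for step in range(100):
    # Placeholder for the diffusion reconstruction loss with the relation token
    # inserted into the prompt; a real implementation would denoise exemplars here.
    denoise = relation.pow(2).mean()
    # Relation-steering term: the token should be explainable by prepositions.
    r = F.normalize(relation, dim=-1)
    logits = torch.cat([r @ prepositions.T, r @ other_words.T]) / 0.07
    target = torch.zeros_like(logits)
    target[: len(prepositions)] = 1.0 / len(prepositions)
    steer = -(target * F.log_softmax(logits, dim=0)).sum()
    (denoise + 0.1 * steer).backward()
    opt.step()
    opt.zero_grad()
```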
https://arxiv.org/abs/2303.13495
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background temporally consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open-sourced at: this https URL.
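A minimal sketch of modification (ii): self-attention in which every frame's queries attend to the first frame's keys and values. Shapes and scaling follow standard attention; nothing here is the authors' code.

```python
import torch

def cross_frame_attention(q, k, v):
    """Self-attention reprogrammed so every frame attends to the FIRST frame's
    keys/values, preserving appearance across frames (illustrative sketch).
    q, k, v: (frames, tokens, dim)."""
    k0 = k[:1].expand_as(k)          # broadcast first-frame keys to all frames
    v0 = v[:1].expand_as(v)
    attn = torch.softmax(q @ k0.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v0

frames, tokens, dim = 8, 64, 32
q = torch.randn(frames, tokens, dim)
out = cross_frame_attention(q, torch.randn_like(q), torch.randn_like(q))
print(out.shape)   # torch.Size([8, 64, 32])
```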
https://arxiv.org/abs/2303.13439
Diffusion-based models for text-to-image generation have gained immense popularity due to recent advancements in efficiency, accessibility, and quality. Although it is becoming increasingly feasible to perform inference with these systems using consumer-grade GPUs, training them from scratch still requires access to large datasets and significant computational resources. In the case of medical image generation, the availability of large, publicly accessible datasets that include text reports is limited due to legal and ethical concerns. While training a diffusion model on a private dataset may address this issue, it is not always feasible for institutions lacking the necessary computational resources. This work demonstrates that pre-trained Stable Diffusion models, originally trained on natural images, can be adapted to various medical imaging modalities by training text embeddings with textual inversion. In this study, we conducted experiments using medical datasets comprising only 100 samples from three medical modalities. Embeddings were trained in a matter of hours, while still retaining diagnostic relevance in image generation. Experiments were designed to achieve several objectives. Firstly, we fine-tuned the training and inference processes of textual inversion, revealing that larger embeddings and more examples are required. Secondly, we validated our approach by demonstrating a 2% increase in the diagnostic accuracy (AUC) for detecting prostate cancer on MRI, which is a challenging multi-modal imaging modality, from 0.78 to 0.80. Thirdly, we performed simulations by interpolating between healthy and diseased states, combining multiple pathologies, and inpainting to show embedding flexibility and control of disease appearance. Finally, the embeddings trained in this study are small (less than 1 MB), which facilitates easy sharing of medical data with reduced privacy concerns.
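The interpolation experiment can be pictured in a few lines: blend two learned textual-inversion embeddings and feed the result as the text condition. The embedding shape (8 vectors of width 768) is an assumption, chosen only to show why such files stay well under 1 MB.

```python
import torch

# Hypothetical learned textual-inversion embeddings (8 vectors of width 768 each).
healthy = torch.randn(8, 768)
diseased = torch.randn(8, 768)

def interpolate_condition(alpha: float) -> torch.Tensor:
    """Blend two learned concept embeddings; alpha=0 -> healthy, alpha=1 -> diseased."""
    return (1.0 - alpha) * healthy + alpha * diseased

for a in (0.0, 0.5, 1.0):
    emb = interpolate_condition(a)
    print(a, emb.shape, f"{emb.element_size() * emb.nelement() / 1e6:.3f} MB")
```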
https://arxiv.org/abs/2303.13430
Human mesh recovery (HMR) provides rich human body information for various real-world applications such as gaming, human-computer interaction, and virtual reality. Compared to single image-based methods, video-based methods can utilize temporal information to further improve performance by incorporating human body motion priors. However, many-to-many approaches such as VIBE suffer from a lack of motion smoothness and from temporal inconsistency, while many-to-one approaches such as TCMR and MPS-Net rely on future frames, which is non-causal and time-inefficient during inference. To address these challenges, a novel Diffusion-Driven Transformer-based framework (DDT) for video-based HMR is presented. DDT is designed to decode specific motion patterns from the input sequence, enhancing motion smoothness and temporal consistency. As a many-to-many approach, the decoder of our DDT outputs the human meshes of all frames, making DDT more viable for real-world applications where time efficiency is crucial and a causal model is desired. Extensive experiments are conducted on the widely used datasets (Human3.6M, MPI-INF-3DHP, and 3DPW), which demonstrate the effectiveness and efficiency of our DDT.
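To make the "causal, many-to-many" point concrete, here is a generic causally masked sequence decoder that emits one set of mesh parameters per frame without looking at future frames. It is purely illustrative and not the DDT architecture; the 85-dimensional output is an assumed SMPL-style parameter vector.

```python
import torch
import torch.nn as nn

class CausalMeshDecoder(nn.Module):
    """Causal, many-to-many sequence decoder: per-frame features in,
    per-frame mesh parameters out (illustrative sketch only)."""
    def __init__(self, feat_dim=2048, model_dim=256, out_dim=85, layers=3):
        super().__init__()
        self.inp = nn.Linear(feat_dim, model_dim)
        enc_layer = nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.out = nn.Linear(model_dim, out_dim)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        T = feats.shape[1]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(self.inp(feats), mask=causal)   # no access to future frames
        return self.out(h)                         # (B, T, out_dim), one mesh per frame

model = CausalMeshDecoder()
print(model(torch.randn(2, 16, 2048)).shape)       # torch.Size([2, 16, 85])
```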
https://arxiv.org/abs/2303.13397
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model now the most popular generative model, numerous works have attempted two active tasks: text-to-speech and speech enhancement. This work conducts a survey on audio diffusion models, which is complementary to existing surveys that either lack the recent progress of diffusion-based speech synthesis or highlight an overall picture of applying the diffusion model in multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion models. For the text-to-speech task, we divide the methods into three categories based on the stage where the diffusion model is adopted: acoustic model, vocoder, and end-to-end framework. Moreover, we categorize various speech enhancement tasks by whether certain signals are removed from or added to the input speech. Comparisons of experimental results and discussions are also covered in this survey.
https://arxiv.org/abs/2303.13336
The vulnerability of machine learning models to adversarial attacks has been attracting considerable attention in recent years. Most existing studies focus on the behavior of stand-alone single-agent learners. In comparison, this work studies adversarial training over graphs, where individual agents are subjected to perturbations of varied strength levels across space. It is expected that interactions by linked agents, and the heterogeneity of the attack models that are possible over the graph, can help enhance robustness in view of the coordination power of the group. Using a min-max formulation of diffusion learning, we develop a decentralized adversarial training framework for multi-agent systems. We analyze the convergence properties of the proposed scheme for both convex and non-convex environments, and illustrate the enhanced robustness to adversarial attacks.
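Here, "diffusion learning" refers to diffusion adaptation over a graph (adapt-then-combine), not diffusion generative models. The sketch below runs an FGSM-style inner maximization per agent, a local gradient step, and then a neighborhood averaging step; the combination matrix, least-squares losses, and step sizes are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)
n_agents, dim = 5, 10
W = torch.full((n_agents, n_agents), 1.0 / n_agents)    # illustrative combination matrix
w = [torch.zeros(dim, requires_grad=True) for _ in range(n_agents)]
data = [(torch.randn(100, dim), torch.randn(100)) for _ in range(n_agents)]

def agent_loss(wk, X, y):
    return ((X @ wk - y) ** 2).mean()

mu, eps = 0.05, 0.1        # step size and per-agent perturbation budget
for it in range(50):
    # Adapt: each agent trains on its own worst-case (FGSM-style) perturbation.
    psi = []
    for k in range(n_agents):
        X, y = data[k]
        X_adv = X.clone().requires_grad_(True)
        agent_loss(w[k], X_adv, y).backward()
        X_adv = (X + eps * X_adv.grad.sign()).detach()      # inner maximization
        loss = agent_loss(w[k], X_adv, y)                   # outer minimization
        g, = torch.autograd.grad(loss, w[k])
        psi.append((w[k] - mu * g).detach())
    # Combine: average the intermediate iterates with neighbors via the graph weights.
    w = [sum(W[k, l] * psi[l] for l in range(n_agents)).requires_grad_(True)
         for k in range(n_agents)]
```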
https://arxiv.org/abs/2303.13326
The advent of open-source AI communities has produced a cornucopia of powerful text-guided diffusion models that are trained on various datasets. However, few explorations have been conducted on ensembling such models to combine their strengths. In this work, we propose a simple yet effective method called Saliency-aware Noise Blending (SNB) that can empower the fused text-guided diffusion models to achieve more controllable generation. Specifically, we experimentally find that the responses of classifier-free guidance are highly related to the saliency of generated images. Thus we propose to trust different models in their areas of expertise by blending the predicted noises of two diffusion models in a saliency-aware manner. SNB is training-free and can be completed within a DDIM sampling process. Additionally, it can automatically align the semantics of two noise spaces without requiring additional annotations such as masks. Extensive experiments show the impressive effectiveness of SNB in various applications. Project page is available at this https URL.
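A sketch of the blending step under one plausible reading: use the magnitude of each model's classifier-free-guidance response as a per-pixel saliency weight when mixing their predicted noises inside a DDIM step. The saliency definition and tensor shapes are assumptions, not the paper's exact formulation.

```python
import torch

def saliency_aware_blend(eps_a, eps_b, guid_a, guid_b):
    """Blend the predicted noises of two diffusion models, trusting each model
    where its classifier-free-guidance response (cond - uncond) is stronger.
    All tensors are (B, C, H, W); illustrative sketch of the idea only."""
    sal_a = guid_a.abs().mean(dim=1, keepdim=True)   # per-pixel response magnitude
    sal_b = guid_b.abs().mean(dim=1, keepdim=True)
    mask = sal_a / (sal_a + sal_b + 1e-8)            # soft, training-free weighting
    return mask * eps_a + (1.0 - mask) * eps_b

B, C, H, W = 1, 4, 64, 64
eps_a, eps_b = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
guid_a, guid_b = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
print(saliency_aware_blend(eps_a, eps_b, guid_a, guid_b).shape)
```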
https://arxiv.org/abs/2303.13126
Deep generative models dominate the existing literature in layout pattern generation. However, leaving the guarantee of legality to an inexplicable neural network could be problematic in several applications. In this paper, we propose DiffPattern to generate reliable layout patterns. DiffPattern introduces a novel diverse topology generation method via a discrete diffusion model with a compute-efficient, lossless layout pattern representation. A white-box pattern assessment is then utilized to generate legal patterns given the desired design rules. Our experiments on several benchmark settings show that DiffPattern significantly outperforms existing baselines and is capable of synthesizing reliable layout patterns.
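The white-box assessment stage can be imagined as explicit, rule-based checks on the generated binary layout grid. The minimum-width check below (erosion-based, using SciPy) is a generic example of such a design rule, not a rule set from the paper.

```python
import numpy as np
from scipy import ndimage

def passes_min_width(pattern: np.ndarray, min_width: int) -> bool:
    """White-box legality check (sketch): every polygon in the binary layout
    grid must survive an erosion of roughly min_width/2 without vanishing."""
    labeled, n = ndimage.label(pattern)
    eroded = ndimage.binary_erosion(pattern, iterations=max(min_width // 2, 1))
    surviving = set(np.unique(labeled[eroded])) - {0}
    return len(surviving) == n          # every component kept some core area

pattern = np.zeros((64, 64), dtype=bool)
pattern[10:20, 10:40] = True            # a 10-pixel-wide wire
pattern[30:32, 10:40] = True            # a 2-pixel-wide wire (too thin)
print(passes_min_width(pattern, min_width=4))   # False
```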
https://arxiv.org/abs/2303.13060
Face recognition models embed a face image into a low-dimensional identity vector containing abstract encodings of identity-specific facial features that allow individuals to be distinguished from one another. We tackle the challenging task of inverting the latent space of pre-trained face recognition models without full model access (i.e., the black-box setting). A variety of methods have been proposed in the literature for this task, but they have serious shortcomings such as a lack of realistic outputs, long inference times, and strong requirements on the dataset and on access to the face recognition model. Through an analysis of the black-box inversion problem, we show that the conditional diffusion model loss naturally emerges and that we can effectively sample from the inverse distribution even without an identity-specific loss. Our method, named identity denoising diffusion probabilistic model (ID3PM), leverages the stochastic nature of the denoising diffusion process to produce high-quality, identity-preserving face images with various backgrounds, lighting, poses, and expressions. We demonstrate state-of-the-art performance in terms of identity preservation and diversity, both qualitatively and quantitatively. Our method is the first black-box face recognition model inversion method that offers intuitive control over the generation process and does not suffer from any of the common shortcomings of competing methods.
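A toy identity-conditioned sampler in the spirit described above: a DDIM-style loop whose noise predictor takes the face-recognition identity vector as its condition. The schedule, shapes, and the stand-in predictor are placeholders, not the ID3PM implementation.

```python
import torch

@torch.no_grad()
def sample_id_conditioned(eps_model, id_vec, steps=50, shape=(1, 3, 64, 64)):
    """Deterministic DDIM-style sampler conditioned on an identity vector.
    eps_model(x_t, t, id_vec) -> predicted noise; everything here is a sketch."""
    x = torch.randn(shape)
    alphas = torch.linspace(0.01, 0.999, steps)      # illustrative alpha-bar schedule
    for i, a in enumerate(alphas):
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps = eps_model(x, t, id_vec)
        x0_hat = (x - (1 - a).sqrt() * eps) / a.sqrt()
        a_next = alphas[i + 1] if i + 1 < steps else torch.tensor(1.0)
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps
    return x

dummy = lambda x, t, c: torch.zeros_like(x)          # stand-in noise predictor
print(sample_id_conditioned(dummy, id_vec=torch.randn(1, 512)).shape)
```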
https://arxiv.org/abs/2303.13006
One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity. Previous incremental methods with only one latent space cannot optimize these two targets simultaneously, so they expand the information bottleneck during training to shift the optimization from disentanglement to reconstruction. However, a large bottleneck loses the constraint of disentanglement, causing the information diffusion problem. To tackle this issue, we present DeVAE, a novel decremental variational autoencoder with disentanglement-invariant transformations that optimizes multiple objectives in different layers, balancing disentanglement and reconstruction fidelity by gradually decreasing the information bottleneck of diverse latent spaces. Benefiting from the multiple latent spaces, DeVAE allows simultaneous optimization of multiple objectives, optimizing reconstruction while keeping the constraint of disentanglement and avoiding information diffusion. DeVAE is also compatible with large models with high-dimensional latent spaces. Experimental results on dSprites and Shapes3D show that DeVAE achieves a good balance between disentanglement and reconstruction, and is tolerant of hyperparameter choices and high-dimensional latent spaces.
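One way to read the "decremental bottleneck" idea is as a capacity-style KL constraint per latent space whose capacity shrinks over training. The sketch below encodes that reading with made-up capacities and weights; it is not the DeVAE objective itself.

```python
import torch
import torch.nn.functional as F

def devae_style_loss(recon, x, mus, logvars, step, total_steps,
                     start_caps=(25.0, 15.0), end_caps=(5.0, 3.0), gamma=10.0):
    """Sketch of a multi-latent-space VAE objective whose per-layer information
    capacity is *decreased* over training (capacity-constrained KL terms).
    mus/logvars: one (B, d_i) pair per latent space. Illustrative values only."""
    loss = F.mse_loss(recon, x)
    frac = step / total_steps
    for mu, logvar, c0, c1 in zip(mus, logvars, start_caps, end_caps):
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
        cap = c0 + (c1 - c0) * frac          # capacity shrinks as training proceeds
        loss = loss + gamma * (kl - cap).abs()
    return loss

B = 4
x = torch.rand(B, 3, 32, 32)
mus = [torch.zeros(B, 10), torch.zeros(B, 6)]
logvars = [torch.zeros(B, 10), torch.zeros(B, 6)]
print(devae_style_loss(x, x, mus, logvars, step=100, total_steps=1000))
```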
https://arxiv.org/abs/2303.12959
Crowd counting is a key aspect of crowd analysis and has been typically accomplished by estimating a crowd-density map and summing over the density values. However, this approach suffers from background noise accumulation and loss of density due to the use of broad Gaussian kernels to create the ground truth density maps. This issue can be overcome by narrowing the Gaussian kernel. However, existing approaches perform poorly when trained with such ground truth density maps. To overcome this limitation, we propose using conditional diffusion models to predict density maps, as diffusion models are known to model complex distributions well and show high fidelity to training data during crowd-density map generation. Furthermore, as the intermediate time steps of the diffusion process are noisy, we incorporate a regression branch for direct crowd estimation only during training to improve the feature learning. In addition, owing to the stochastic nature of the diffusion model, and unlike existing crowd-counting pipelines, we produce multiple density maps to improve the counting performance. Further, we depart from plain density summation and instead use contour detection followed by summation as the counting operation, which is more immune to background noise. We conduct extensive experiments on public datasets to validate the effectiveness of our method. Specifically, our novel crowd-counting pipeline reduces the counting error by up to 6% on JHU-CROWD++ and up to 7% on UCF-QNRF.
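The counting operation (detect blobs, then sum density only inside them) can be sketched with connected components; the threshold and the toy density map below are illustrative, not the paper's settings.

```python
import numpy as np
from scipy import ndimage

def count_from_density(density: np.ndarray, thresh: float = 0.05) -> float:
    """Counting by contour/component detection followed by summation (sketch):
    sum the density only inside detected blobs, which suppresses diffuse
    background noise compared with summing the whole map."""
    mask = density > thresh
    labeled, n_blobs = ndimage.label(mask)
    return float(sum(density[labeled == i].sum() for i in range(1, n_blobs + 1)))

# Toy density map: two narrow-kernel "people" plus a low-level noise floor.
density = np.random.default_rng(0).uniform(0, 0.02, (64, 64))
density[10:13, 10:13] += 1.0 / 9
density[40:43, 40:43] += 1.0 / 9
print(count_from_density(density))    # close to 2; density.sum() is ~43 due to the noise floor
```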
https://arxiv.org/abs/2303.12790
We propose a method for editing NeRF scenes with text-instructions. Given a NeRF of a scene and the collection of images used to reconstruct it, our method uses an image-conditioned diffusion model (InstructPix2Pix) to iteratively edit the input images while optimizing the underlying scene, resulting in an optimized 3D scene that respects the edit instruction. We demonstrate that our proposed method is able to edit large-scale, real-world scenes, and is able to accomplish more realistic, targeted edits than prior work.
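The control flow of "edit the inputs while optimizing the scene" can be sketched as an iterative dataset update; `nerf`, `edit_image` (an InstructPix2Pix-style editor), and the dataset entries below are stubs, so only the loop structure is meaningful.

```python
import random

def iterative_dataset_update(nerf, dataset, edit_image, instruction,
                             rounds=1000, steps_per_round=10):
    """Sketch of an edit-while-optimizing loop: repeatedly edit one training view
    and keep optimizing the scene on the updated dataset (placeholders throughout)."""
    for _ in range(rounds):
        i = random.randrange(len(dataset))
        rendered = nerf.render(dataset[i]["camera"])          # current model's view
        # Edit the rendering, conditioned on the original capture + instruction,
        # and swap it into the training set.
        dataset[i]["image"] = edit_image(rendered, dataset[i]["original"], instruction)
        for _ in range(steps_per_round):
            nerf.train_step(dataset)                          # continue scene optimization
    return nerf

class _StubNeRF:                                              # minimal stand-ins
    def render(self, cam): return cam
    def train_step(self, data): pass

data = [{"camera": k, "original": k, "image": k} for k in range(4)]
iterative_dataset_update(_StubNeRF(), data, lambda r, o, instr: r, "make it snowy",
                         rounds=3, steps_per_round=1)
```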
https://arxiv.org/abs/2303.12789
Recent works on generalizable NeRFs have shown promising results on novel view synthesis from single or few images. However, such models have rarely been applied to downstream tasks beyond synthesis, such as semantic understanding and parsing. In this paper, we propose a novel framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models (e.g., DINO, Latent Diffusion). FeatureNeRF lifts 2D pre-trained foundation models to 3D space via neural rendering, and then extracts deep features for 3D query points from NeRF MLPs. Consequently, it allows mapping 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks. We evaluate FeatureNeRF on tasks of 2D/3D semantic keypoint transfer and 2D/3D object part segmentation. Our extensive experiments demonstrate the effectiveness of FeatureNeRF as a generalizable 3D semantic feature extractor. Our project page is available at this https URL.
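A sketch of the distillation idea: volume-render a per-sample feature branch along each ray and regress it onto the 2D foundation model's feature at the corresponding pixel. The rendering weights follow the standard NeRF formula; the feature branch and shapes are assumptions, not the FeatureNeRF code.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(sigmas, ray_feats, teacher_feats, deltas):
    """Volume-render per-sample features and match them to a 2D teacher feature.
    sigmas: (R, S), ray_feats: (R, S, D), teacher_feats: (R, D), deltas: (R, S)."""
    alpha = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                              # standard volume-rendering weights
    rendered = (weights.unsqueeze(-1) * ray_feats).sum(dim=1)
    return F.mse_loss(rendered, teacher_feats)

R, S, D = 128, 32, 64
loss = feature_distillation_loss(torch.rand(R, S), torch.randn(R, S, D),
                                 torch.randn(R, D), torch.full((R, S), 0.1))
print(loss)
```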
https://arxiv.org/abs/2303.12786
Image diffusion models, trained on massive image collections, have emerged as the most versatile image generators in terms of quality and diversity. They support inverting real images and conditional (e.g., text) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to the future frames via self-attention feature injection to adapt the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate the effectiveness of the approach by extensive experimentation and compare it against four different prior and parallel efforts (on arXiv). We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
https://arxiv.org/abs/2303.12688
Image synthesis is expected to provide value for the translation of machine learning methods into clinical practice. Fundamental problems like model robustness, domain transfer, causal modelling, and operator training become approachable through synthetic data. In particular, heavily operator-dependent modalities like ultrasound imaging require robust frameworks for image and video generation. So far, video generation has only been possible by providing input data that is as rich as the output data, e.g., image sequence plus conditioning in, video out. However, clinical documentation is usually scarce and only single images are reported and stored, thus retrospective patient-specific analysis or the generation of rich training data becomes impossible with current approaches. In this paper, we extend elucidated diffusion models for video modelling to generate plausible video sequences from single images and arbitrary conditioning with clinical parameters. We explore this idea within the context of echocardiograms by looking into the variation of the Left Ventricle Ejection Fraction, the most essential clinical metric gained from these examinations. We use the publicly available EchoNet-Dynamic dataset for all our experiments. Our image-to-sequence approach achieves an R2 score of 93%, which is 38 points higher than recently proposed sequence-to-sequence generation methods. A public demo is available here: this http URL. Code and models will be available at: this https URL.
https://arxiv.org/abs/2303.12644
AI-Generated Content (AIGC) has gained widespread attention with the increasing efficiency of deep learning in content creation. AIGC, created with the assistance of artificial intelligence technology, includes various forms of content, among which AI-generated images (AGIs) have brought significant impact to society and have been applied to various fields such as entertainment, education, and social media. However, due to hardware limitations and varying technical proficiency, the quality of AGIs varies, necessitating refinement and filtering before practical use. Consequently, there is an urgent need for developing objective models to assess the quality of AGIs. Unfortunately, no research has been carried out to investigate perceptual quality assessment for AGIs specifically. Therefore, in this paper, we first discuss the major evaluation aspects for AGI quality assessment, such as technical issues, AI artifacts, unnaturalness, discrepancy, and aesthetics. Then we present the first perceptual AGI quality assessment database, AGIQA-1K, which consists of 1,080 AGIs generated from diffusion models. A well-organized subjective experiment is then conducted to collect the quality labels of the AGIs. Finally, we conduct a benchmark experiment to evaluate the performance of current image quality assessment (IQA) models.
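Benchmarking IQA models against subjective labels typically reports rank and linear correlations; the snippet below shows those metrics on synthetic scores (not AGIQA-1K data).

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr, kendalltau

def benchmark_iqa(predicted: np.ndarray, mos: np.ndarray) -> dict:
    """Correlation metrics commonly used to benchmark IQA models against
    subjective quality labels (illustrative scores only)."""
    return {
        "SROCC": spearmanr(predicted, mos)[0],
        "PLCC": pearsonr(predicted, mos)[0],
        "KROCC": kendalltau(predicted, mos)[0],
    }

rng = np.random.default_rng(0)
mos = rng.uniform(1, 5, 1080)                  # mean opinion scores
predicted = mos + rng.normal(0, 0.5, 1080)     # a hypothetical IQA model's output
print(benchmark_iqa(predicted, mos))
```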
https://arxiv.org/abs/2303.12618
Embodied agents operate in a structured world, often solving tasks with spatial, temporal, and permutation symmetries. Most algorithms for planning and model-based reinforcement learning (MBRL) do not take this rich geometric structure into account, leading to sample inefficiency and poor generalization. We introduce the Equivariant Diffuser for Generating Interactions (EDGI), an algorithm for MBRL and planning that is equivariant with respect to the product of the spatial symmetry group $\mathrm{SE(3)}$, the discrete-time translation group $\mathbb{Z}$, and the object permutation group $\mathrm{S}_n$. EDGI follows the Diffuser framework (Janner et al. 2022) in treating both learning a world model and planning in it as a conditional generative modeling problem, training a diffusion model on an offline trajectory dataset. We introduce a new $\mathrm{SE(3)} \times \mathbb{Z} \times \mathrm{S}_n$-equivariant diffusion model that supports multiple representations. We integrate this model in a planning loop, where conditioning and classifier-based guidance allow us to softly break the symmetry for specific tasks as needed. On navigation and object manipulation tasks, EDGI improves sample efficiency and generalization.
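As a tiny illustration of one factor of the symmetry group, the snippet below checks S_n (object-permutation) equivariance of a set-level denoiser: permuting the objects before or after applying the map gives the same result. The SE(3) and time-translation factors are not modeled here, and the denoiser is a toy.

```python
import torch

def denoiser(x):
    """A trivially permutation-equivariant map over a set of objects:
    a per-object transform plus a mean interaction term (sketch only)."""
    return x - 0.1 * x + 0.05 * x.mean(dim=0, keepdim=True)

x = torch.randn(6, 3)                      # 6 objects with 3 features each
perm = torch.randperm(6)
out1 = denoiser(x)[perm]
out2 = denoiser(x[perm])
print(torch.allclose(out1, out2))          # True: f(Px) = P f(x)
```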
https://arxiv.org/abs/2303.12410
In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion architecture for eXtremely Long video generation. Most current work generates long videos segment by segment sequentially, which normally leads to a gap between training on short videos and inferring long videos; moreover, sequential generation is inefficient. Instead, our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity. A global diffusion model is applied to generate the keyframes across the entire time range, and then local diffusion models recursively fill in the content between nearby frames. This simple yet effective strategy allows us to directly train on long videos (3376 frames) to reduce the training-inference gap, and makes it possible to generate all segments in parallel. To evaluate our model, we build the FlintstonesHD dataset, a new benchmark for long video generation. Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55 min to 26 s (by 94.26%) at the same hardware setting when generating 1024 frames. The homepage link is this https URL.
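The coarse-to-fine schedule can be sketched as follows: a global model places sparse keyframes over the whole clip, then a local model recursively fills the midpoints between neighbouring frames, segment by segment (and, in principle, in parallel). Both models below are placeholders, and the keyframe count and recursion depth are arbitrary choices for the sketch.

```python
import torch

def coarse_to_fine_video(global_model, local_model, prompt, length=1024, depth=3):
    """Diffusion-over-diffusion-style schedule (sketch): global keyframes first,
    then recursive local in-filling between neighbouring frames."""
    keyframe_idx = torch.linspace(0, length - 1, steps=9).long().tolist()
    frames = {i: global_model(prompt, i) for i in keyframe_idx}

    def fill(lo, hi, level):
        if level == 0 or hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        # Local model conditioned on the two surrounding (key)frames.
        frames[mid] = local_model(prompt, frames[lo], frames[hi])
        fill(lo, mid, level - 1)
        fill(mid, hi, level - 1)

    for lo, hi in zip(keyframe_idx[:-1], keyframe_idx[1:]):
        fill(lo, hi, depth)            # each segment could be filled in parallel
    return frames

g = lambda p, i: torch.zeros(3, 8, 8)          # stand-in global model
l = lambda p, a, b: (a + b) / 2                # stand-in local model
print(len(coarse_to_fine_video(g, l, "a cartoon scene")))
```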
https://arxiv.org/abs/2303.12346
We present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation compared to other feature representations like RGB images or CLIP encodings for text-based image segmentation. By training the segmentation models on the latent z-space, which creates a compressed representation across several domains like different forms of art, cartoons, illustrations, and photographs, we are also able to bridge the domain gap between real and AI-generated images. We show that the internal features of LDMs contain rich semantic information and present a technique in the form of LD-ZNet to further boost the performance of text-based segmentation. Overall, we show up to 6% improvement over standard baselines for text-to-image segmentation on natural images. For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques.
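A minimal sketch of a text-conditioned segmentation head operating on LDM latents (z-space): concatenate the latent with a broadcast text embedding and predict mask logits at latent resolution. Channel sizes follow the common 4x64x64 Stable Diffusion latent but are assumptions here; this is not the LD-ZNet architecture.

```python
import torch
import torch.nn as nn

class ZSpaceSegHead(nn.Module):
    """Tiny segmentation head on LDM latents plus a pooled text embedding
    (illustrative sketch of text-conditioned segmentation in z-space)."""
    def __init__(self, z_ch=4, txt_dim=768, hidden=64):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(z_ch + hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 1),                      # per-latent-pixel mask logits
        )

    def forward(self, z, text_emb):
        t = self.txt_proj(text_emb)[:, :, None, None].expand(-1, -1, *z.shape[2:])
        return self.net(torch.cat([z, t], dim=1))

head = ZSpaceSegHead()
logits = head(torch.randn(2, 4, 64, 64), torch.randn(2, 768))
print(logits.shape)    # (2, 1, 64, 64) mask logits at latent resolution
```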
https://arxiv.org/abs/2303.12343