Text-to-Image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images based on open-domain text descriptions. However, despite their success, text descriptions often struggle to adequately convey detailed controls, even when composed of long and complex texts. Moreover, recent studies have also shown that these models face challenges in understanding such complex texts and generating the corresponding images. Therefore, there is a growing need to enable more control modes beyond text description. In this paper, we introduce Uni-ControlNet, a novel approach that allows for the simultaneous utilization of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within one model. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to some dedicated adapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning costs and model size, making it more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality, and composability. Code is available at \url{this https URL}.
https://arxiv.org/abs/2305.16322
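Conceptually, Uni-ControlNet keeps the pretrained text-to-image UNet frozen and trains only two small condition adapters, one for stacked local condition maps and one for a global image embedding. The PyTorch sketch below illustrates that general pattern only; the module names, channel sizes, and injection points are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class LocalAdapter(nn.Module):
    """Fuses stacked local condition maps (edges, depth, masks, ...) into one
    feature that could be added to the frozen UNet's early features. (Illustrative.)"""
    def __init__(self, n_cond_channels: int, feat_dim: int = 320):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(n_cond_channels, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1),
        )

    def forward(self, cond_maps: torch.Tensor) -> torch.Tensor:
        return self.encoder(cond_maps)  # residual feature to inject

class GlobalAdapter(nn.Module):
    """Projects a CLIP image embedding into extra tokens appended to the
    text-conditioning sequence used by cross-attention. (Illustrative.)"""
    def __init__(self, clip_dim: int = 768, ctx_dim: int = 768, n_tokens: int = 4):
        super().__init__()
        self.proj = nn.Linear(clip_dim, ctx_dim * n_tokens)
        self.n_tokens, self.ctx_dim = n_tokens, ctx_dim

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        b = clip_emb.shape[0]
        return self.proj(clip_emb).view(b, self.n_tokens, self.ctx_dim)

# Only the two adapters receive gradients; the pretrained UNet stays frozen, e.g.
# for p in pretrained_unet.parameters(): p.requires_grad_(False)
```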
Decomposing an object's appearance into representations of its materials and the surrounding illumination is difficult, even when the object's 3D shape is known beforehand. This problem is ill-conditioned because diffuse materials severely blur incoming light, and is ill-posed because diffuse materials under high-frequency lighting can be indistinguishable from shiny materials under low-frequency lighting. We show that it is possible to recover precise materials and illumination -- even from diffuse objects -- by exploiting unintended shadows, like the ones cast onto an object by the photographer who moves around it. These shadows are a nuisance in most previous inverse rendering pipelines, but here we exploit them as signals that improve conditioning and help resolve material-lighting ambiguities. We present a method based on differentiable Monte Carlo ray tracing that uses images of an object to jointly recover its spatially-varying materials, the surrounding illumination environment, and the shapes of the unseen light occluders that inadvertently cast shadows upon it.
https://arxiv.org/abs/2305.16321
This paper reveals that every image can be understood as a first-order norm+linear autoregressive process, referred to as FINOLA, where norm+linear denotes the use of normalization before the linear model. We demonstrate that images of size 256$\times$256 can be reconstructed from a compressed vector using autoregression up to a 16$\times$16 feature map, followed by upsampling and convolution. This discovery sheds light on the underlying partial differential equations (PDEs) governing the latent feature space. Additionally, we investigate the application of FINOLA for self-supervised learning through a simple masked prediction technique. By encoding a single unmasked quadrant block, we can autoregressively predict the surrounding masked region. Remarkably, this pre-trained representation proves effective for image classification and object detection tasks, even in lightweight networks, without requiring fine-tuning. The code will be made publicly available.
https://arxiv.org/abs/2305.16319
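FINOLA's "first-order norm+linear" recurrence can be pictured as repeatedly normalizing the previous position's feature and applying a learned linear map while stepping across the feature map, starting from a single compressed vector. The toy sketch below shows one plausible two-direction rollout; the paper's exact propagation scheme and dimensions are assumptions here.

```python
import torch
import torch.nn as nn

class Finola2D(nn.Module):
    """Toy first-order norm+linear autoregression: a single C-dim vector q is
    expanded into an H x W feature map, one step at a time."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.right = nn.Linear(dim, dim, bias=False)  # horizontal step
        self.down = nn.Linear(dim, dim, bias=False)   # vertical step

    def forward(self, q: torch.Tensor, h: int = 16, w: int = 16) -> torch.Tensor:
        # Fill the first row left-to-right, then extend every column downward.
        row = [q]
        for _ in range(1, w):
            row.append(self.right(self.norm(row[-1])))
        rows = [torch.stack(row, dim=1)]              # (B, W, C)
        for _ in range(1, h):
            rows.append(self.down(self.norm(rows[-1])))
        feat = torch.stack(rows, dim=1)               # (B, H, W, C)
        return feat.permute(0, 3, 1, 2)               # (B, C, H, W)

x = Finola2D()(torch.randn(2, 256))                   # -> (2, 256, 16, 16)
```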
For computer vision tasks, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs remain sensitive to small shifts in the input image. To address this, we introduce novel designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve truly shift-equivariant ViTs on four well-established models, namely, Swin, SwinV2, MViTv2, and CvT, both in theory and practice. Empirically, we tested these models on image classification and semantic segmentation, achieving competitive performance across three different datasets while maintaining 100% shift consistency.
https://arxiv.org/abs/2305.16316
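The 100% shift-consistency claim can be probed empirically with circular shifts: for classification, the predicted class of a shifted image should match that of the original. A minimal check, assuming a classifier that maps B×C×H×W images to class logits, is sketched below.

```python
import torch

@torch.no_grad()
def shift_consistency(model, images: torch.Tensor, dx: int = 7, dy: int = 3) -> float:
    """Fraction of samples whose predicted class is unchanged under a
    circular (dx, dy) shift of the input."""
    model.eval()
    pred = model(images).argmax(dim=-1)
    shifted = torch.roll(images, shifts=(dy, dx), dims=(-2, -1))
    pred_shifted = model(shifted).argmax(dim=-1)
    return (pred == pred_shifted).float().mean().item()
```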
We propose Neural 3D Articulation Prior (NAP), the first 3D deep generative model to synthesize 3D articulated object models. Despite the extensive research on generating 3D objects, compositions, or scenes, there remains a lack of focus on capturing the distribution of articulated objects, a common object category for human and robot interaction. To generate articulated objects, we first design a novel articulation tree/graph parameterization and then apply a diffusion-denoising probabilistic model over this representation where articulated objects can be generated via denoising from random complete graphs. In order to capture both the geometry and the motion structure whose distribution will affect each other, we design a graph-attention denoising network for learning the reverse diffusion process. We propose a novel distance that adapts widely used 3D generation metrics to our novel task to evaluate generation quality, and experiments demonstrate our high performance in articulated object generation. We also demonstrate several conditioned generation applications, including Part2Motion, PartNet-Imagination, Motion2Part, and GAPart2Object.
https://arxiv.org/abs/2305.16315
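The articulation tree/graph parameterization can be thought of as per-part geometry codes on the nodes and joint parameters on the edges, which is the object the diffusion model denoises. The container below is only a hypothetical illustration of such a representation; all field names and shapes are assumptions rather than the paper's parameterization.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Joint:
    parent: int                  # index of the parent part
    child: int                   # index of the child part
    joint_type: str              # e.g. "revolute" or "prismatic"
    axis: np.ndarray             # 3D joint axis (unit vector)
    origin: np.ndarray           # 3D pivot / anchor point
    limits: tuple = (0.0, 1.0)   # motion range

@dataclass
class ArticulatedObjectGraph:
    part_latents: np.ndarray     # (num_parts, latent_dim) geometry codes
    part_bboxes: np.ndarray      # (num_parts, 6) part bounding-box parameters
    joints: List[Joint] = field(default_factory=list)
```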
Equivariance has gained strong interest as a desirable network property that inherently ensures robust generalization. However, when dealing with complex systems such as articulated objects or multi-object scenes, effectively capturing inter-part transformations poses a challenge, as it becomes entangled with the overall structure and local transformations. The interdependence of part assignment and per-part group action necessitates a novel equivariance formulation that allows for their co-evolution. In this paper, we present Banana, a Banach fixed-point network for equivariant segmentation with inter-part equivariance by construction. Our key insight is to iteratively solve a fixed-point problem, where point-part assignment labels and per-part SE(3)-equivariance co-evolve simultaneously. We provide theoretical derivations of both per-step equivariance and global convergence, which induces an equivariant final convergent state. Our formulation naturally provides a strict definition of inter-part equivariance that generalizes to unseen inter-part configurations. Through experiments conducted on both articulated objects and multi-object scans, we demonstrate the efficacy of our approach in achieving strong generalization under inter-part transformations, even when confronted with substantial changes in pointcloud geometry and topology.
https://arxiv.org/abs/2305.16314
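The fixed-point view can be illustrated with a deliberately simplified, translation-only toy: alternate between estimating per-part "frames" (here just centroids) from the current assignments and re-assigning points given those frames, until the pair converges. This omits the SE(3)-equivariant networks that do the real work in the paper.

```python
import torch

def fixed_point_segmentation(points: torch.Tensor, k: int = 4, iters: int = 20,
                             tau: float = 0.1) -> torch.Tensor:
    """points: (N, 3). Returns soft point-to-part assignments of shape (N, k).
    Alternates: assignments -> per-part centroids ('frames') -> assignments."""
    n = points.shape[0]
    assign = torch.softmax(torch.randn(n, k), dim=-1)      # random initialization
    for _ in range(iters):
        weights = assign / (assign.sum(dim=0, keepdim=True) + 1e-8)
        centroids = weights.t() @ points                    # (k, 3) per-part centers
        dist = torch.cdist(points, centroids)               # (N, k)
        assign = torch.softmax(-dist / tau, dim=-1)         # re-assign given frames
    return assign
```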
Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method. Project page is available at: this https URL
https://arxiv.org/abs/2305.16311
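A masked diffusion loss of the kind mentioned above restricts the standard noise-prediction objective to the region covered by a concept's mask, so each handle is supervised only where its concept appears. A hedged sketch of that term (the cross-attention loss and the paper's exact weighting are not reproduced):

```python
import torch

def masked_diffusion_loss(noise_pred: torch.Tensor, noise: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """noise_pred, noise: (B, C, H, W); mask: (B, 1, H, W) in {0, 1},
    downsampled to the latent resolution. MSE is computed only inside the mask."""
    sq_err = (noise_pred - noise) ** 2 * mask
    return sq_err.sum() / (mask.sum() * noise.shape[1] + 1e-8)
```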
Recent advances in deep generative models have led to the development of methods capable of synthesizing high-quality, realistic images. These models pose threats to society due to their potential misuse. Prior research attempted to mitigate these threats by detecting generated images, but the varying traces left by different generative models make it challenging to create a universal detector capable of generalizing to new, unseen generative models. In this paper, we propose to inject a universal adversarial signature into an arbitrary pre-trained generative model, in order to make its generated contents more detectable and traceable. First, the imperceptible optimal signature for each image can be found by a signature injector through adversarial training. Subsequently, the signature can be incorporated into an arbitrary generator by fine-tuning it with the images processed by the signature injector. In this way, the detector corresponding to the signature can be reused for any fine-tuned generator for tracking the generator identity. The proposed method is validated on the FFHQ and ImageNet datasets with various state-of-the-art generative models, consistently showing a promising detection rate. Code will be made publicly available at \url{this https URL}.
https://arxiv.org/abs/2305.16310
Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
https://arxiv.org/abs/2305.16304
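The two-stage scheme boils down to cheap cosine-similarity pruning over precomputed candidate embeddings, followed by expensive triplet scoring on the shortlist only. The sketch below is schematic, with placeholder encoders and scorer; all function names and interfaces are assumptions.

```python
import torch

def two_stage_retrieval(query_emb: torch.Tensor,        # (D,) fused reference-image+text embedding
                        candidate_embs: torch.Tensor,   # (N, D) precomputed corpus embeddings
                        rerank_score_fn,                # returns a scalar tensor for a (ref, text, idx) triplet
                        reference, text,
                        top_k: int = 100) -> torch.Tensor:
    # Stage 1: fast pruning by cosine similarity in the shared embedding space.
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    c = torch.nn.functional.normalize(candidate_embs, dim=-1)
    coarse = c @ q                                       # (N,)
    shortlist = coarse.topk(min(top_k, c.shape[0])).indices

    # Stage 2: expensive triplet scoring only on the shortlist, then re-rank.
    scores = torch.stack([rerank_score_fn(reference, text, int(i)) for i in shortlist])
    order = scores.argsort(descending=True)
    return shortlist[order]                              # candidate indices, best first
```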
Controllable scene synthesis aims to create interactive environments for various industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to commonsense. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Due to lacking a scene graph dataset offering high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Codes and the dataset will be released upon acceptance.
https://arxiv.org/abs/2305.16283
In recent years, Denoising Diffusion Probabilistic Models (DDPM) have caught significant attention. By composing a Markovian process that starts in the data domain and then gradually adds noise until reaching pure white noise, they achieve superior performance in learning data distributions. Yet, these models require a large number of diffusion steps to produce aesthetically pleasing samples, which is inefficient. In addition, unlike common generative adversarial networks, the latent space of diffusion models is not interpretable. In this work, we propose to generalize the denoising diffusion process into an Upsampling Diffusion Probabilistic Model (UDPM), in which we reduce the latent variable dimension in addition to the traditional noise level addition. As a result, we are able to sample images of size $256\times 256$ with only 7 diffusion steps, roughly two orders of magnitude fewer than standard DDPMs require. We formally develop the Markovian diffusion processes of the UDPM, and demonstrate its generation capabilities on the popular FFHQ, LSUN horses, ImageNet, and AFHQv2 datasets. Another favorable property of UDPM is that it is very easy to interpolate its latent space, which is not the case with standard diffusion models. Our code is available online at \url{this https URL}.
https://arxiv.org/abs/2305.16269
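The defining property of UDPM is that each reverse step raises the latent's spatial resolution in addition to denoising it, which is how a handful of steps can reach a full-resolution image. The loop below is only a schematic of that idea, with a placeholder denoiser and a fixed ×2 upsampling per step; the paper's actual schedule and parameterization differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def udpm_style_sampling(denoiser, start_res: int = 2, channels: int = 3,
                        steps: int = 7, batch: int = 1) -> torch.Tensor:
    """Schematic: start from low-resolution noise; each reverse step upsamples x2
    and applies a learned denoiser conditioned on the step index."""
    x = torch.randn(batch, channels, start_res, start_res)
    for t in reversed(range(steps)):
        x = F.interpolate(x, scale_factor=2, mode="nearest")  # dimension-raising part
        x = denoiser(x, t)                                     # denoising part (placeholder)
    return x

# With an identity placeholder denoiser: 2 * 2**7 = 256 in this toy schedule.
sample = udpm_style_sampling(lambda x, t: x)                   # -> (1, 3, 256, 256)
```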
We propose a new class of generative models that naturally handle data of varying dimensionality by jointly modeling the state and dimension of each datapoint. The generative process is formulated as a jump diffusion process that makes jumps between different dimensional spaces. We first define a dimension destroying forward noising process, before deriving the dimension creating time-reversed generative process along with a novel evidence lower bound training objective for learning to approximate it. Simulating our learned approximation to the time-reversed generative process then provides an effective way of sampling data of varying dimensionality by jointly generating state values and dimensions. We demonstrate our approach on molecular and video datasets of varying dimensionality, reporting better compatibility with test-time diffusion guidance imputation tasks and improved interpolation capabilities versus fixed dimensional models that generate state values and dimensions separately.
https://arxiv.org/abs/2305.16261
This paper investigates the potential of enhancing Neural Radiance Fields (NeRF) with semantics to expand their applications. Although NeRF has been proven useful in real-world applications like VR and digital creation, the lack of semantics hinders interaction with objects in complex scenes. We propose to imitate the backbone feature of off-the-shelf perception models to achieve zero-shot semantic segmentation with NeRF. Our framework reformulates the segmentation process by directly rendering semantic features and only applying the decoder from perception models. This eliminates the need for expensive backbones and benefits 3D consistency. Furthermore, we can project the learned semantics onto extracted mesh surfaces for real-time interaction. With the state-of-the-art Segment Anything Model (SAM), our framework accelerates segmentation by 16 times with comparable mask quality. The experimental results demonstrate the efficacy and computational advantages of our approach. Project page: \url{https://me.kiui.moe/san/}.
https://arxiv.org/abs/2305.16233
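One way to read "imitate the backbone feature of off-the-shelf perception models" is as per-pixel feature distillation: the radiance field renders a feature vector per ray and is trained to match the frozen 2D backbone's feature at the corresponding pixel, after which only the perception model's decoder is needed at inference. A hedged loss sketch with placeholder callables:

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(render_features,                  # callable: rays (R, 6) -> (R, D)
                              teacher_features: torch.Tensor,   # (R, D) backbone features at those pixels
                              rays: torch.Tensor) -> torch.Tensor:
    """Match volume-rendered features to frozen 2D backbone features (L2 + cosine)."""
    pred = render_features(rays)
    l2 = F.mse_loss(pred, teacher_features)
    cos = 1.0 - F.cosine_similarity(pred, teacher_features, dim=-1).mean()
    return l2 + cos
```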
Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes like material, style, layout, etc. remains a challenge, leading to a lack of disentanglement and editability. To address this, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low- to high-frequency information, providing a new perspective on representing, generating, and editing images. We develop Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer stronger disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image/text-guided material/style/layout transfer/editing, achieving previously unattainable results with a single image input without fine-tuning the diffusion models.
https://arxiv.org/abs/2305.16225
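ProSpect attaches a different learned prompt embedding to each stage of the denoising trajectory, so the main bookkeeping is mapping a timestep to its stage and fetching that stage's embedding. The helper below illustrates this under assumed values (10 stages, 1000 training timesteps, CLIP-sized embeddings); the paper's exact split may differ.

```python
import torch

def select_stage_embedding(t: int, stage_embeddings: torch.Tensor,
                           num_train_timesteps: int = 1000) -> torch.Tensor:
    """stage_embeddings: (num_stages, tokens, dim), one learned prompt per stage.
    Denoising runs from t = num_train_timesteps-1 (low-frequency content) down to 0,
    so early stages use the first embeddings and late stages the last ones."""
    num_stages = stage_embeddings.shape[0]
    steps_per_stage = num_train_timesteps // num_stages
    stage = min((num_train_timesteps - 1 - t) // steps_per_stage, num_stages - 1)
    return stage_embeddings[stage]

embeds = torch.randn(10, 77, 768)                               # 10 stages of prompt embeddings
cond = select_stage_embedding(t=999, stage_embeddings=embeds)   # first (coarsest) stage
```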
Text-to-image (T2I) research has grown explosively in the past year, owing to the large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: text prompt engineering, where searching for high-quality text prompts that yield customized results is more art than science. Moreover, as commonly argued: "an image is worth a thousand words" - the attempt to describe a desired image with texts often ends up being ambiguous and cannot comprehensively cover delicate visual details, hence necessitating more additional controls from the visual domain. In this paper, we take a bold step forward: taking "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users. Our proposed framework, Prompt-Free Diffusion, relies on only visual inputs to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scene is Semantic Context Encoder (SeeCoder), substituting the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can also pre-train a SeeCoder in one T2I model and reuse it for another. Through extensive experiments, Prompt-Free Diffusion is experimentally found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts following the best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models are open-sourced at this https URL.
https://arxiv.org/abs/2305.16223
Recent advancements in the acquisition of various brain data sources have created new opportunities for integrating multimodal brain data to assist in early detection of complex brain disorders. However, current data integration approaches typically need a complete set of biomedical data modalities, which may not always be feasible, as some modalities are only available in large-scale research cohorts and are prohibitively costly to collect in routine clinical practice. Especially in studies of brain diseases, research cohorts may include both neuroimaging data and genetic data, but for practical clinical diagnosis, we often need to make disease predictions only based on neuroimages. As a result, it is desired to design machine learning models which can use all available data (different data could provide complementary information) during training but conduct inference using only the most common data modality. We propose a new incomplete multimodal data integration approach that employs transformers and generative adversarial networks to effectively exploit auxiliary modalities available during training in order to improve the performance of a unimodal model at inference. We apply our new method to predict cognitive degeneration and disease outcomes using the multimodal imaging genetic data from Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Experimental results demonstrate that our approach outperforms the related machine learning and deep learning methods by a significant margin.
https://arxiv.org/abs/2305.16222
Segment anything model (SAM) has presented impressive objectness identification capability with the idea of prompt learning and a new collected large-scale dataset. Given a prompt (e.g., points, bounding boxes, or masks) and an input image, SAM is able to generate valid segment masks for all objects indicated by the prompts, presenting high generalization across diverse scenarios and being a general method for zero-shot transfer to downstream vision tasks. Nevertheless, it remains unclear whether SAM may introduce errors in certain threatening scenarios. Clarifying this is of significant importance for applications that require robustness, such as autonomous vehicles. In this paper, we aim to study the testing-time robustness of SAM under adversarial scenarios and common corruptions. To this end, we first build a testing-time robustness evaluation benchmark for SAM by integrating existing public datasets. Second, we extend representative adversarial attacks against SAM and study the influence of different prompts on robustness. Third, we study the robustness of SAM under diverse corruption types by evaluating SAM on corrupted datasets with different prompts. With experiments conducted on SA-1B and KITTI datasets, we find that SAM exhibits remarkable robustness against various corruptions, except for blur-related corruption. Furthermore, SAM remains susceptible to adversarial attacks, particularly when subjected to PGD and BIM attacks. We think such a comprehensive study could highlight the importance of the robustness issues of SAM and trigger a series of new tasks for SAM as well as downstream vision tasks.
https://arxiv.org/abs/2305.16220
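The PGD attacks referenced above follow the standard projected-gradient recipe. A generic L∞ PGD sketch against any differentiable model is shown below; how SAM's prompts and mask outputs enter the objective is abstracted into `loss_fn`, which is an assumption of this sketch.

```python
import torch

def pgd_attack(model, images: torch.Tensor, loss_fn,
               eps: float = 8 / 255, alpha: float = 2 / 255, steps: int = 10) -> torch.Tensor:
    """Standard L_inf PGD: iteratively ascend the loss, then project back into the eps-ball."""
    adv = (images.clone().detach() + torch.empty_like(images).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv))
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = images + (adv - images).clamp(-eps, eps)   # project into the eps-ball
        adv = adv.clamp(0, 1).detach()
    return adv
```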
Semi-supervised medical image segmentation offers a promising solution for large-scale medical image analysis by significantly reducing the annotation burden while achieving comparable performance. This approach shows strong potential for streamlining the segmentation process and making it more feasible in clinical settings during translational investigations. Recently, cross-supervised training based on different co-training sub-networks has become a standard paradigm for this task. Still, the critical issues of sub-network disagreement and label-noise suppression require further attention and progress in cross-supervised training. This paper proposes a cross-supervised learning framework based on dual classifiers (DC-Net), including an evidential classifier and a vanilla classifier. The two classifiers exhibit complementary characteristics, enabling them to handle disagreement effectively and generate more robust and accurate pseudo-labels for unlabeled data. We also incorporate the uncertainty estimation from the evidential classifier into cross-supervised training to alleviate the negative effect of the error supervision signal. The extensive experiments on LA and Pancreas-CT dataset illustrate that DC-Net outperforms other state-of-the-art methods for semi-supervised segmentation. The code will be released soon.
https://arxiv.org/abs/2305.16216
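An evidential classifier in this sense outputs non-negative per-class evidence that induces a Dirichlet distribution, whose spread yields a per-pixel uncertainty usable for down-weighting noisy pseudo-labels. A minimal sketch of those quantities (DC-Net's actual loss terms are not reproduced):

```python
import torch
import torch.nn.functional as F

def evidential_outputs(logits: torch.Tensor):
    """logits: (B, K, H, W). Returns (expected class probabilities, uncertainty in (0, 1])."""
    evidence = F.softplus(logits)            # non-negative evidence per class
    alpha = evidence + 1.0                   # Dirichlet concentration parameters
    strength = alpha.sum(dim=1, keepdim=True)
    probs = alpha / strength                 # expected class probabilities
    k = logits.shape[1]
    uncertainty = k / strength               # high wherever total evidence is low
    return probs, uncertainty

# Example use: weight a pseudo-label cross-entropy term by (1 - uncertainty).
```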
Consistency learning plays a crucial role in semi-supervised medical image segmentation as it enables the effective utilization of limited annotated data while leveraging the abundance of unannotated data. The effectiveness and efficiency of consistency learning are challenged by prediction diversity and training stability, which are often overlooked by existing studies. Meanwhile, the limited quantity of labeled data for training often proves inadequate for formulating intra-class compactness and inter-class discrepancy of pseudo labels. To address these issues, we propose a self-aware and cross-sample prototypical learning method (SCP-Net) to enhance the diversity of prediction in consistency learning by utilizing a broader range of semantic information derived from multiple inputs. Furthermore, we introduce a self-aware consistency learning method that exploits unlabeled data to improve the compactness of pseudo labels within each class. Moreover, a dual loss re-weighting method is integrated into the cross-sample prototypical consistency learning method to improve the reliability and stability of our model. Extensive experiments on ACDC dataset and PROMISE12 dataset validate that SCP-Net outperforms other state-of-the-art semi-supervised segmentation methods and achieves significant performance gains compared to the limited supervised training. Our code will come soon.
https://arxiv.org/abs/2305.16214
Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but suffers from over-saturation, over-smoothing, and low-diversity problems. In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS and present variational score distillation (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large CFG weights. In comparison, VSD works well with various CFG weights as ancestral sampling from diffusion models and simultaneously improves the diversity and sample quality with a common CFG weight (i.e., $7.5$). We further present various improvements in the design space for text-to-3D such as distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed ProlificDreamer, can generate high rendering resolution (i.e., $512\times512$) and high-fidelity NeRF with rich structure and complex effects (e.g., smoke and drops). Further, initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and photo-realistic. Project page: this https URL
https://arxiv.org/abs/2305.16213
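For reference, the SDS gradient that VSD generalizes, and the VSD gradient as it is commonly written. The notation (weighting $w(t)$, pretrained score $\epsilon_{\mathrm{pretrain}}$, auxiliary LoRA-estimated score $\epsilon_\phi$) follows the usual convention for these methods rather than quoting the paper verbatim.

```latex
% SDS: the pretrained score is pulled toward the injected Gaussian noise.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
    \big(\epsilon_{\mathrm{pretrain}}(x_t; y, t) - \epsilon\big)
    \frac{\partial x}{\partial \theta} \right]

% VSD: the Gaussian noise is replaced by the score of the distribution of rendered
% images, estimated by an auxiliary (e.g. LoRA-tuned) network \epsilon_\phi.
\nabla_\theta \mathcal{L}_{\mathrm{VSD}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
    \big(\epsilon_{\mathrm{pretrain}}(x_t; y, t) - \epsilon_{\phi}(x_t; y, t, c)\big)
    \frac{\partial x}{\partial \theta} \right]
```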