Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. Using caption formats ranging from Short Synthetic Captions (SSC) to Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.
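The hybrid finding (keep both AltTexts and synthetic captions rather than replacing one with the other) can be mimicked in a pre-training data loader by sampling one of the available texts per image. A minimal Python sketch, assuming hypothetical record fields `alt_text`, `ssc`, and `dsc_plus`; the mixing probabilities are illustrative, not the paper's actual recipe.

```python
import random

def pick_caption(sample: dict, p_alt: float = 0.5, p_ssc: float = 0.25) -> str:
    """Return one text per image, mixing AltText with synthetic captions.

    `sample` is assumed to carry hypothetical fields 'alt_text', 'ssc', and
    'dsc_plus'; the probabilities are illustrative, not the paper's mixture.
    """
    r = random.random()
    if r < p_alt:
        return sample["alt_text"]       # original web-crawled AltText
    if r < p_alt + p_ssc:
        return sample["ssc"]            # Short Synthetic Caption
    return sample["dsc_plus"]           # Dense Synthetic Caption (DSC+)

# Usage:
# text = pick_caption({"alt_text": "...", "ssc": "...", "dsc_plus": "..."})
```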
https://arxiv.org/abs/2410.02740
Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers rely on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and lacks flexibility. Recent red-teaming attack research further underscores the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical and safety standards with little to no impact on usability. SteerDiff identifies and manipulates inappropriate concepts within the text embedding space to guide the model away from harmful outputs. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach. Furthermore, we benchmark SteerDiff against multiple red-teaming strategies to assess its robustness. Finally, we explore the potential of SteerDiff for concept forgetting tasks, demonstrating its versatility in text-conditioned image generation.
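One plausible reading of "identifies and manipulates inappropriate concepts within the text embedding space" is a projection that removes a learned concept direction from the prompt embedding before it reaches the diffusion model. The sketch below assumes that projective mechanism and a concept direction supplied by a trained adaptor; it is not SteerDiff's actual module.

```python
import torch
import torch.nn.functional as F

def steer_embedding(text_emb: torch.Tensor,
                    concept_dir: torch.Tensor,
                    strength: float = 1.0) -> torch.Tensor:
    """Push prompt token embeddings away from an unsafe-concept direction.

    text_emb:    (seq_len, dim) prompt token embeddings.
    concept_dir: (dim,) direction for the concept to suppress (assumed to be
                 produced by a trained lightweight adaptor).
    """
    d = F.normalize(concept_dir, dim=0)
    coeff = text_emb @ d                            # component along the concept
    # Remove (a fraction of) that component; the rest of the embedding is left
    # untouched, which is what keeps benign prompts usable.
    return text_emb - strength * coeff.unsqueeze(-1) * d
```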
https://arxiv.org/abs/2410.02710
Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. Although a natural approach, inspired by advancements in Large Language Models, is to tokenize control images into tokens and prefill them into the autoregressive model before decoding image tokens, it still falls short in generation quality compared to ControlNet and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. Firstly, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control tokens. Then ControlAR exploits the conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens, similar to positional encodings. Compared to prefilling tokens, conditional decoding significantly strengthens the control capability of AR models while also maintaining their efficiency. Furthermore, the proposed ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of the proposed ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Furthermore, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and demo will soon be available at this https URL.
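The conditional decoding step, "per-token fusion between control and image tokens, similar to positional encodings", suggests an additive fusion at each position. A minimal sketch under that assumption (same embedding width, aligned token grids); ControlAR's real fusion operator and control encoder may differ.

```python
import torch
import torch.nn as nn

class PerTokenControlFusion(nn.Module):
    """Fuse control tokens into image-token embeddings, position by position."""

    def __init__(self, dim: int):
        super().__init__()
        # Lightweight projection of control tokens into the decoder width
        # (an assumed design; the paper's control encoder is a separate module).
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_tok_emb: torch.Tensor,
                control_tok_emb: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, num_tokens, dim), aligned one-to-one per position.
        assert image_tok_emb.shape == control_tok_emb.shape
        # Added like a positional encoding; the fused sequence is then decoded
        # autoregressively to predict the next image token.
        return image_tok_emb + self.proj(control_tok_emb)
```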
https://arxiv.org/abs/2410.02705
While recent research increasingly showcases the remarkable capabilities of Large Language Models (LLMs), it is vital to confront their hidden pitfalls. Among these challenges, the issue of memorization stands out, posing significant ethical and legal risks. In this paper, we present a Systematization of Knowledge (SoK) on the topic of memorization in LLMs. Memorization is the effect whereby a model tends to store and reproduce phrases or passages from its training data, and it has been shown to be fundamental to various privacy and security attacks against LLMs. We begin by providing an overview of the literature on memorization, exploring it across five key dimensions: intentionality, degree, retrievability, abstraction, and transparency. Next, we discuss the metrics and methods used to measure memorization, followed by an analysis of the factors that contribute to the memorization phenomenon. We then examine how memorization manifests itself in specific model architectures and explore strategies for mitigating these effects. We conclude our overview by identifying potential research topics for the near future: developing methods for balancing performance and privacy in LLMs, and analyzing memorization in specific contexts, including conversational agents, retrieval-augmented generation, multilingual language models, and diffusion language models.
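The "retrievability" dimension is often operationalized as discoverable extraction: prompt the model with a prefix of a training document and test whether it reproduces the true continuation verbatim. A minimal sketch with an assumed `generate_fn(prompt, max_new_chars)` interface wrapping whatever model is under test; the surveyed metrics are more refined than this.

```python
def is_memorized(generate_fn, document: str,
                 prefix_len: int = 200, suffix_len: int = 50) -> bool:
    """Discoverable-extraction check: does the model complete a training
    document verbatim when prompted with its prefix?

    generate_fn(prompt, max_new_chars) -> str is an assumed interface wrapping
    whatever model is under test (local checkpoint or API).
    """
    prefix = document[:prefix_len]
    true_suffix = document[prefix_len:prefix_len + suffix_len]
    completion = generate_fn(prefix, max_new_chars=suffix_len)
    return completion.startswith(true_suffix)

# Averaging is_memorized over a sample of training documents gives a simple
# memorization rate; the surveyed metrics refine this (approximate matches,
# varying prefix lengths, per-example counting, and so on).
```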
https://arxiv.org/abs/2410.02650
Diffusion-based extreme image compression methods have achieved impressive performance at extremely low bitrates. However, constrained by the iterative denoising process that starts from pure noise, these methods are limited in both fidelity and efficiency. To address these two issues, we present Relay Residual Diffusion Extreme Image Compression (RDEIC), which leverages compressed feature initialization and residual diffusion. Specifically, we first use the compressed latent features of the image with added noise, instead of pure noise, as the starting point to eliminate the unnecessary initial stages of the denoising process. Second, we design a novel relay residual diffusion that reconstructs the raw image by iteratively removing the added noise and the residual between the compressed and target latent features. Notably, our relay residual diffusion network seamlessly integrates pre-trained stable diffusion to leverage its robust generative capability for high-quality reconstruction. Third, we propose a fixed-step fine-tuning strategy to eliminate the discrepancy between the training and inference phases, further improving the reconstruction quality. Extensive experiments demonstrate that the proposed RDEIC achieves state-of-the-art visual quality and outperforms existing diffusion-based extreme image compression methods in both fidelity and efficiency. The source code will be provided at this https URL.
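The compressed-feature initialization can be written down with the standard forward-diffusion reparameterization: instead of pure noise at step T, start from the compressed latent with noise added for an intermediate step. A sketch in the usual alpha-bar notation; the relay residual schedule itself is RDEIC-specific and not reproduced here.

```python
import torch

def noisy_start(z_compressed: torch.Tensor,
                alphas_cumprod: torch.Tensor,
                t_start: int) -> torch.Tensor:
    """Denoising starting point built from compressed latent features.

    z_compressed:   compressed latent of the image (decoder side).
    alphas_cumprod: (T,) cumulative products of the diffusion alphas.
    t_start:        intermediate timestep to start from (t_start < T), so the
                    early, uninformative denoising steps are skipped.
    """
    a_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(z_compressed)
    # Standard forward-diffusion reparameterization applied to the compressed
    # latent instead of a clean latent: x_t = sqrt(a_bar)*z + sqrt(1-a_bar)*eps.
    return a_bar.sqrt() * z_compressed + (1.0 - a_bar).sqrt() * noise
```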
https://arxiv.org/abs/2410.02640
Deformable image registration is crucial for aligning medical images in a non-linear fashion across different modalities, allowing for precise spatial correspondence between varying anatomical structures. This paper presents NestedMorph, a novel network utilizing a Nested Attention Fusion approach to improve intra-subject deformable registration between T1-weighted (T1w) MRI and diffusion MRI (dMRI) data. NestedMorph integrates high-resolution spatial details from an encoder with semantic information from a decoder using a multi-scale framework, enhancing both local and global feature extraction. Our model notably outperforms existing methods, including CNN-based approaches like VoxelMorph, MIDIR, and CycleMorph, as well as Transformer-based models such as TransMorph and ViT-V-Net, and traditional techniques like NiftyReg and SyN. Evaluations on the HCP dataset demonstrate that NestedMorph achieves superior performance across key metrics, including SSIM, HD95, and SDlogJ, with the highest SSIM of 0.89, and the lowest HD95 of 2.5 and SDlogJ of 0.22. These results highlight NestedMorph's ability to capture both local and global image features effectively, leading to superior registration performance. The promising outcomes of this study underscore NestedMorph's potential to significantly advance deformable medical image registration, providing a robust framework for future research and clinical applications. The source code and our implementation are available at: this https URL
https://arxiv.org/abs/2410.02550
Data augmentation, a cornerstone technique in deep learning, is crucial in enhancing model performance, especially with scarce labeled data. While traditional techniques are effective, their reliance on hand-crafted methods limits their applicability across diverse data types and tasks. Although modern learnable augmentation methods offer increased adaptability, they are computationally expensive and challenging to incorporate within prevalent augmentation workflows. In this work, we present a novel, efficient method for data augmentation, effectively bridging the gap between existing augmentation strategies and emerging datasets and learning tasks. We introduce SAFLEX (Self-Adaptive Augmentation via Feature Label EXtrapolation), which learns the sample weights and soft labels of augmented samples provided by any given upstream augmentation pipeline, using a specifically designed efficient bilevel optimization algorithm. Remarkably, SAFLEX effectively reduces the noise and label errors of the upstream augmentation pipeline with a marginal computational cost. As a versatile module, SAFLEX excels across diverse datasets, including natural and medical images and tabular data, showcasing its prowess in few-shot learning and out-of-distribution generalization. SAFLEX seamlessly integrates with common augmentation strategies like RandAug, CutMix, and those from large pre-trained generative models like stable diffusion and is also compatible with frameworks such as CLIP's fine-tuning. Our findings highlight the potential to adapt existing augmentation pipelines for new data types and tasks, signaling a move towards more adaptable and resilient training frameworks.
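The quantities SAFLEX learns for each augmented sample are a weight and a soft label; the sketch below only shows where those quantities enter the training objective, with the bilevel optimization that produces them abstracted away.

```python
import torch
import torch.nn.functional as F

def weighted_soft_label_loss(logits: torch.Tensor,
                             soft_labels: torch.Tensor,
                             sample_weights: torch.Tensor) -> torch.Tensor:
    """Training loss over augmented samples with learned weights and soft labels.

    logits:         (N, C) model outputs on augmented samples.
    soft_labels:    (N, C) learned soft labels (rows sum to 1).
    sample_weights: (N,)   learned non-negative per-sample weights.
    In SAFLEX these labels and weights come out of a bilevel optimization; here
    they are plain tensors, to show where they enter the objective.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    per_sample = -(soft_labels * log_probs).sum(dim=-1)          # soft cross-entropy
    return (sample_weights * per_sample).sum() / sample_weights.sum().clamp(min=1e-8)
```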
https://arxiv.org/abs/2410.02512
Customized Image Generation, generating customized images with user-specified concepts, has raised significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneer works further explored the customization of action and interaction beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effects are limited by insufficient "exactly same" reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the "event" as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims at accurately capturing the complex event and generating customized images with various target entities. To solve this task, we propose a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial feature and self-attention maps from the reference image into the target image for event generation. To further facilitate this new task, we collected two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations have demonstrated the effectiveness of FreeEvent.
https://arxiv.org/abs/2410.02483
As diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI), the study of their memorization of training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn via memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for trustworthy application of GenAI. Existing works revealed that conditional DPMs are more prone to training data memorization than unconditional DPMs, and the data extraction methods motivated by these findings are mostly designed for conditional DPMs. However, these understandings are primarily empirical, and extracting training data from unconditional models has been found to be extremely challenging. In this work, we provide a theoretical understanding of memorization in both conditional and unconditional DPMs under the assumption of model convergence. Our theoretical analysis indicates that extracting data from unconditional models can also be effective by constructing a proper surrogate condition. Based on this result, we propose a novel data extraction method named Surrogate condItional Data Extraction (SIDE) that leverages a time-dependent classifier trained on the generated data as a surrogate condition to extract training data from unconditional DPMs. Empirical results demonstrate that our SIDE can extract training data in challenging scenarios where previous methods fail, and it is, on average, over 50% more effective across different scales of the CelebA dataset.
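Using "a time-dependent classifier trained on the generated data as a surrogate condition" amounts to classifier guidance applied to an unconditional DPM. A sketch of one guided noise prediction, assuming an epsilon-prediction model and a noise level passed in from the sampler's schedule; classifier training and the rest of the extraction pipeline are omitted.

```python
import torch

def side_guided_eps(unet, classifier, x_t, t, target_class, sigma_t, scale=1.0):
    """One classifier-guided noise prediction for an unconditional DPM.

    unet(x_t, t)       -> predicted noise (epsilon) from the unconditional model.
    classifier(x_t, t) -> logits from the surrogate, time-dependent classifier.
    target_class       -> int label acting as the surrogate condition.
    sigma_t            -> sqrt(1 - alpha_bar_t) for the current step.
    """
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(x_in.shape[0]), target_class].sum()
        grad = torch.autograd.grad(selected, x_in)[0]
    eps = unet(x_t, t)
    # Shift the noise prediction along -grad log p(c | x_t), i.e. classifier guidance.
    return eps - scale * sigma_t * grad
```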
https://arxiv.org/abs/2410.02467
Classifier-free guidance (CFG) is crucial for improving both generation quality and alignment between the input condition and final output in diffusion models. While a high guidance scale is generally required to enhance these aspects, it also causes oversaturation and unrealistic artifacts. In this paper, we revisit the CFG update rule and introduce modifications to address this issue. We first decompose the update term in CFG into parallel and orthogonal components with respect to the conditional model prediction and observe that the parallel component primarily causes oversaturation, while the orthogonal component enhances image quality. Accordingly, we propose down-weighting the parallel component to achieve high-quality generations without oversaturation. Additionally, we draw a connection between CFG and gradient ascent and introduce a new rescaling and momentum method for the CFG update rule based on this insight. Our approach, termed adaptive projected guidance (APG), retains the quality-boosting advantages of CFG while enabling the use of higher guidance scales without oversaturation. APG is easy to implement and introduces practically no additional computational overhead to the sampling process. Through extensive experiments, we demonstrate that APG is compatible with various conditional diffusion models and samplers, leading to improved FID, recall, and saturation scores while maintaining precision comparable to CFG, making our method a superior plug-and-play alternative to standard classifier-free guidance.
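The projection step described above is simple to write down: split the CFG update (conditional minus unconditional prediction) into components parallel and orthogonal to the conditional prediction, and down-weight the parallel part. A sketch of just that step, without the paper's rescaling and momentum terms.

```python
import torch

def apg_update(eps_cond: torch.Tensor, eps_uncond: torch.Tensor,
               guidance_scale: float, parallel_weight: float = 0.0) -> torch.Tensor:
    """Adaptive projected guidance reduced to its projection step.

    eps_cond / eps_uncond: conditional and unconditional noise predictions,
    shape (batch, ...). parallel_weight = 1.0 recovers plain CFG; values
    below 1 down-weight the oversaturating parallel component.
    """
    diff = eps_cond - eps_uncond                      # the CFG update term
    flat_cond = eps_cond.flatten(1)
    flat_diff = diff.flatten(1)
    # Per-sample projection of the update onto the conditional prediction.
    coeff = (flat_diff * flat_cond).sum(-1, keepdim=True) / \
            flat_cond.pow(2).sum(-1, keepdim=True).clamp(min=1e-12)
    parallel = (coeff * flat_cond).view_as(diff)
    orthogonal = diff - parallel
    return eps_cond + (guidance_scale - 1.0) * (orthogonal + parallel_weight * parallel)
```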
https://arxiv.org/abs/2410.02416
Safe and successful deployment of robots requires not only the ability to generate complex plans but also the capacity to frequently replan and correct execution errors. This paper addresses the challenge of long-horizon trajectory planning under temporally extended objectives in a receding horizon manner. To this end, we propose DOPPLER, a data-driven hierarchical framework that generates and updates plans based on instructions specified in linear temporal logic (LTL). Our method decomposes temporal tasks into a chain of options with hierarchical reinforcement learning from offline non-expert datasets. It leverages diffusion models to generate options with low-level actions. We devise a determinantal-guided posterior sampling technique during batch generation, which improves the speed and diversity of diffusion-generated options, leading to more efficient querying. Experiments on robot navigation and manipulation tasks demonstrate that DOPPLER can generate sequences of trajectories that progressively satisfy the specified formulae for obstacle avoidance and sequential visitation. Demonstration videos are available online at: this https URL.
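Determinantal guidance is aimed at selecting a diverse subset of the diffusion-generated options in each batch. A common way to do this is greedy MAP inference for a determinantal point process over an option-similarity kernel; the sketch below shows that selection on generic option embeddings and is an assumption about the mechanism, not DOPPLER's exact sampler.

```python
import numpy as np

def greedy_dpp_select(features: np.ndarray, k: int) -> list:
    """Greedily pick k mutually diverse options (approximate DPP MAP inference).

    features: (n, d) embeddings of candidate options (e.g., trajectory encodings).
    Returns the indices of the selected options.
    """
    L = features @ features.T            # similarity kernel over candidates
    selected, remaining = [], list(range(len(features)))
    for _ in range(min(k, len(features))):
        best, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            # Volume spanned by the candidate set: log-det of the restricted kernel.
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)] + 1e-6 * np.eye(len(idx)))
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```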
https://arxiv.org/abs/2410.02389
The Diffusion Model has not only garnered noteworthy achievements in the realm of image generation but has also demonstrated its potential as an effective pretraining method utilizing unlabeled data. Drawing from the extensive potential unveiled by the Diffusion Model in both semantic correspondence and open vocabulary segmentation, our work initiates an investigation into employing the Latent Diffusion Model for Few-shot Semantic Segmentation. Recently, inspired by the in-context learning ability of large language models, Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, morphing into a crucial element in assessing generalist segmentation models. In this context, we concentrate on Few-shot Semantic Segmentation, establishing a solid foundation for the future development of a Diffusion-based generalist model for segmentation. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework. Subsequently, we delve deeper into optimizing the infusion of information from the support mask and simultaneously re-evaluating how to provide reasonable supervision from the query mask. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework and effectively utilizing the pre-training prior. Experimental results demonstrate that our method significantly outperforms the previous SOTA models in multiple settings.
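KV fusion within self-attention can be sketched by letting the query image's tokens attend to a concatenation of its own keys/values and the support image's keys/values. Shapes and the single attention call are assumptions; where exactly this happens inside the Latent Diffusion U-Net is the paper's design.

```python
import torch
import torch.nn.functional as F

def kv_fused_attention(q_query: torch.Tensor,
                       k_query: torch.Tensor, v_query: torch.Tensor,
                       k_support: torch.Tensor, v_support: torch.Tensor) -> torch.Tensor:
    """Self-attention for the query image with the support image's K/V fused in.

    All tensors: (batch, heads, tokens, head_dim); the support image may have a
    different token count. Query tokens attend jointly to both images, which lets
    support-image appearance flow into the query features.
    """
    k = torch.cat([k_query, k_support], dim=2)
    v = torch.cat([v_query, v_support], dim=2)
    return F.scaled_dot_product_attention(q_query, k, v)
```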
https://arxiv.org/abs/2410.02369
Text plays a crucial role in the transmission of human civilization, and teaching machines to generate online handwritten text in various styles presents an interesting and significant challenge. However, most prior work has concentrated on generating individual Chinese fonts, leaving complete text line generation largely unexplored. In this paper, we identify that text lines can naturally be divided into two components: layout and glyphs. Based on this division, we designed a text line layout generator coupled with a diffusion-based stylized font synthesizer to address this challenge hierarchically. More concretely, the layout generator performs in-context-like learning based on the text content and the provided style references to generate positions for each glyph autoregressively. Meanwhile, the font synthesizer, which consists of a character embedding dictionary, a multi-scale calligraphy style encoder, and a 1D U-Net based diffusion denoiser, generates each glyph at its position while imitating the calligraphy style extracted from the given style references. Qualitative and quantitative experiments on CASIA-OLHWDB demonstrate that our method is capable of generating structurally correct and indistinguishable imitation samples.
https://arxiv.org/abs/2410.02309
Unrestricted adversarial attacks typically manipulate the semantic content of an image (e.g., color or texture) to create adversarial examples that are both effective and photorealistic. Recent works have utilized the diffusion inversion process to map images into a latent space, where high-level semantics are manipulated by introducing perturbations. However, they often result in substantial semantic distortions in the denoised output and suffer from low efficiency. In this study, we propose a novel framework called Semantic-Consistent Unrestricted Adversarial Attacks (SCA), which employs an inversion method to extract edit-friendly noise maps and utilizes a Multimodal Large Language Model (MLLM) to provide semantic guidance throughout the process. Under the condition of rich semantic information provided by the MLLM, we perform the DDPM denoising process of each step using a series of edit-friendly noise maps, and leverage DPM Solver++ to accelerate this process, enabling efficient sampling with semantic consistency. Compared to existing methods, our framework enables the efficient generation of adversarial examples that exhibit minimal discernible semantic changes. Consequently, we for the first time introduce Semantic-Consistent Adversarial Examples (SCAE). Extensive experiments and visualizations have demonstrated the high efficiency of SCA, particularly in being on average 12 times faster than the state-of-the-art attacks. Our code can be found at this https URL.
https://arxiv.org/abs/2410.02240
We present SoundMorpher, a sound morphing method that generates perceptually uniform morphing trajectories using a diffusion model. Traditional sound morphing methods model the intractable relationship between the morph factor and the perception of the resulting sounds under a linear assumption, which oversimplifies the complex nature of sound perception and limits morph quality. In contrast, SoundMorpher explores an explicit proportional mapping between the morph factor and the perceptual stimuli of morphed sounds based on Mel-spectrograms. This approach enables smoother transitions between intermediate sounds and ensures perceptually consistent transformations, and it can easily be extended to diverse sound morphing tasks. Furthermore, we present a set of quantitative metrics to comprehensively assess sound morphing systems based on three objective criteria, namely, correspondence, perceptual intermediateness, and smoothness. We provide extensive experiments to demonstrate the effectiveness and versatility of SoundMorpher in real-world scenarios, highlighting its potential impact on various applications such as creative music composition, film post-production and interactive audio technologies.
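The explicit proportional mapping can be illustrated by measuring where a candidate morph sits perceptually between its two endpoints using Mel-spectrogram distances; a perceptually uniform trajectory should make this position grow roughly linearly with the calibrated morph factor. The distance and normalization below are illustrative choices, and the morphs themselves are assumed to come from the diffusion model.

```python
import numpy as np
import librosa

def mel_distance(a: np.ndarray, b: np.ndarray, sr: int = 22050) -> float:
    """Euclidean distance between log-Mel spectrograms of two waveforms."""
    ma = librosa.power_to_db(librosa.feature.melspectrogram(y=a, sr=sr))
    mb = librosa.power_to_db(librosa.feature.melspectrogram(y=b, sr=sr))
    n = min(ma.shape[1], mb.shape[1])          # align frame counts
    return float(np.linalg.norm(ma[:, :n] - mb[:, :n]))

def perceptual_position(morph: np.ndarray, src: np.ndarray, tgt: np.ndarray,
                        sr: int = 22050) -> float:
    """Where a morph lies between source (0.0) and target (1.0), perceptually."""
    d_src, d_tgt = mel_distance(morph, src, sr), mel_distance(morph, tgt, sr)
    return d_src / (d_src + d_tgt + 1e-12)

# A perceptually uniform trajectory keeps perceptual_position(morph_i, src, tgt)
# increasing roughly linearly as the calibrated morph factor goes from 0 to 1.
```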
https://arxiv.org/abs/2410.02144
We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy Unet-based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172x fewer parameters, 371% less memory, and offering 36x faster inference than the current 860M-parameter state-of-the-art model (93.9% accuracy). The larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5x fewer parameters. These results highlight the scalability and effectiveness of our approach.
https://arxiv.org/abs/2410.02130
Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.
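Expert-choice routing flips the usual token-choice MoE: each expert selects the tokens it will process, so compute concentrates on the tokens that score highest for it. A minimal sketch of the routing math only (single sequence, linear experts), not the EC-DIT architecture.

```python
import torch
import torch.nn as nn

class ExpertChoiceRouter(nn.Module):
    """Expert-choice MoE routing: each expert picks its own top-k tokens."""

    def __init__(self, dim: int, num_experts: int, capacity: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.capacity = capacity  # tokens processed per expert (must be <= num_tokens)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (num_tokens, dim); the batch dimension is omitted for clarity.
        affinity = self.gate(tokens).softmax(dim=-1)      # (num_tokens, num_experts)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Each expert chooses the tokens with the highest affinity to it,
            # so "harder" tokens can be picked up by several experts.
            weight, idx = affinity[:, e].topk(self.capacity)
            out[idx] += weight.unsqueeze(-1) * expert(tokens[idx])
        return out
```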
https://arxiv.org/abs/2410.02098
Posterior sampling in high-dimensional spaces using generative models holds significant promise for various applications, including but not limited to inverse problems and guided generation tasks. Despite many recent developments, generating diverse posterior samples remains a challenge, as existing methods require restarting the entire generative process for each new sample, making the procedure computationally expensive. In this work, we propose efficient posterior sampling by simulating Langevin dynamics in the noise space of a pre-trained generative model. By exploiting the mapping between the noise and data spaces which can be provided by distilled flows or consistency models, our method enables seamless exploration of the posterior without the need to re-run the full sampling chain, drastically reducing computational overhead. Theoretically, we prove a guarantee for the proposed noise-space Langevin dynamics to approximate the posterior, assuming that the generative model sufficiently approximates the prior distribution. Our framework is experimentally validated on image restoration tasks involving noisy linear and nonlinear forward operators applied to LSUN-Bedroom (256 x 256) and ImageNet (64 x 64) datasets. The results demonstrate that our approach generates high-fidelity samples with enhanced semantic diversity even under a limited number of function evaluations, offering superior efficiency and performance compared to existing diffusion-based posterior sampling techniques.
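Noise-space Langevin dynamics keeps a latent z, maps it through a (distilled or consistency) generator G, and updates z with the gradient of the measurement log-likelihood plus the standard Gaussian prior on z. A sketch assuming a differentiable generator and a known forward operator A with Gaussian noise; step size and iteration count are illustrative.

```python
import torch

def noise_space_langevin(G, A, y, z_init, n_steps=200, step=1e-3, sigma_y=0.05):
    """Posterior sampling by Langevin dynamics in the generator's noise space.

    G: noise -> image, a differentiable one-step generator (distilled flow /
       consistency model). A: forward operator of the inverse problem (callable).
    y: observed measurement. z_init: starting noise with the shape of G's input.
    """
    z = z_init.detach().clone().requires_grad_(True)
    for _ in range(n_steps):
        x = G(z)
        # log p(y | z): Gaussian measurement likelihood; log p(z): standard normal prior.
        log_lik = -((A(x) - y) ** 2).sum() / (2 * sigma_y ** 2)
        log_prior = -(z ** 2).sum() / 2
        grad = torch.autograd.grad(log_lik + log_prior, z)[0]
        with torch.no_grad():
            # Unadjusted Langevin update on z; the chain never re-runs a full
            # diffusion sampling loop, only the cheap generator G.
            z += step * grad + (2 * step) ** 0.5 * torch.randn_like(z)
    return G(z).detach()
```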
https://arxiv.org/abs/2410.02078
We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models through instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
https://arxiv.org/abs/2406.11262
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audios. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging because not only should the generated data be acoustically consistent with the underlying small-scale dataset, but they should also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.
https://arxiv.org/abs/2410.02056