The advancements in the domain of LLMs in recent years have surprised many, showcasing their remarkable capabilities and diverse applications. Their potential applications in various real-world scenarios have led to significant research on their reliability and effectiveness. On the other hand, multimodal LLMs and Text-to-Image models have only recently gained prominence, especially when compared to text-only LLMs. Their reliability remains constrained due to insufficient research on assessing their performance and robustness. This paper aims to establish a comprehensive evaluation framework for Text-to-Image models, concentrating particularly on their adherence to prompts. We created a novel dataset that aimed to assess the robustness of these models in generating images that conform to the specified factors of variation in the input text prompts. Our evaluation studies present findings on three variants of Stable Diffusion models: Stable Diffusion 3 Medium, Stable Diffusion 3.5 Large, and Stable Diffusion 3.5 Large Turbo, and two variants of Janus models: Janus Pro 1B and Janus Pro 7B. We introduce a pipeline that leverages text descriptions generated by the gpt-4o model for our ground-truth images, which are then used to generate artificial images by passing these descriptions to the Text-to-Image models. We then pass these generated images again through gpt-4o using the same system prompt and compare the variation between the two descriptions. Our results reveal that these models struggle to create simple binary images with only two factors of variation: a simple geometric shape and its location. We also show, using pre-trained VAEs on our dataset, that they fail to generate images that follow our input dataset distribution.
Recent years have seen many surprising advances in large language models (LLMs), showcasing their remarkable capabilities across a wide range of scenarios. Because these models show great potential in real-world applications, their reliability and effectiveness have been studied extensively. By contrast, multimodal LLMs and text-to-image models have only recently gained prominence, especially compared with text-only LLMs, and their reliability remains limited because research assessing their performance and robustness is still scarce. This paper aims to establish a comprehensive evaluation framework for text-to-image models, focusing on how well they adhere to the factors of variation specified in the input text prompts. To this end, we built a new dataset for assessing the robustness of these models under these conditions. Our study evaluates three Stable Diffusion variants, Stable Diffusion 3 Medium, Stable Diffusion 3.5 Large, and Stable Diffusion 3.5 Large Turbo, and two Janus variants, Janus Pro 1B and Janus Pro 7B. For the evaluation, we introduce a pipeline that uses text descriptions generated by gpt-4o for the ground-truth images and passes these descriptions to the text-to-image models to produce synthetic images. The generated images are then passed back through gpt-4o with the same system prompt, and the two descriptions are compared. Our results show that these models struggle to create simple binary images with only two factors of variation, a simple geometric shape and its location. Using VAEs pre-trained on our dataset, we also show that the models fail to generate images that follow the input data distribution. Overall, this study reveals some limitations of current text-to-image models and provides valuable insights and suggestions for improving them.
https://arxiv.org/abs/2507.08039
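Since the abstract only sketches the evaluation pipeline, here is a minimal, hedged Python sketch of the describe → generate → re-describe → compare loop it outlines. The gpt-4o call, the text-to-image call, and the attribute-based comparison are placeholder functions assumed for illustration; they are not the authors' code.

```python
# Minimal sketch of the describe -> generate -> re-describe -> compare loop.
# `describe_with_gpt4o` and `generate_image` are hypothetical placeholders for
# the paper's gpt-4o captioning call and a Stable Diffusion / Janus pipeline.

from dataclasses import dataclass

@dataclass
class Description:
    shape: str      # e.g. "circle", "square"
    location: str   # e.g. "top-left", "center"

def describe_with_gpt4o(image) -> Description:
    """Placeholder: query gpt-4o with a fixed system prompt and parse the reply."""
    raise NotImplementedError

def generate_image(description: Description):
    """Placeholder: render the description with a text-to-image model."""
    raise NotImplementedError

def prompt_adherence(dataset) -> float:
    """Fraction of samples whose re-described shape and location both match."""
    hits = 0
    for ground_truth_image in dataset:
        ref = describe_with_gpt4o(ground_truth_image)  # description of the real image
        gen = generate_image(ref)                      # image generated from that description
        out = describe_with_gpt4o(gen)                 # description of the generated image
        hits += int(out.shape == ref.shape and out.location == ref.location)
    return hits / len(dataset)
```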
Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding \emph{linear} modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding \emph{nonlinear} Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent the forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit contents demonstrated that the proposed CPE outperforms prior arts by keeping diverse remaining concepts while deleting the target concepts with robustness against attack prompts. Code is available at this https URL
Text-to-image diffusion models have made remarkable progress, but they have also raised concerns about generating images of inappropriate or trademarked concepts. Concept erasing aims to delete target concepts from a diffusion model while preserving the remaining concepts with minimal distortion, and recent methods usually achieve this by fine-tuning the model's cross-attention layers. In this work, the authors first show that merely updating the cross-attention layers, which is mathematically equivalent to adding linear modules to the weights, may not be enough to preserve a diverse set of remaining concepts. To address this, they propose the Concept Pinpoint Eraser (CPE), a framework that adds nonlinear Residual Attention Gates (ResAGs) to selectively erase (or cut) target concepts while protecting remaining concepts from broad distributions through an attention anchoring loss that prevents forgetting. CPE is additionally trained adversarially, iterating between the ResAGs and learnable text embeddings, to maximize erasing performance and improve robustness against adversarial attacks. Extensive experiments on erasing celebrities, artistic styles, and explicit content show that CPE outperforms prior methods, preserving diverse remaining concepts while deleting the targets and resisting attack prompts. The code is available at the provided link.
https://arxiv.org/abs/2506.22806
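The exact ResAG architecture is specified in the paper; the sketch below only illustrates the general idea the abstract describes: a small nonlinear module added residually on the text-conditioning path of a cross-attention layer, initialized as a no-op. Dimensions and the gate's internals are assumptions.

```python
import torch
import torch.nn as nn

class ResidualAttentionGate(nn.Module):
    """Illustrative nonlinear residual gate added on top of a frozen
    cross-attention projection; not the exact ResAG from the paper."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.gate = nn.Sequential(            # small nonlinear module
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )
        nn.init.zeros_(self.gate[-1].weight)  # start as identity (no change)
        nn.init.zeros_(self.gate[-1].bias)

    def forward(self, text_hidden: torch.Tensor) -> torch.Tensor:
        # Nonlinear residual: frozen projection output + learned correction.
        return text_hidden + self.gate(text_hidden)

# Usage: wrap the text-conditioning path of a cross-attention layer so that
# target-concept embeddings can be redirected while others pass through.
h = torch.randn(2, 77, 768)                     # assumed CLIP text-state shape
print(ResidualAttentionGate(768)(h).shape)      # torch.Size([2, 77, 768])
```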
Standard datasets often present limitations, particularly due to the fixed nature of input data sensors, which makes it difficult to compare methods that actively adjust sensor parameters to suit environmental conditions. This is the case with Automatic-Exposure (AE) methods, which rely on environmental factors to influence the image acquisition process. As a result, AE methods have traditionally been benchmarked in an online manner, rendering experiments non-reproducible. Building on our prior work, we propose a methodology that utilizes an emulator capable of generating images at any exposure time. This approach leverages BorealHDR, a unique multi-exposure stereo dataset, along with its new extension, in which data was acquired along a repeated trajectory at different times of the day to assess the impact of changing illumination. In total, BorealHDR covers 13.4 km over 59 trajectories in challenging lighting conditions. The dataset also includes lidar-inertial-odometry-based maps with pose estimation for each image frame, as well as Global Navigation Satellite System (GNSS) data for comparison. We demonstrate that by using images acquired at various exposure times, we can emulate realistic images with a Root-Mean-Square Error (RMSE) below 1.78% compared to ground truth images. Using this offline approach, we benchmarked eight AE methods, concluding that the classical AE method remains the field's best performer. To further support reproducibility, we provide in-depth details on the development of our backpack acquisition platform, including hardware, electrical components, and performance specifications. Additionally, we share valuable lessons learned from deploying the backpack over more than 25 km across various environments. Our code and dataset are available online at this link: this https URL BorealHDR
Standard datasets often have limitations, in particular because the characteristics of the input sensors are fixed, which makes it hard to compare methods that adjust sensor parameters to suit environmental conditions. This is especially true of automatic-exposure (AE) methods, which rely on environmental factors to influence the image acquisition process. As a result, AE methods have traditionally been benchmarked online, making the experiments non-reproducible. Building on our previous work, we propose a methodology that uses an emulator capable of generating images at any exposure time. The approach leverages BorealHDR, a unique multi-exposure stereo dataset, together with a new extension in which data was collected along a repeated trajectory at different times of day to assess the impact of changing illumination. In total, BorealHDR covers 13.4 km over 59 trajectories under challenging lighting conditions. The dataset also includes lidar-inertial-odometry-based maps with pose estimates for each image frame, as well as Global Navigation Satellite System (GNSS) data for comparison. We show that, using images acquired at various exposure times, we can emulate realistic images with a root-mean-square error below 1.78% relative to ground-truth images. With this offline approach, we benchmarked eight AE methods and conclude that the classical AE method remains the best performer. To further support reproducibility, we provide detailed information on the development of our backpack acquisition platform, including hardware, electrical components, and performance specifications, and we share lessons learned from deploying the backpack over more than 25 km in a variety of environments. Our code and dataset are available online: BorealHDR (this https URL).
https://arxiv.org/abs/2506.18844
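As a rough illustration of how an exposure emulator can work, the sketch below re-exposes a bracketed capture at an arbitrary target exposure time under a simple linear-response assumption. It is not the BorealHDR emulator itself, whose bracket-selection strategy and radiometric handling are described in the paper.

```python
import numpy as np

def emulate_exposure(brackets, target_time):
    """Emulate an image at `target_time` from bracketed (exposure_time, image) pairs.

    Simplified sketch: assumes a linear radiometric response and uses the bracket
    whose exposure time is closest (in log space) to the target.
    """
    t_src, img_src = min(brackets, key=lambda b: abs(np.log(b[0] / target_time)))
    radiance = img_src.astype(np.float64) / t_src   # radiance ~ pixel value / exposure time
    emulated = radiance * target_time               # re-expose at the requested time
    return np.clip(emulated, 0.0, 1.0)

# Example with synthetic data (images normalized to [0, 1]):
rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 0.2, size=(4, 4))
brackets = [(t, np.clip(scene * t, 0, 1)) for t in (1.0, 2.0, 4.0)]
print(emulate_exposure(brackets, target_time=3.0))
```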
We consider the problem of disentangling 3D from large vision-language models, which we show on generative 3D portraits. This allows free-form text control of appearance attributes like age, hair style, and glasses, and 3D geometry control of face expression and camera pose. In this setting, we assume we use a pre-trained large vision-language model (LVLM; CLIP) to generate from a smaller 2D dataset with no additional paired labels and with a pre-defined 3D morphable model (FLAME). First, we disentangle using canonicalization to a 2D reference frame from a deformable neural 3D triplane representation. But another form of entanglement arises from the significant noise in the LVLM's embedding space that describes irrelevant features. This damages output quality and diversity, but we overcome this with a Jacobian regularization that can be computed efficiently with a stochastic approximator. Compared to existing methods, our approach produces portraits with added text and 3D control, where portraits remain consistent when either control is changed. Broadly, this approach lets creators control 3D generators on their own 2D face data without needing resources to label large data or train large models.
We consider the problem of disentangling 3D from large vision-language models, demonstrated on generative 3D portraits. This enables free-form text control of appearance attributes such as age, hair style, and glasses, and 3D geometric control of facial expression and camera pose. In this setting, we assume a pre-trained large vision-language model (LVLM; CLIP) is used to generate from a smaller 2D dataset with no additional paired labels, together with a pre-defined 3D morphable model (FLAME). First, we disentangle by canonicalizing a deformable neural 3D triplane representation to a 2D reference frame. A second form of entanglement, however, arises from the significant noise in the LVLM's embedding space that describes irrelevant features. This noise degrades output quality and diversity, but we overcome it with a Jacobian regularization that can be computed efficiently with a stochastic approximator. Compared with existing methods, our approach produces portraits with added text and 3D control, and the portraits remain consistent when either control is changed. More broadly, this approach lets creators control 3D generators on their own 2D face data without the resources needed to label large datasets or train large models.
https://arxiv.org/abs/2506.14015
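The Jacobian regularization with a stochastic approximator mentioned above is commonly implemented as a Hutchinson-style estimator of the squared Frobenius norm via vector-Jacobian products; a generic sketch of that estimator (not necessarily the authors' exact regularizer) follows.

```python
import torch

def jacobian_frobenius_sq(f, x, n_samples: int = 1) -> torch.Tensor:
    """Stochastic estimate of ||J_f(x)||_F^2 using Hutchinson's trick:
    E_v[||J^T v||^2] = ||J||_F^2 for probe vectors v with identity covariance."""
    x = x.requires_grad_(True)
    y = f(x)
    est = 0.0
    for _ in range(n_samples):
        v = torch.randn_like(y)                              # random probe vector
        (jt_v,) = torch.autograd.grad(y, x, grad_outputs=v,  # J^T v via one backward pass
                                      create_graph=True, retain_graph=True)
        est = est + jt_v.pow(2).sum()
    return est / n_samples

# Example: penalize sensitivity of a small generator head to its input embedding.
g = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.Tanh(), torch.nn.Linear(32, 8))
z = torch.randn(4, 16)
reg = jacobian_frobenius_sq(g, z)
print(reg.item())
```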
We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM's denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM's pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.
We present LatentCSI, a novel method that generates images of the physical environment from WiFi CSI measurements by leveraging a pre-trained latent diffusion model (LDM). Unlike prior approaches that rely on complex, computationally intensive techniques such as GANs, our method uses a lightweight neural network to map CSI amplitudes directly into the LDM's latent space. We then apply the LDM's denoising diffusion model to the latent representation with text-based guidance and decode with the pre-trained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image-encoding stage usually required in conventional image-to-image pipelines, enabling efficient, high-quality image synthesis. We validate the approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras, and a subset of the publicly available MM-Fi dataset. The results show that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while its unique text-guided controllability provides additional practical advantages.
https://arxiv.org/abs/2506.10605
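A hedged sketch of the core idea, a lightweight network mapping CSI amplitudes into an LDM latent, is shown below. The latent shape, layer sizes, and the commented-out denoise/decode calls are placeholders assumed for illustration, not the paper's implementation or a real library API.

```python
import torch
import torch.nn as nn

class CSIToLatent(nn.Module):
    """Lightweight mapping from CSI amplitudes to an LDM latent (illustrative sizes)."""
    def __init__(self, n_subcarriers: int, latent_shape=(4, 64, 64)):
        super().__init__()
        self.latent_shape = latent_shape
        out = latent_shape[0] * latent_shape[1] * latent_shape[2]
        self.net = nn.Sequential(
            nn.Linear(n_subcarriers, 1024), nn.ReLU(),
            nn.Linear(1024, out),
        )

    def forward(self, csi_amplitude: torch.Tensor) -> torch.Tensor:
        z = self.net(csi_amplitude)
        return z.view(-1, *self.latent_shape)

# Hypothetical downstream usage with a pretrained LDM (method names are placeholders):
# z0     = CSIToLatent(n_subcarriers=242)(csi_batch)                 # CSI -> latent
# z_ref  = ldm.img2img_denoise(z0, prompt="a photo of the room", strength=0.4)
# images = ldm.decode(z_ref)                                          # pretrained VAE decoder
```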
Generating images in a consistent reference visual style remains a challenging computer vision task. State-of-the-art methods aiming for style-consistent generation struggle to effectively separate semantic content from stylistic elements, leading to content leakage from the image provided as a reference to the targets. To address this challenge, we propose Only-Style: a method designed to mitigate content leakage in a semantically coherent manner while preserving stylistic consistency. Only-Style works by localizing content leakage during inference, allowing the adaptive tuning of a parameter that controls the style alignment process, specifically within the image patches containing the subject in the reference image. This adaptive process best balances stylistic consistency with leakage elimination. Moreover, the localization of content leakage can function as a standalone component, given a reference-target image pair, allowing the adaptive tuning of any method-specific parameter that provides control over the impact of the stylistic reference. In addition, we propose a novel evaluation framework to quantify the success of style-consistent generations in avoiding undesired content leakage. Our approach demonstrates a significant improvement over state-of-the-art methods through extensive evaluation across diverse instances, consistently achieving robust stylistic consistency without undesired content leakage.
Generating images in a consistent reference visual style remains a challenging computer vision task. State-of-the-art methods aiming at style-consistent generation struggle to separate semantic content from stylistic elements, leading to unwanted content leaking from the reference image into the targets. To address this, we propose Only-Style, a method designed to mitigate content leakage in a semantically coherent way while preserving stylistic consistency. Only-Style localizes content leakage during inference and adaptively tunes a parameter that controls the style-alignment process, specifically within the image patches containing the subject of the reference image; this adaptive process best balances stylistic consistency against leakage elimination. Moreover, the content-leakage localization can serve as a standalone component: given a reference-target image pair, it enables adaptive tuning of any method-specific parameter that controls the influence of the stylistic reference. We also propose a novel evaluation framework to quantify how well style-consistent generation avoids undesired content leakage. Across extensive evaluations on diverse instances, our approach significantly outperforms state-of-the-art methods, consistently achieving robust stylistic consistency without undesired content leakage.
https://arxiv.org/abs/2506.09916
Recently, large language models (LLMs) have shown great promise across a diversity of tasks, ranging from generating images to reasoning spatially. Considering their remarkable (and growing) textual reasoning capabilities, we investigate LLMs' potency in conducting analyses of an individual's preferences in music (based on playlist metadata, personal write-ups, etc.) and producing effective prompts (based on these analyses) to be passed to Suno AI (a generative AI tool for music production). Our proposition of a novel LLM-based textual representation to music model (which we call TuneGenie) and the various methods we develop to evaluate & benchmark similar models add to the increasing (and increasingly controversial) corpus of research on the use of AI in generating art.
Large language models (LLMs) have recently shown great potential across a variety of tasks, from generating images to spatial reasoning. Given their remarkable and growing textual reasoning capabilities, we investigate how well these models can analyze an individual's musical preferences (based on playlist metadata, personal write-ups, and so on) and produce effective prompts, based on these analyses, to pass to Suno AI (a generative AI tool for music production). Our proposed LLM-based text-representation-to-music model, which we call TuneGenie, and the methods we develop for evaluating and benchmarking similar models add to the growing, and increasingly controversial, body of research on using AI to create art.
https://arxiv.org/abs/2506.12083
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. However, rapid T2I model advancements reveal limitations in early benchmarks, lacking comprehensive evaluations, for example, the evaluation on reasoning, text rendering and style. Notably, recent state-of-the-art models, with their rich knowledge modeling capabilities, show promising results on the image generation problems requiring strong reasoning ability, yet existing evaluation systems have not adequately addressed this frontier. To systematically address these gaps, we introduce OneIG-Bench, a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including prompt-image alignment, text rendering precision, reasoning-generated content, stylization, and diversity. By structuring the evaluation, this benchmark enables in-depth analysis of model performance, helping researchers and practitioners pinpoint strengths and bottlenecks in the full pipeline of image generation. Specifically, OneIG-Bench enables flexible evaluation by allowing users to focus on a particular evaluation subset. Instead of generating images for the entire set of prompts, users can generate images only for the prompts associated with the selected dimension and complete the corresponding evaluation accordingly. Our codebase and dataset are now publicly available to facilitate reproducible evaluation studies and cross-model comparisons within the T2I research community.
Text-to-image (T2I) models have attracted significant attention for generating high-quality images that match text prompts. However, the rapid progress of T2I models has exposed limitations in early benchmarks, which lack comprehensive evaluation, for example of reasoning, text rendering, and style. Notably, recent state-of-the-art models show promising results on image-generation problems that require strong reasoning, yet existing evaluation systems have not adequately addressed this frontier. To systematically fill these gaps, we introduce OneIG-Bench, a carefully designed benchmark framework for fine-grained, multi-dimensional evaluation of T2I models, covering prompt-image alignment, text-rendering precision, reasoning-generated content, stylization, and diversity. By structuring the evaluation, the benchmark enables in-depth analysis of model performance and helps researchers and practitioners pinpoint strengths and bottlenecks across the full image-generation pipeline. In particular, OneIG-Bench allows users to flexibly evaluate a chosen subset: instead of generating images for the entire prompt set, users can generate images only for the prompts associated with the selected dimension and complete the corresponding evaluation. Our codebase and dataset are now publicly available to facilitate reproducible evaluation and cross-model comparison within the T2I research community.
https://arxiv.org/abs/2506.07977
Domain generalization (DG) for object detection aims to enhance detectors' performance in unseen scenarios. This task remains challenging due to complex variations in real-world applications. Recently, diffusion models have demonstrated remarkable capabilities in diverse scene generation, which inspires us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. Furthermore, we propose an efficient knowledge transfer framework that enables detectors to inherit the generalization capabilities of diffusion models through feature and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks. The results demonstrate that our method achieves substantial improvements of 14.0% mAP over existing DG approaches across different domains and corruption types. Notably, our method even outperforms most domain adaptation methods without accessing any target domain data. Moreover, the diffusion-guided detectors show consistent improvements of 15.9% mAP on average compared to the baseline. Our work aims to present an effective approach for domain-generalized detection and provide potential insights for robust visual recognition in real-world scenarios. The code is available at this https URL.
Domain generalization (DG) for object detection aims to improve detector performance in unseen scenarios, a task that remains challenging because real-world applications involve complex variations. Recently, diffusion models have demonstrated remarkable abilities in generating diverse scenes, which motivates us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. We further propose an efficient knowledge-transfer framework that lets detectors inherit the generalization ability of diffusion models through feature-level and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks; the results show that our method improves on existing DG approaches by 14.0% mAP across different domains and corruption types. Notably, it even outperforms most domain-adaptation methods without accessing any target-domain data. Moreover, the diffusion-guided detectors improve over the baseline by 15.9% mAP on average. Our work aims to provide an effective approach to domain-generalized detection and potential insights for robust visual recognition in real-world scenarios. The code is available at this https URL.
https://arxiv.org/abs/2503.02101
Text-to-image (T2I) generation model has made significant advancements, resulting in high-quality images aligned with an input prompt. However, despite T2I generation's ability to generate fine-grained images, it still faces challenges in accurately generating images when the input prompt contains complex concepts, especially human pose. In this paper, we propose PointT2I, a framework that effectively generates images that accurately correspond to the human pose described in the prompt by using a large language model (LLM). PointT2I consists of three components: Keypoint generation, Image generation, and Feedback system. The keypoint generation uses an LLM to directly generate keypoints corresponding to a human pose, solely based on the input prompt, without external references. Subsequently, the image generation produces images based on both the text prompt and the generated keypoints to accurately reflect the target pose. To refine the outputs of the preceding stages, we incorporate an LLM-based feedback system that assesses the semantic consistency between the generated contents and the given prompts. Our framework is the first approach to leveraging LLM for keypoints-guided image generation without any fine-tuning, producing accurate pose-aligned images based solely on textual prompts.
Text-to-image (T2I) generation models have made significant progress and can produce high-quality images from an input prompt. However, although T2I models can generate fine-grained images in some respects, they still struggle to generate accurate images when the prompt contains complex concepts, especially human pose. In this paper we propose PointT2I, a framework that uses a large language model (LLM) to effectively generate high-quality images that match the human pose described in the prompt. PointT2I consists of three components: keypoint generation, image generation, and a feedback system. Keypoint generation uses an LLM to produce keypoints corresponding to the target pose directly from the input prompt, without external references. Image generation then produces images conditioned on both the text prompt and the generated keypoints so that the target pose is accurately reflected. To refine the outputs of the preceding stages, an LLM-based feedback system assesses the semantic consistency between the generated content and the given prompt. Our framework is the first to leverage an LLM for keypoint-guided image generation without any fine-tuning, producing accurately pose-aligned images from text prompts alone.
https://arxiv.org/abs/2506.01370
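A control-flow sketch of the three components (keypoint generation, keypoint-conditioned image generation, and LLM feedback) is below; every function is a hypothetical placeholder standing in for the corresponding LLM or generator call.

```python
# Control-flow sketch of the three PointT2I components; all functions are
# hypothetical placeholders, not the authors' implementation.

def llm_generate_keypoints(prompt: str) -> list[tuple[float, float]]:
    """Placeholder: ask an LLM for (x, y) body keypoints matching the described pose."""
    raise NotImplementedError

def generate_with_keypoints(prompt: str, keypoints):
    """Placeholder: pose-conditioned image generation (e.g. a ControlNet-style model)."""
    raise NotImplementedError

def llm_feedback(prompt: str, keypoints, image) -> bool:
    """Placeholder: LLM judges semantic consistency between prompt and outputs."""
    raise NotImplementedError

def point_t2i(prompt: str, max_rounds: int = 3):
    for _ in range(max_rounds):
        keypoints = llm_generate_keypoints(prompt)           # 1) keypoints from text only
        image = generate_with_keypoints(prompt, keypoints)   # 2) pose-aligned generation
        if llm_feedback(prompt, keypoints, image):           # 3) feedback gate
            return image
    return image                                             # best effort after max_rounds
```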
Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization. To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: this http URL.
Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in natural language processing (NLP), their integration into generative vision models is still underexplored. We introduce ReasonGen-R1, a two-stage framework: it first endows an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To let the model reason in text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pre-trained vision-language model to assess overall visual quality, optimizing the policy at each update. Evaluations on GenEval, DPG, and the T2I benchmark show that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More information: this http URL.
https://arxiv.org/abs/2505.24875
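GRPO itself is not detailed in the abstract; the snippet below sketches the group-relative advantage computation that gives GRPO its name, with rewards assumed to come from the pretrained vision-language quality scorer.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Core of GRPO-style updates: rewards for a group of samples drawn from the
    same prompt are normalized against the group mean/std, so no value network
    is needed. `rewards` has shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 generated images each, rewards from a VLM quality scorer.
rewards = torch.tensor([[0.2, 0.5, 0.9, 0.4],
                        [0.7, 0.1, 0.3, 0.6]])
print(group_relative_advantages(rewards))
```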
The diffusion models (DMs) have demonstrated the remarkable capability of generating images via learning the noised score function of data distribution. Current DM sampling techniques typically rely on first-order Langevin dynamics at each noise level, with efforts concentrated on refining inter-level denoising strategies. While leveraging additional second-order Hessian geometry to enhance the sampling quality of Langevin is a common practice in Markov chain Monte Carlo (MCMC), the naive attempts to utilize Hessian geometry in high-dimensional DMs lead to quadratic-complexity computational costs, rendering them non-scalable. In this work, we introduce a novel Levenberg-Marquardt-Langevin (LML) method that approximates the diffusion Hessian geometry in a training-free manner, drawing inspiration from the celebrated Levenberg-Marquardt optimization algorithm. Our approach introduces two key innovations: (1) A low-rank approximation of the diffusion Hessian, leveraging the DMs' inherent structure and circumventing explicit quadratic-complexity computations; (2) A damping mechanism to stabilize the approximated Hessian. This LML approximated Hessian geometry enables the diffusion sampling to execute more accurate steps and improve the image generation quality. We further conduct a theoretical analysis to substantiate the approximation error bound of low-rank approximation and the convergence property of the damping mechanism. Extensive experiments across multiple pretrained DMs validate that the LML method significantly improves image generation quality, with negligible computational overhead.
Diffusion models (DMs) have demonstrated a remarkable ability to generate images by learning the noised score function of the data distribution. Current DM sampling techniques typically rely on first-order Langevin dynamics at each noise level, with effort concentrated on refining inter-level denoising strategies. Although leveraging additional second-order Hessian geometry to improve Langevin sampling quality is common practice in Markov chain Monte Carlo (MCMC), naive attempts to use Hessian geometry in high-dimensional DMs incur quadratic-complexity computational costs and therefore do not scale. In this work, we introduce a novel Levenberg-Marquardt-Langevin (LML) method that approximates the diffusion Hessian geometry in a training-free manner, drawing inspiration from the celebrated Levenberg-Marquardt optimization algorithm. The approach has two key innovations: (1) a low-rank approximation of the diffusion Hessian that exploits the DMs' inherent structure and avoids explicit quadratic-complexity computation; (2) a damping mechanism that stabilizes the approximated Hessian. This LML-approximated Hessian geometry lets diffusion sampling take more accurate steps and improves image-generation quality. We also provide a theoretical analysis of the approximation error bound of the low-rank approximation and the convergence of the damping mechanism. Extensive experiments on multiple pre-trained DMs confirm that the LML method significantly improves image-generation quality with negligible computational overhead.
https://arxiv.org/abs/2505.24222
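The LML derivation is the paper's own contribution; the snippet below only illustrates the standard linear-algebra reason a damped low-rank curvature model is cheap: a system of the form (U Uᵀ + λI)x = g can be solved via the Woodbury identity without ever forming a d×d matrix.

```python
import numpy as np

def damped_lowrank_solve(U: np.ndarray, lam: float, g: np.ndarray) -> np.ndarray:
    """Solve (U U^T + lam * I) x = g via the Woodbury identity.
    U is d x k with k << d, so the cost is O(d k^2) instead of O(d^2) or worse."""
    k = U.shape[1]
    small = lam * np.eye(k) + U.T @ U                     # k x k system only
    return (g - U @ np.linalg.solve(small, U.T @ g)) / lam

# Sanity check against the explicit dense solve on a small problem.
rng = np.random.default_rng(0)
d, k, lam = 50, 3, 0.5
U, g = rng.standard_normal((d, k)), rng.standard_normal(d)
dense = np.linalg.solve(U @ U.T + lam * np.eye(d), g)
print(np.allclose(damped_lowrank_solve(U, lam, g), dense))   # True
```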
Generating images from text involving complex and novel object arrangements remains a significant challenge for current text-to-image (T2I) models. Although prior layout-based methods improve object arrangements using spatial constraints with 2D layouts, they often struggle to capture 3D positioning and sacrifice quality and coherence. In this work, we introduce ComposeAnything, a novel framework for improving compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of LLMs to produce 2.5D semantic layouts from text, consisting of 2D object bounding boxes enriched with depth information and detailed captions. Based on this layout, we generate a spatial and depth aware coarse composite of objects that captures the intended composition, serving as a strong and interpretable prior that replaces stochastic noise initialization in diffusion-based T2I models. This prior guides the denoising process through object prior reinforcement and spatial-controlled denoising, enabling seamless generation of compositional objects and coherent backgrounds, while allowing refinement of inaccurate priors. ComposeAnything outperforms state-of-the-art methods on the T2I-CompBench and NSR-1K benchmarks for prompts with 2D/3D spatial arrangements, high object counts, and surreal compositions. Human evaluations further demonstrate that our model generates high-quality images with compositions that faithfully reflect the text.
Generating images from text involving complex and novel object arrangements remains a significant challenge for current text-to-image (T2I) models. Although earlier layout-based methods improve object arrangement using spatial constraints in 2D layouts, they often fail to capture 3D positioning and sacrifice quality and coherence. In this work, we introduce ComposeAnything, a novel framework that improves compositional image generation without retraining existing T2I models. Our approach first uses the chain-of-thought reasoning abilities of LLMs to produce 2.5D semantic layouts from text, consisting of 2D object bounding boxes enriched with depth information and detailed captions. Based on this layout, we generate a spatially and depth-aware coarse composite of the objects that captures the intended composition and serves as a strong, interpretable prior replacing the stochastic noise initialization of diffusion-based T2I models. This prior guides the denoising process through object-prior reinforcement and spatially controlled denoising, enabling seamless generation of compositional objects and coherent backgrounds while allowing inaccurate priors to be refined. ComposeAnything outperforms state-of-the-art methods on the T2I-CompBench and NSR-1K benchmarks for prompts with 2D/3D spatial arrangements, high object counts, and surreal compositions. Human evaluations further show that our model generates high-quality images whose compositions faithfully reflect the text.
https://arxiv.org/abs/2505.24086
Generating images from rhetorical languages remains a critical challenge for text-to-image models. Even state-of-the-art (SOTA) multimodal large language models (MLLM) fail to generate images based on the hidden meaning inherent in rhetorical language--despite such content being readily mappable to visual representations by humans. A key limitation is that current models emphasize object-level word embedding alignment, causing metaphorical expressions to steer image generation towards their literal visuals and overlook the intended semantic meaning. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem, incorporating a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes corresponding image-generation actions, constructing semantically richer visuals. In the inner layer, Rhet2Pix mitigates reward sparsity during image generation by discounting the final reward and optimizing every adjacent action pair along the diffusion denoising trajectory. Extensive experiments demonstrate the effectiveness of Rhet2Pix in rhetorical text-to-image generation. Our model outperforms SOTA MLLMs such as GPT-4o, Grok-3 and leading academic baselines across both qualitative and quantitative evaluations. The code and dataset used in this work are publicly available.
Generating images from rhetorical language remains a critical challenge for text-to-image models. Even state-of-the-art multimodal large language models (MLLMs) fail to generate images based on the hidden meaning inherent in rhetorical language, despite such content being readily mappable to visual representations by humans. Current models emphasize object-level word-embedding alignment, which steers metaphorical expressions toward their literal visuals and overlooks the intended semantics. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem with a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes the corresponding image-generation actions, building semantically richer visuals. In the inner layer, Rhet2Pix mitigates reward sparsity during image generation by discounting the final reward and optimizing every adjacent action pair along the diffusion denoising trajectory. Extensive experiments demonstrate the effectiveness of Rhet2Pix for rhetorical text-to-image generation: our model outperforms state-of-the-art MLLMs such as GPT-4o and Grok-3, as well as leading academic baselines, in both qualitative and quantitative evaluations. The code and dataset used in this work are publicly available.
https://arxiv.org/abs/2505.22792
Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise. Recent approaches have shown promise in automating this process through large language models (LLMs), but they suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow vocabulary training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.
Text-to-image generation has evolved from single monolithic models to complex multi-component pipelines that combine fine-tuned generators, adapters, upscaling blocks, and even editing steps, yielding significant improvements in image quality. Designing such pipelines effectively, however, requires substantial expertise. Recent work has shown promise in automating this process with large language models (LLMs), but it suffers from two critical limitations: the extensive compute needed to generate images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We propose a novel reinforcement-learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models that predict image-quality scores directly from prompt-workflow combinations, removing the need for costly image generation during training. We then apply a two-phase training strategy: initial workflow-vocabulary training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. We additionally introduce a classifier-free-guidance-based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. A set of comparisons validates the approach, showing that it successfully creates new workflows with greater diversity and yields superior image quality compared to existing baselines.
https://arxiv.org/abs/2505.21478
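The abstract says the enhancement "extrapolates along the path between the initial and GRPO-tuned models" in a classifier-free-guidance style; one simple reading of that, applied to the two models' output logits, is sketched below. The guidance scale and the logits-level formulation are assumptions, not the paper's exact technique.

```python
import torch

def cfg_style_extrapolation(logits_initial: torch.Tensor,
                            logits_tuned: torch.Tensor,
                            guidance: float = 1.5) -> torch.Tensor:
    """Classifier-free-guidance-style extrapolation along the path between the
    initial and tuned models' predictions: guidance = 1.0 recovers the tuned
    model, guidance > 1.0 pushes further in the tuned direction."""
    return logits_initial + guidance * (logits_tuned - logits_initial)

# Example: next-token logits from two workflow-generating models (toy tensors).
base, tuned = torch.randn(1, 100), torch.randn(1, 100)
boosted = cfg_style_extrapolation(base, tuned, guidance=1.8)
print(boosted.shape)
```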
Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. Our code is available at this https URL.
Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but applying them to text remains challenging because of its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in the categorical simplex space, which respects discreteness but ignores the semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a new diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings according to semantic similarity, enabling gradual information removal while keeping a natural decoding process. Experiments on several sequence-to-sequence generation tasks show that Smoothie outperforms existing diffusion-based methods in generation quality, and ablation studies show that the proposed diffusion space performs better than both the standard embedding space and the categorical simplex. Our code is available at the provided link.
https://arxiv.org/abs/2505.18853
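To make the "progressive smoothing by semantic similarity" idea concrete, here is an illustrative single smoothing step over token embeddings; the actual Smoothie forward process, schedule, and kernel are specified in the paper and may differ.

```python
import torch
import torch.nn.functional as F

def smooth_token_embeddings(emb: torch.Tensor, tau: float = 0.1, alpha: float = 0.5):
    """One illustrative smoothing step: mix each token embedding with embeddings of
    semantically similar entries (rows of `emb` are token vectors). Larger alpha or
    more steps remove more information, mimicking a forward corruption process."""
    unit = F.normalize(emb, dim=-1)
    weights = F.softmax((unit @ unit.T) / tau, dim=-1)   # similarity-based kernel
    return (1 - alpha) * emb + alpha * (weights @ emb)   # partial smoothing

# Example: a toy "vocabulary" of 6 embeddings, smoothed progressively.
emb = torch.randn(6, 16)
for step in range(3):
    emb = smooth_token_embeddings(emb)
print(emb.shape)
```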
Classifier-free guidance (CFG) has emerged as a pivotal advancement in text-to-image latent diffusion models, establishing itself as a cornerstone technique for achieving high-quality image synthesis. However, under high guidance weights, where text-image alignment is significantly enhanced, CFG also leads to pronounced color distortions in the generated images. We identify that these distortions stem from the amplification of sample norms in the latent space. We present a theoretical framework that elucidates the mechanisms of norm amplification and anomalous diffusion phenomena induced by classifier-free guidance. Leveraging our theoretical insights and the latent space structure, we propose an Angle Domain Guidance (ADG) algorithm. ADG constrains magnitude variations while optimizing angular alignment, thereby mitigating color distortions while preserving the enhanced text-image alignment achieved at higher guidance weights. Experimental results demonstrate that ADG significantly outperforms existing methods, generating images that not only maintain superior text alignment but also exhibit improved color fidelity and better alignment with human perceptual preferences.
Classifier-free guidance (CFG) has become a pivotal advance in text-to-image latent diffusion models and a cornerstone technique for high-quality image synthesis. However, at high guidance weights, where text-image alignment is significantly enhanced, CFG also causes pronounced color distortions in the generated images. We identify that these distortions stem from the amplification of sample norms in the latent space, and we present a theoretical framework that explains the norm-amplification and anomalous-diffusion phenomena induced by classifier-free guidance. Leveraging these theoretical insights and the structure of the latent space, we propose an Angle Domain Guidance (ADG) algorithm. ADG constrains magnitude variation while optimizing angular alignment, mitigating color distortion while preserving the enhanced text-image alignment achieved at higher guidance weights. Experiments show that ADG significantly outperforms existing methods, generating images that not only maintain superior text alignment but also exhibit better color fidelity and closer agreement with human perceptual preferences.
https://arxiv.org/abs/2506.11039
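The precise ADG update is derived in the paper; the sketch below conveys only the headline idea, keeping the CFG direction while constraining the magnitude, here by rescaling the guided prediction to the conditional prediction's norm. Treat it as a rough approximation, not the authors' algorithm.

```python
import torch

def angle_constrained_guidance(eps_uncond: torch.Tensor,
                               eps_cond: torch.Tensor,
                               w: float = 7.5) -> torch.Tensor:
    """Illustrative norm-constrained variant of classifier-free guidance: take the
    CFG direction (angular part) but rescale its magnitude to that of the
    conditional prediction, limiting the norm amplification that the abstract
    links to color distortion. Not the exact ADG update from the paper."""
    eps_cfg = eps_uncond + w * (eps_cond - eps_uncond)            # standard CFG
    shape = (-1,) + (1,) * (eps_cfg.ndim - 1)
    target_norm = eps_cond.flatten(1).norm(dim=1).view(shape)
    cfg_norm = eps_cfg.flatten(1).norm(dim=1).view(shape)
    return eps_cfg * (target_norm / (cfg_norm + 1e-8))            # keep direction, constrain norm

# Toy check on latent-shaped tensors.
u, c = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
print(angle_constrained_guidance(u, c).shape)
```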
Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. For example, given the concept $\langle bo\rangle$, generating "$\langle bo\rangle$ wearing its hat" without additional textual descriptions of its hat. We call this kind of generation personalized knowledge-driven generation. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate the unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding, concept generation, and achieving state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released at: \href{this https URL}{this https URL}.
Personalized models have shown remarkable success in understanding and generating user-provided concepts. However, existing methods use separate concept tokens for understanding and generation, treating the two tasks in isolation, which can limit generation from complex prompts. For example, given the concept $\langle bo\rangle$, they cannot generate "$\langle bo\rangle$ wearing its hat" without an additional textual description of the hat. We call this kind of generation personalized knowledge-driven generation. To address this limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision-language model (VLM) for both understanding and generation. UniCTokens trains a set of unified concept tokens to exploit complementary semantics, boosting both personalized tasks. We further propose a progressive training strategy with three stages, understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation, to strengthen the mutual benefit between the two tasks. To quantitatively evaluate unified VLM personalization, we introduce UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. On UnifyBench, UniCTokens is competitive with leading methods in concept understanding and concept generation, and it achieves state-of-the-art results in personalized knowledge-driven generation. Our research shows that better understanding improves generation, and that the generation process can in turn yield valuable insights for understanding. Our code and dataset will be released at this https URL.
https://arxiv.org/abs/2505.14671
Image Generation models are a trending topic nowadays, with many people utilizing Artificial Intelligence models in order to generate images. There are many such models which, given a text prompt, will generate an image which depicts said prompt, such as Latent Diffusion Models, Denoising Diffusion Probabilistic Models, Generative Adversarial Networks and many more. When generating images, these models can generate sensitive image data, which can be threatening to privacy or may violate copyright laws of private entities. Machine unlearning aims at removing the influence of specific data subsets from the trained models and, in the case of image generation models, removing the influence of a concept such that the model is unable to generate images of said concept when prompted. Conventional retraining of the model can take up to days, hence fast algorithms are the need of the hour. In this paper we propose an algorithm that aims to remove the influence of concepts in diffusion models through updating the gradients of the final layers of the text encoders. Using a weighted loss function, we utilize backpropagation in order to update the weights of the final layers of the Text Encoder component of the Stable Diffusion Model, removing influence of the concept from the text-image embedding space, such that when prompted, the result is an image not containing the concept. The weighted loss function makes use of Textual Inversion and Low-Rank Adaptation. We perform our experiments on Latent Diffusion Models, namely the Stable Diffusion v2 model, with an average concept unlearning runtime of 50 seconds using 4-5 images.
Image generation models are a trending topic, with many people using AI models to generate images. Many such models, given a text prompt, can produce an image depicting that prompt; examples include latent diffusion models, denoising diffusion probabilistic models, and generative adversarial networks. When generating images, these models can produce sensitive content that threatens privacy or may violate the copyright of private entities. Machine unlearning aims to remove the influence of specific data subsets from trained models; for image-generation models, the goal is to remove the influence of a concept so that the model can no longer generate images of that concept when prompted. Conventional retraining can take days, so fast algorithms are especially important. In this paper, we propose an algorithm that removes the influence of concepts in diffusion models by updating the gradients of the final layers of the text encoder. Using a weighted loss function and backpropagation, we update the weights of the final layers of the Stable Diffusion text-encoder component, removing the concept's influence from the text-image embedding space so that, when prompted again, the resulting image no longer contains the concept. We validate the approach on latent diffusion models, specifically Stable Diffusion v2, and our experiments show an average concept-unlearning runtime of about 50 seconds using 4-5 images.
https://arxiv.org/abs/2505.12395
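A hedged sketch of the setup the abstract describes, training only the final text-encoder layers with a weighted loss, is shown below. The layer-name substrings, the anchor-concept formulation, and the loss weights are assumptions made for illustration, not the paper's exact objective.

```python
import torch
import torch.nn as nn

def freeze_all_but_last_layers(text_encoder: nn.Module,
                               trainable_substrings=("final_layer_norm",
                                                     "encoder.layers.22",
                                                     "encoder.layers.23")):
    """Train only the last layers of the text encoder. The substrings are assumed
    names and depend on the actual CLIP text-encoder implementation."""
    for name, p in text_encoder.named_parameters():
        p.requires_grad = any(s in name for s in trainable_substrings)

def weighted_unlearning_loss(emb_concept, emb_anchor, emb_retain, emb_retain_ref,
                             w_erase: float = 1.0, w_retain: float = 0.5):
    """Illustrative weighted objective: pull the target concept's embedding toward a
    neutral anchor while keeping unrelated ("retain") embeddings close to their
    values under a frozen copy of the encoder."""
    erase = nn.functional.mse_loss(emb_concept, emb_anchor.detach())
    retain = nn.functional.mse_loss(emb_retain, emb_retain_ref.detach())
    return w_erase * erase + w_retain * retain

# Toy usage with random embeddings standing in for encoder outputs.
e_c, e_a = torch.randn(1, 768, requires_grad=True), torch.randn(1, 768)
e_r, e_r0 = torch.randn(1, 768, requires_grad=True), torch.randn(1, 768)
loss = weighted_unlearning_loss(e_c, e_a, e_r, e_r0)
loss.backward()
```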
Stable Diffusion has advanced text-to-image synthesis, but training models to generate images with accurate object quantity is still difficult due to the high computational cost and the challenge of teaching models the abstract concept of quantity. In this paper, we propose CountDiffusion, a training-free framework aiming at generating images with correct object quantity from textual descriptions. CountDiffusion consists of two stages. In the first stage, an intermediate denoising result is generated by the diffusion model to predict the final synthesized image with one-step denoising, and a counting model is used to count the number of objects in this image. In the second stage, a correction module is used to correct the object quantity by changing the attention map of the object with universal guidance. The proposed CountDiffusion can be plugged into any diffusion-based text-to-image (T2I) generation models without further training. Experiment results demonstrate the superiority of our proposed CountDiffusion, which improves the accurate object quantity generation ability of T2I models by a large margin.
Stable Diffusion has advanced text-to-image synthesis, but training models to generate images with the correct object quantity remains difficult because of the high computational cost and the challenge of teaching models the abstract concept of quantity. This paper proposes CountDiffusion, a training-free framework for generating images with the correct object quantity from textual descriptions. CountDiffusion consists of two stages. In the first stage, the diffusion model produces an intermediate denoising result that predicts the final synthesized image via one-step denoising, and a counting model counts the objects in this image. In the second stage, a correction module corrects the object quantity by changing the object's attention map with universal guidance. CountDiffusion can be plugged into any diffusion-based text-to-image (T2I) model without further training. Experimental results demonstrate the superiority of CountDiffusion, which improves the ability of T2I models to generate accurate object quantities by a large margin.
https://arxiv.org/abs/2505.04347
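The first stage's one-step preview relies on the standard DDPM identity for estimating the clean image from an intermediate noisy latent; a sketch is below, with the counting model left as a placeholder.

```python
import torch

def predict_x0_one_step(x_t: torch.Tensor, eps_pred: torch.Tensor,
                        alpha_bar_t: float) -> torch.Tensor:
    """Standard one-step estimate of the final image from an intermediate noisy
    latent: x0 ~= (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)."""
    return (x_t - (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5

def needs_correction(x0_preview, prompt_count: int, counter) -> bool:
    """Stage-1 decision: count objects in the previewed image and compare with the
    quantity requested in the prompt. `counter` is a placeholder counting model."""
    return counter(x0_preview) != prompt_count

# Toy shapes only; eps_pred would come from the frozen diffusion UNet.
x_t, eps_pred = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
x0 = predict_x0_one_step(x_t, eps_pred, alpha_bar_t=0.5)
print(x0.shape)
```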