We introduce SynthLight, a diffusion model for portrait relighting. Our approach frames image relighting as a re-rendering problem, where pixels are transformed in response to changes in environmental lighting conditions. Using a physically-based rendering engine, we synthesize a dataset to simulate this lighting-conditioned transformation with 3D head assets under varying lighting. We propose two training and inference strategies to bridge the gap between the synthetic and real image domains: (1) multi-task training that takes advantage of real human portraits without lighting labels; (2) an inference time diffusion sampling procedure based on classifier-free guidance that leverages the input portrait to better preserve details. Our method generalizes to diverse real photographs and produces realistic illumination effects, including specular highlights and cast shadows, while preserving the subject's identity. Our quantitative experiments on Light Stage data demonstrate results comparable to state-of-the-art relighting methods. Our qualitative results on in-the-wild images showcase rich and unprecedented illumination effects. Project Page: \url{this https URL}
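The abstract does not spell out the classifier-free guidance variant that leverages the input portrait; below is a minimal sketch of one plausible form, following the common multi-condition CFG decomposition. The three-branch split and the guidance weights are assumptions, not SynthLight's published formulation.

```python
import torch

def relight_cfg(eps_uncond, eps_portrait, eps_full, w_img=2.0, w_light=2.5):
    """Compose noise predictions from three conditioning branches (assumed split).

    eps_uncond:   prediction with no conditioning
    eps_portrait: prediction conditioned on the input portrait only
    eps_full:     prediction conditioned on the portrait plus target lighting
    The first guidance term pulls samples toward the input portrait's details;
    the second pushes them toward the new illumination.
    """
    return (eps_uncond
            + w_img * (eps_portrait - eps_uncond)
            + w_light * (eps_full - eps_portrait))
```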
https://arxiv.org/abs/2501.09756
Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its reconstruction objective and downstream generative performance. Our work explores scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
https://arxiv.org/abs/2501.09755
Autoregressive sequence models, such as Transformer-based vision-language action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
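A minimal sketch of the frequency-space idea behind FAST -- compressing each action chunk with a per-dimension DCT before discretization -- is shown below. The rounding scale is an assumption and the subsequent byte-pair-encoding compression step is omitted; this is not the released FAST+ tokenizer.

```python
import numpy as np
from scipy.fft import dct, idct

def tokenize_chunk(actions, scale=10.0):
    """actions: (T, D) continuous actions for one chunk.

    A per-dimension DCT over time concentrates energy in low frequencies,
    so the quantized high-frequency coefficients are mostly zero and
    compress well in a follow-up BPE-style step (omitted here).
    """
    coeffs = dct(actions, axis=0, norm="ortho")
    return np.round(coeffs * scale).astype(np.int32)

def detokenize_chunk(tokens, scale=10.0):
    """Invert the quantized DCT back to a continuous action chunk."""
    return idct(tokens.astype(np.float64) / scale, axis=0, norm="ortho")
```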
https://arxiv.org/abs/2501.09747
Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen steps. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and that, given the complicated nature of images, the components of the framework can be chosen to suit different application scenarios.
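The simplest point in the design space described above -- random search over initial noises ranked by a verifier -- can be sketched as follows; `sample_fn` and `verifier_fn` are assumed stand-ins for a frozen diffusion sampler and a reward or preference model.

```python
import torch

def best_of_n_noise(sample_fn, verifier_fn, shape, n_candidates=8, device="cpu"):
    """Spend extra inference-time compute by sampling from several initial
    noises and keeping the candidate the verifier scores highest."""
    best_img, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape, device=device)
        img = sample_fn(noise)          # full denoising run from this noise
        score = verifier_fn(img)        # feedback signal, e.g. a reward model
        if score > best_score:
            best_img, best_score = img, score
    return best_img, best_score
```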
https://arxiv.org/abs/2501.09732
This tutorial provides an in-depth guide on inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning. This tutorial explores the foundational aspects of such inference-time algorithms. We review these methods from a unified perspective, demonstrating that current techniques -- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance -- aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre-trained denoising processes with value functions that serve as look-ahead functions, predicting terminal rewards from intermediate states. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models. The code of this tutorial on protein design is available at this https URL
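As a concrete illustration of the SMC-style value-guided sampling that the tutorial unifies, one reweight-and-resample step over a batch of intermediate latents might look like the sketch below; the exponential-tilting temperature and the externally supplied value estimates are assumptions.

```python
import torch

def smc_resample(particles, values, temperature=1.0):
    """particles: (N, ...) intermediate latents x_t from the denoising process.
    values:    (N,) look-ahead value estimates v(x_t) of the terminal reward.

    Resampling particles in proportion to exp(v / temperature) approximates
    sampling from the reward-tilted (soft-optimal) denoising policy.
    """
    weights = torch.softmax(values / temperature, dim=0)
    idx = torch.multinomial(weights, num_samples=particles.shape[0], replacement=True)
    return particles[idx]
```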
https://arxiv.org/abs/2501.09685
Video colorization aims to transform grayscale videos into vivid color representations while maintaining temporal consistency and structural integrity. Existing video colorization methods often suffer from color bleeding and lack comprehensive control, particularly under complex motion or diverse semantic cues. To this end, we introduce VanGogh, a unified multimodal diffusion-based framework for video colorization. VanGogh tackles these challenges using a Dual Qformer to align and fuse features from multiple modalities, complemented by a depth-guided generation process and an optical flow loss, which help reduce color overflow. Additionally, a color injection strategy and luma channel replacement are implemented to improve generalization and mitigate flickering artifacts. Thanks to this design, users can exercise both global and local control over the generation process, resulting in higher-quality colorized videos. Extensive qualitative and quantitative evaluations, and user studies, demonstrate that VanGogh achieves superior temporal consistency and color quality. Project page: this https URL.
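The luma channel replacement used to mitigate flickering can be illustrated with a small sketch; BT.601 luma/chroma constants are used here, and the exact color space VanGogh operates in is an assumption.

```python
import numpy as np

def replace_luma(colorized_rgb, gray_input):
    """colorized_rgb: (H, W, 3) model output in [0, 1]; gray_input: (H, W) in [0, 1].

    Keep only the predicted chroma and take the luma from the original
    grayscale frame, so per-frame structure comes from the source video.
    """
    r, g, b = colorized_rgb[..., 0], colorized_rgb[..., 1], colorized_rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b      # predicted luma (discarded)
    cb = (b - y) * 0.564
    cr = (r - y) * 0.713
    y = gray_input                             # replace luma with the input frame
    rgb = np.stack([y + 1.403 * cr,
                    y - 0.344 * cb - 0.714 * cr,
                    y + 1.773 * cb], axis=-1)
    return np.clip(rgb, 0.0, 1.0)
```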
https://arxiv.org/abs/2501.09499
The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed some of these issues, a comprehensive solution remains elusive. In this paper, we introduce \textbf{CaPa}, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. Initially, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, resulting in cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.
https://arxiv.org/abs/2501.09433
Large Reconstruction Models (LRMs) have recently become a popular method for creating 3D foundational models. Training 3D reconstruction models with 2D visual data traditionally requires prior knowledge of camera poses for the training samples, a process that is both time-consuming and prone to errors. Consequently, 3D reconstruction training has been confined to either synthetic 3D datasets or small-scale datasets with annotated poses. In this study, we investigate the feasibility of 3D reconstruction using unposed video data of various objects. We introduce UVRM, a novel 3D reconstruction model capable of being trained and evaluated on monocular videos without requiring any information about the pose. UVRM uses a transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is then decoded into a tri-plane 3D representation. To obviate the need for ground-truth pose annotations during training, UVRM employs a combination of the score distillation sampling (SDS) method and an analysis-by-synthesis approach, progressively synthesizing pseudo novel-views using a pre-trained diffusion model. We qualitatively and quantitatively evaluate UVRM's performance on the G-Objaverse and CO3D datasets without relying on pose information. Extensive experiments show that UVRM is capable of effectively and efficiently reconstructing a wide range of 3D objects from unposed videos.
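The score distillation sampling (SDS) ingredient can be sketched as below; the weighting and the frozen `pred_noise_fn` interface are assumptions, and the analysis-by-synthesis loop over pseudo novel views is omitted.

```python
import torch

def sds_gradient(pred_noise_fn, latent, t, alpha_bar_t, weight=1.0):
    """latent: encoded rendering of the current tri-plane from a sampled view.
    pred_noise_fn(noisy, t): noise prediction from a frozen diffusion model.
    alpha_bar_t: cumulative signal level of the noise schedule at step t (tensor).

    Returns the (constant-scaled) gradient pushed back into the 3D representation.
    """
    noise = torch.randn_like(latent)
    noisy = alpha_bar_t.sqrt() * latent + (1 - alpha_bar_t).sqrt() * noise
    with torch.no_grad():
        eps_pred = pred_noise_fn(noisy, t)
    return weight * (eps_pred - noise)
```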
https://arxiv.org/abs/2501.09347
Purpose: To propose a domain-conditioned and temporal-guided diffusion modeling method, termed dynamic Diffusion Modeling (dDiMo), for accelerated dynamic MRI reconstruction, enabling the diffusion process to characterize spatiotemporal information for time-resolved multi-coil Cartesian and non-Cartesian data. Methods: The dDiMo framework integrates temporal information from time-resolved dimensions, allowing for the concurrent capture of intra-frame spatial features and inter-frame temporal dynamics in diffusion modeling. It employs additional spatiotemporal ($x$-$t$) and self-consistent frequency-temporal ($k$-$t$) priors to guide the diffusion process. This approach ensures precise temporal alignment and enhances the recovery of fine image details. To facilitate a smooth diffusion process, the nonlinear conjugate gradient algorithm is utilized during the reverse diffusion steps. The proposed model was tested on two types of MRI data: Cartesian-acquired multi-coil cardiac MRI and Golden-Angle-Radial-acquired multi-coil free-breathing lung MRI, across various undersampling rates. Results: dDiMo achieved high-quality reconstructions at various acceleration factors, demonstrating improved temporal alignment and structural recovery compared to other competitive reconstruction methods, both qualitatively and quantitatively. The proposed diffusion framework exhibited robust performance in handling both Cartesian and non-Cartesian acquisitions, effectively reconstructing dynamic datasets in cardiac and lung MRI under different imaging conditions. Conclusion: This study introduces a novel diffusion modeling method for dynamic MRI reconstruction.
https://arxiv.org/abs/2501.09305
Flexibility in AI-based residential layout design remains a significant challenge, as traditional methods like rule-based heuristics and graph-based generation often lack flexibility and require substantial design knowledge from users. To address these limitations, we propose a cross-modal design approach based on the Stable Diffusion model for generating flexible residential layouts. The method offers multiple input types for learning objectives, allowing users to specify both boundaries and layouts. It incorporates natural language as design constraints and introduces ControlNet to enable stable layout generation through two distinct pathways. We also present a scheme that encapsulates design expertise within a knowledge graph and translates it into natural language, providing an interpretable representation of design knowledge. This comprehensibility and diversity of input options enable professionals and non-professionals alike to directly express design requirements, enhancing flexibility and controllability. Finally, experiments verify that the proposed method offers greater flexibility under multimodal constraints than state-of-the-art models, even when specific semantic information about room areas or connections is incomplete.
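One way the knowledge-graph-to-natural-language step could work is sketched below; the relation names and templates are hypothetical, since the abstract does not give the actual schema.

```python
def kg_to_constraints(triples):
    """Verbalise (subject, relation, object) design-knowledge triples into
    natural-language constraints appended to the generation prompt.
    The relations and templates here are illustrative only."""
    templates = {
        "adjacent_to": "the {s} is adjacent to the {o}",
        "faces": "the {s} faces {o}",
        "min_area_m2": "the {s} has an area of at least {o} square meters",
    }
    return "; ".join(templates[rel].format(s=s, o=o) for s, rel, o in triples)

# e.g. kg_to_constraints([("kitchen", "adjacent_to", "dining room")])
```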
https://arxiv.org/abs/2501.09279
Large-scale text-to-image (T2I) diffusion models have demonstrated outstanding performance in synthesizing diverse high-quality visuals from natural language text captions. Multiple layout-to-image models have been developed to control the generation process by utilizing a broad array of layouts such as segmentation maps, edges, and human keypoints. In this work, we present ObjectDiffusion, a model that takes inspiration from the top cutting-edge image generative frameworks to seamlessly condition T2I models with new bounding box capabilities. Specifically, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the condition processing and injection techniques proposed in GLIGEN. ObjectDiffusion is initialized with pretraining parameters to leverage the generation knowledge obtained from training on large-scale datasets. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model achieves an AP$_{50}$ of 46.6, an AR of 44.5, and a FID of 19.8, outperforming the current SOTA model trained on open-source datasets in all three metrics. ObjectDiffusion demonstrates a distinctive capability in synthesizing diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control layout. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding abilities in closed-set and open-set settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple objects of different sizes and locations.
https://arxiv.org/abs/2501.09194
Prostate cancer (PCa) is the most prevalent cancer among men in the United States, accounting for nearly 300,000 cases, 29% of all diagnoses and 35,000 total deaths in 2024. Traditional screening methods such as prostate-specific antigen (PSA) testing and magnetic resonance imaging (MRI) have been pivotal in diagnosis, but have faced limitations in specificity and generalizability. In this paper, we explore the potential of enhancing PCa lesion segmentation using a novel MRI modality called synthetic correlated diffusion imaging (CDI$^s$). We employ several state-of-the-art deep learning models, including U-Net, SegResNet, Swin UNETR, Attention U-Net, and LightM-UNet, to segment PCa lesions from a 200 CDI$^s$ patient cohort. We find that SegResNet achieved superior segmentation performance with a Dice-Sorensen coefficient (DSC) of $76.68 \pm 0.8$. Notably, the Attention U-Net, while slightly less accurate (DSC $74.82 \pm 2.0$), offered a favorable balance between accuracy and computational efficiency. Our findings demonstrate the potential of deep learning models in improving PCa lesion segmentation using CDI$^s$ to enhance PCa management and clinical support.
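For reference, the Dice-Sorensen coefficient reported above is the standard overlap metric for binary masks, shown here as a percentage.

```python
import numpy as np

def dice_coefficient(pred_mask, gt_mask, eps=1e-7):
    """pred_mask, gt_mask: boolean lesion masks of the same shape."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 100.0 * (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)
```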
https://arxiv.org/abs/2501.09185
The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
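The FIFO queue mechanics that Ouroboros-Diffusion builds on can be sketched as follows; `denoise_step` stands in for one update of the pre-trained video diffusion model, and the tail latent-sampling technique, SACFA, and self-recurrent guidance are omitted.

```python
from collections import deque
import torch

def fifo_generate(denoise_step, num_frames, queue_len=16, frame_shape=(4, 32, 32)):
    """Each queue position has a fixed noise level (clean at the head, pure
    noise at the tail); every iteration denoises all frames once, dequeues a
    clean frame, and enqueues fresh Gaussian noise."""
    noise_levels = torch.linspace(0.0, 1.0, queue_len)         # head -> tail
    queue = deque(torch.randn(frame_shape) for _ in range(queue_len))
    outputs = []
    for _ in range(num_frames):
        frames = denoise_step(torch.stack(list(queue)), noise_levels)
        queue = deque(frames)
        outputs.append(queue.popleft())                        # clean head frame
        queue.append(torch.randn(frame_shape))                 # new noise at the tail
    return torch.stack(outputs)
```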
https://arxiv.org/abs/2501.09019
Generative models are nowadays widely used to generate graphical content for multiple purposes, e.g. web, art, advertisement. However, it has been shown that the images generated by these models could reinforce societal biases already existing in specific contexts. In this paper, we focus on understanding whether this is the case when one generates images related to various software engineering tasks. In fact, the Software Engineering (SE) community is not immune from gender and ethnicity disparities, which could be amplified by the use of these models. Hence, if used without awareness, artificially generated images could reinforce these biases in the SE domain. Specifically, we perform an extensive empirical evaluation of the gender and ethnicity bias exposed by three versions of the Stable Diffusion (SD) model (a very popular open-source text-to-image model) - SD 2, SD XL, and SD 3 - towards SE tasks. We obtain 6,720 images by feeding each model with two sets of prompts describing different software-related tasks: one set includes the Software Engineer keyword, and one set does not include any specification of the person performing the task. Next, we evaluate the gender and ethnicity disparities in the generated images. Results show that all models are significantly biased towards male figures when representing software engineers. Regarding ethnicity, while SD 2 and SD XL are strongly biased towards White figures, SD 3 is slightly more biased towards Asian figures. Nevertheless, all models significantly under-represent Black and Arab figures, regardless of the prompt style used. The results of our analysis highlight severe concerns about adopting those models to generate content for SE tasks and open the field for future research on bias mitigation in this context.
https://arxiv.org/abs/2501.09014
Acquiring and annotating surgical data is often resource-intensive, ethically constrained, and requires significant expert involvement. While generative AI models like text-to-image can alleviate data scarcity, incorporating spatial annotations, such as segmentation masks, is crucial for precision-driven surgical applications, simulation, and education. This study introduces both a novel task and method, SimGen, for Simultaneous Image and Mask Generation. SimGen is a diffusion model based on the DDPM framework and Residual U-Net, designed to jointly generate high-fidelity surgical images and their corresponding segmentation masks. The model leverages cross-correlation priors to capture dependencies between continuous image and discrete mask distributions. Additionally, a Canonical Fibonacci Lattice (CFL) is employed to enhance class separability and uniformity in the RGB space of the masks. SimGen delivers high-fidelity images and accurate segmentation masks, outperforming baselines across six public datasets assessed on image and semantic inception distance metrics. An ablation study shows that the CFL improves mask quality and spatial separation. Downstream experiments suggest that the generated image-mask pairs are usable when regulations limit the release of human data for research. This work offers a cost-effective solution for generating paired surgical images and complex labels, advancing surgical AI development by reducing the need for expensive manual annotations.
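The Canonical Fibonacci Lattice idea -- spreading class colors as uniformly as possible so mask classes remain separable in RGB -- can be sketched as below; mapping the spherical lattice points into the RGB cube this way is an assumption.

```python
import numpy as np

def fibonacci_lattice_colors(num_classes):
    """Place one point per class on a Fibonacci lattice over the unit sphere,
    then map the coordinates into [0, 1]^3 to use as mask colors."""
    golden = (1 + 5 ** 0.5) / 2
    i = np.arange(num_classes)
    theta = 2 * np.pi * i / golden                 # longitudes spaced by the golden ratio
    z = 1 - (2 * i + 1) / num_classes              # evenly spaced heights
    r = np.sqrt(np.clip(1 - z ** 2, 0, 1))
    xyz = np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)
    return (xyz + 1.0) / 2.0                       # RGB triplets in [0, 1]
```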
https://arxiv.org/abs/2501.09008
In this project, we address the issue of infidelity in text-to-image generation, particularly for actions involving multiple objects. For this, we build on the CONFORM framework, which uses contrastive learning to improve the accuracy of generated images containing multiple objects. However, the depiction of actions involving multiple different objects still leaves large room for improvement. To improve this, we employ semantically hypergraphic contrastive adjacency learning, an enhanced contrastive structure combined with a "contrast but link" technique. We further amend Stable Diffusion's understanding of actions with InteractDiffusion. As evaluation metrics, we use CLIP image-text similarity and TIFA. In addition, we conducted a user study. Our method shows promising results even with verbs that Stable Diffusion understands only moderately well. We then provide future directions by analyzing the results. Our codebase can be found on polybox under the link: this https URL
https://arxiv.org/abs/2501.09055
Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.
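The feature-enrichment step can be pictured as a sliding-window aggregation over neighbouring transformer layers; the window size and plain averaging (rather than a learned combination) are assumptions.

```python
import torch

def accumulate_neighbor_features(layer_features, window=3):
    """layer_features: list of (B, N, C) hidden states, one per layer.
    Returns enriched per-layer features, each averaged with its neighbours,
    to be fed into the attention mechanism in place of the raw features."""
    half = window // 2
    enriched = []
    for i in range(len(layer_features)):
        lo, hi = max(0, i - half), min(len(layer_features), i + half + 1)
        enriched.append(torch.stack(layer_features[lo:hi], dim=0).mean(dim=0))
    return enriched
```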
https://arxiv.org/abs/2501.08994
Localizing text descriptions in large-scale 3D scenes is an inherently ambiguous task. Such ambiguity nonetheless arises when describing general concepts, e.g. all traffic lights in a city. To facilitate reasoning based on such concepts, text localization in the form of a distribution is required. In this paper, we generate the distribution of camera poses conditioned upon the textual description. To facilitate such generation, we propose a diffusion-based architecture that conditionally diffuses noisy 6DoF camera poses to their plausible locations. The conditional signals are derived from the text descriptions using pre-trained text encoders. The connection between text descriptions and pose distributions is established through a pretrained vision-language model, i.e. CLIP. Furthermore, we demonstrate that the candidate poses of the distribution can be further refined by rendering potential poses using 3D Gaussian splatting, guiding incorrectly posed samples towards locations that better align with the textual description through visual reasoning. We demonstrate the effectiveness of our method by comparing it with both standard retrieval methods and learning-based approaches. Our proposed method consistently outperforms these baselines across all five large-scale datasets. Our source code and dataset will be made publicly available.
https://arxiv.org/abs/2501.08982
In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.
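A bare-bones version of the cross-attention exchange between the appearance and shape branches is sketched below; learned projections, the enhanced attention (EA) module, and the multi-scale blocks are omitted.

```python
import torch

def cross_attention(appearance, shape):
    """appearance, shape: (B, C, H, W) feature maps from the two branches.
    Returns appearance features updated by attending to the shape features."""
    b, c, h, w = appearance.shape
    q = appearance.flatten(2).transpose(1, 2)              # (B, HW, C) queries
    k = shape.flatten(2).transpose(1, 2)                   # (B, HW, C) keys
    v = k                                                  # values from the shape branch
    attn = torch.softmax(q @ k.transpose(1, 2) * c ** -0.5, dim=-1)   # (B, HW, HW)
    out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
    return appearance + out                                # residual update
```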
https://arxiv.org/abs/2501.08900
Our work addresses the problem of stochastic long-term dense anticipation. The goal of this task is to predict actions and their durations several minutes into the future based on provided video observations. Anticipation over extended horizons introduces high uncertainty, as a single observation can lead to multiple plausible future outcomes. To address this uncertainty, stochastic models are designed to predict several potential future action sequences. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets - Breakfast, 50Salads, and Assembly101 - while also significantly improving computational and memory efficiency.
https://arxiv.org/abs/2501.08837