Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computational and memory costs can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. We further demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
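A noise-calibrated Thurstone (probit) likelihood can be sketched as follows. The probit form and the linear `sigma_schedule` are illustrative assumptions, not the paper's exact parameterization:

```python
import math

def thurstone_loglik(r_win, r_lose, sigma_t):
    """Log-likelihood that the preferred sample wins under a Thurstone
    (probit) model with noise-level-dependent uncertainty sigma_t:
    P(win) = Phi((r_win - r_lose) / (sqrt(2) * sigma_t))."""
    z = (r_win - r_lose) / (math.sqrt(2.0) * sigma_t)
    # Standard normal CDF via the error function.
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return math.log(max(p, 1e-12))

def sigma_schedule(t, sigma_min=0.1, sigma_max=2.0):
    """Hypothetical calibration: uncertainty grows with diffusion time
    t in [0, 1], so pairwise comparisons made at noisier states contribute
    softer (less confident) likelihood terms."""
    return sigma_min + (sigma_max - sigma_min) * t
```

Under this sketch, the same reward margin yields a flatter likelihood at high noise levels, which is one plausible way to make the loss noise-aware.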
https://arxiv.org/abs/2602.11146
Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule, unmasking the most confident positions first, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model's uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.
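The confidence-adaptive switch could look roughly like the following sketch; the threshold rule, `tau`, and `beam_width` are hypothetical stand-ins for SOAR's actual criteria:

```python
def soar_step(confidences, tau=0.9, beam_width=3):
    """One SOAR-style decision over still-masked positions.
    confidences: dict {position: confidence in [0, 1]}.
    In 'parallel' mode, all positions at or above tau are committed at once;
    in 'search' mode, the top-`beam_width` candidate positions are returned
    for wider exploration before committing."""
    confident = [p for p, c in confidences.items() if c >= tau]
    if confident:
        # High confidence: collapse the search, decode many positions at once.
        return "parallel", sorted(confident)
    # Low confidence: widen the search over alternative unmasking orders.
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return "search", ranked[:beam_width]
```

The key design point is that the expensive widened search is only paid for on uncertain steps, so easy spans still decode in few iterations.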
https://arxiv.org/abs/2602.10953
Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames, an assumption that breaks down in autoregressive models, where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain an independent caching policy, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by a joint importance-redundancy optimized KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality. Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality change (VBench: a 0.87 increase and a 0.79 decrease, respectively). These results demonstrate that FlowCache successfully unlocks the potential of autoregressive models for real-time, ultra-long video generation, establishing a new benchmark for efficient video synthesis at scale. The code is available at this https URL.
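A minimal sketch of a per-chunk caching decision, assuming cosine similarity of chunk features between adjacent timesteps as the recomputation criterion (the paper's actual importance-redundancy policy is more elaborate):

```python
import numpy as np

def chunk_cache_decisions(feats_prev, feats_curr, thresh=0.98):
    """Per-chunk caching policy: reuse a chunk's cached features when they
    are sufficiently similar to the previous timestep's, else recompute.
    feats_prev / feats_curr: lists of per-chunk feature vectors."""
    decisions = []
    for prev, curr in zip(feats_prev, feats_curr):
        cos = float(np.dot(prev, curr) /
                    (np.linalg.norm(prev) * np.linalg.norm(curr) + 1e-8))
        decisions.append("reuse" if cos >= thresh else "recompute")
    return decisions
```

Because each chunk gets its own decision, a static background chunk can be cached while a fast-moving chunk is recomputed at the same timestep.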
https://arxiv.org/abs/2602.10825
Recent advances in Neural Combinatorial Optimization (NCO) have been dominated by diffusion models that treat the Euclidean Traveling Salesman Problem (TSP) as a stochastic $N \times N$ heatmap generation task. In this paper, we propose CycFlow, a framework that replaces iterative edge denoising with deterministic point transport. CycFlow learns an instance-conditioned vector field that continuously transports input 2D coordinates to a canonical circular arrangement, where the optimal tour is recovered from this $2N$ dimensional representation via angular sorting. By leveraging data-dependent flow matching, we bypass the quadratic bottleneck of edge scoring in favor of linear coordinate dynamics. This paradigm shift accelerates solving by up to three orders of magnitude compared to state-of-the-art diffusion baselines, while maintaining competitive optimality gaps.
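The final tour-recovery step via angular sorting can be sketched directly; `tour_from_points` assumes the transported points are already centered on the canonical circle:

```python
import math

def tour_from_points(points):
    """Recover a tour ordering by sorting transported 2D points by polar
    angle about the origin, as in the canonical circular arrangement
    described above. points: list of (x, y) tuples. Returns point indices
    in tour order. O(N log N), versus O(N^2) edge scoring."""
    return sorted(range(len(points)),
                  key=lambda i: math.atan2(points[i][1], points[i][0]))
```

This is where the quadratic edge-heatmap bottleneck disappears: once points lie on a circle, the tour is just a sort over $N$ angles.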
https://arxiv.org/abs/2602.10794
The reconstruction of X-ray CT images from sparse or limited-angle geometries is a highly challenging task. The lack of data typically results in artifacts in the reconstructed image and may even lead to object distortions. For this reason, deep generative models are of great interest in this context. In the Deep Generative Prior (DGP) framework, diffusion-based generative models are combined with an iterative optimization algorithm to reconstruct CT images from sinograms acquired under sparse geometries, maintaining the explainability of a model-based approach while introducing the generative power of a neural network. Several aspects of these frameworks can be further investigated to improve reconstruction quality, namely image generation, the model, and the iterative algorithm used to solve the minimization problem; we propose modifications to existing approaches in each of these respects. The results obtained even under highly sparse geometries are very promising, although further research is clearly needed in this direction.
https://arxiv.org/abs/2602.10722
Graph Domain Adaptation (GDA) aims to bridge distribution shifts between domains by transferring knowledge from well-labeled source graphs to given unlabeled target graphs. One promising recent approach addresses graph transfer by discretizing the adaptation process, typically through the construction of intermediate graphs or stepwise alignment procedures. However, such discrete strategies often fail in real-world scenarios, where graph structures evolve continuously and nonlinearly, making it difficult for fixed-step alignment to approximate the actual transformation process. To address these limitations, we propose \textbf{DiffGDA}, a \textbf{Diff}usion-based \textbf{GDA} method that models the domain adaptation process as a continuous-time generative process. We formulate the evolution from source to target graphs using stochastic differential equations (SDEs), enabling the joint modeling of structural and semantic transitions. To guide this evolution, a domain-aware network is introduced to steer the generative process toward the target domain, encouraging the diffusion trajectory to follow an optimal adaptation path. We theoretically show that the diffusion process converges to the optimal solution bridging the source and target domains in the latent space. Extensive experiments on 14 graph transfer tasks across 8 real-world datasets demonstrate DiffGDA consistently outperforms state-of-the-art baselines.
https://arxiv.org/abs/2602.10506
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms competing methods on multiple metrics, with the best overall performance surpassing leading video generation models such as Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at this https URL.
https://arxiv.org/abs/2602.10113
Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
https://arxiv.org/abs/2602.10102
Leveraging representation encoders for generative modeling offers a path for efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck, proposing computationally expensive width scaling of diffusion transformers, we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling: the standard DiT-B architecture (131M parameters) converges effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: this https URL
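The geometric point is easy to illustrate: linear interpolation between unit-norm features leaves the hypersphere (passing through its low-density interior), while geodesic (slerp) interpolation stays on it. This toy sketch is not the paper's full RJF objective:

```python
import numpy as np

def slerp(x0, x1, t):
    """Geodesic (great-circle) interpolation between unit vectors x0, x1.
    Unlike the straight-line paths of Euclidean flow matching, the path
    stays on the hypersphere for all t in [0, 1]."""
    x0 = x0 / np.linalg.norm(x0)
    x1 = x1 / np.linalg.norm(x1)
    omega = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if omega < 1e-8:  # nearly identical directions: geodesic degenerates
        return x0
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)
```

For two orthogonal unit vectors, the Euclidean midpoint has norm $\sqrt{1/2} \approx 0.71$, i.e. it sits well inside the sphere, exactly the interference region the abstract describes.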
https://arxiv.org/abs/2602.10099
Causality -- referring to temporal, uni-directional cause-effect relationships between components -- underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
https://arxiv.org/abs/2602.10095
Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem. One popular choice for constructing these representations is through the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to perform decomposition on such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes using a reconstruction-driven guidance term that encourages the composition of factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.
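As a concrete instance of the binding operation, one common choice is circular convolution (as in holographic reduced representations). This is an illustrative assumption, since the abstract does not fix a particular binding operator:

```python
import numpy as np

def bind(a, b):
    """Bind two factor vectors by circular convolution, computed in the
    Fourier domain: fft(a * b) = fft(a) . fft(b) elementwise."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):
    """Recover the other factor from a bound vector, assuming the known
    factor `a` has a non-vanishing Fourier spectrum (exact inverse;
    correlation-based approximate unbinding is the noisier alternative)."""
    return np.real(np.fft.ifft(np.fft.fft(c) / np.fft.fft(a)))
```

A reconstruction-driven guidance term as described above would then penalize the mismatch between `bind` of the current factor estimates and the observed bound vector.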
https://arxiv.org/abs/2602.09983
Reconstructing the early Universe from the evolved present-day Universe is a challenging and computationally demanding problem in modern astrophysics. We devise a novel generative framework, Cosmo3DFlow, designed to address dimensionality and sparsity, the critical bottlenecks inherent in current state-of-the-art methods for cosmological inference. By integrating 3D Discrete Wavelet Transform (DWT) with flow matching, we effectively represent high-dimensional cosmological structures. The Wavelet Transform addresses the ``void problem'' by translating spatial emptiness into spectral sparsity. It decouples high-frequency details from low-frequency structures through spatial compression, and wavelet-space velocity fields facilitate stable ordinary differential equation (ODE) solvers with large step sizes. Using large-scale cosmological $N$-body simulations at $128^3$ resolution, we achieve up to $50\times$ faster sampling than diffusion models, combining a $10\times$ reduction in integration steps with lower per-step computational cost from wavelet compression. Our results enable initial conditions to be sampled in seconds, compared to minutes for previous methods.
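The ``void problem'' intuition can be seen with a single-level Haar transform, the simplest wavelet: constant (empty) regions produce all-zero detail coefficients. A 1D sketch (the paper applies a full 3D DWT):

```python
import numpy as np

def haar_1d(x):
    """Single-level orthonormal Haar DWT: pairwise averages (low-pass) and
    pairwise differences (high-pass). Constant regions, such as cosmic
    voids, yield zero detail coefficients, turning spatial emptiness into
    spectral sparsity. Assumes len(x) is even."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail
```

The transform is orthonormal, so energy is preserved while most coefficients in empty regions vanish, which is what makes the compressed representation cheap to integrate.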
https://arxiv.org/abs/2602.10172
Music stem generation, the task of producing musically-synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference-time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further expand our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls to empower users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating the full mix generation process by 25 to 50%. Demos at: this https URL.
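The shared-noise batching trick described above can be sketched as follows; `make_stem_batch` is a hypothetical helper illustrating how stems in one group share an initial latent while remaining separate batch elements:

```python
import numpy as np

def make_stem_batch(group_sizes, latent_dim, seed=0):
    """Build a batch where each stem is a batch element and all stems in
    the same song group share one initial noise latent; stem-specific text
    conditioning (not modeled here) then differentiates the outputs.
    Returns (noise, group_ids) with noise of shape (sum(group_sizes), latent_dim)."""
    rng = np.random.default_rng(seed)
    noises, group_ids = [], []
    for gid, n_stems in enumerate(group_sizes):
        shared = rng.standard_normal(latent_dim)  # one latent per group
        for _ in range(n_stems):
            noises.append(shared.copy())
            group_ids.append(gid)
    return np.stack(noises), group_ids
```

At inference time the same shared-latent construction lets a single denoising pass emit a variable number of synchronized stems, which is the claimed one-pass speedup.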
https://arxiv.org/abs/2602.09891
Diffusion Transformers (DiTs) have emerged as the state-of-the-art backbone for high-fidelity image and video generation. However, their massive computational cost and memory footprint hinder deployment on edge devices. While post-training quantization (PTQ) has proven effective for large language models (LLMs), directly applying existing methods to DiTs yields suboptimal results due to the neglect of the unique temporal dynamics inherent in diffusion processes. In this paper, we propose AdaTSQ, a novel PTQ framework that pushes the Pareto frontier of efficiency and quality by exploiting the temporal sensitivity of DiTs. First, we propose a Pareto-aware timestep-dynamic bit-width allocation strategy. We model the quantization policy search as a constrained pathfinding problem. We utilize a beam search algorithm guided by end-to-end reconstruction error to dynamically assign layer-wise bit-widths across different timesteps. Second, we propose a Fisher-guided temporal calibration mechanism. It leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, seamlessly integrating with Hessian-based weight optimization. Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) demonstrate that AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q. Our code will be released at this https URL.
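A toy version of the beam search over per-layer bit-widths, under the simplifying assumptions of additive error proxies and a memory budget in total bits (the paper additionally varies assignments per timestep and scores candidates by end-to-end reconstruction error):

```python
def beam_search_bits(layer_costs, budget, bits=(4, 8), beam_width=4):
    """Beam search over per-layer bit-width assignments under a budget.
    layer_costs[i][b]: reconstruction-error proxy of layer i at bit-width b.
    Returns (total_error, assignment), or None if the budget is infeasible."""
    min_bit = min(bits)
    n = len(layer_costs)
    beams = [(0.0, 0, [])]  # (accumulated error, memory used, assignment)
    for i, costs in enumerate(layer_costs):
        floor = (n - i - 1) * min_bit  # cheapest possible cost of the rest
        candidates = []
        for err, mem, assign in beams:
            for b in bits:
                if mem + b + floor <= budget:  # prune infeasible branches
                    candidates.append((err + costs[b], mem + b, assign + [b]))
        candidates.sort(key=lambda c: c[0])
        beams = candidates[:beam_width]
        if not beams:
            return None
    err, _, assign = beams[0]
    return err, assign
```

Tightening the budget forces the search toward low-bit assignments, mirroring how the Pareto frontier is traced out.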
https://arxiv.org/abs/2602.09883
Building on recent advances in video generation, generative video compression has emerged as a new paradigm for achieving visually pleasing reconstructions. However, existing methods exhibit limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. In this paper, we propose Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior. Our method operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and progressively compressing them along the diffusion trajectory. To ensure perceptually consistent reconstruction across GOPs, we introduce an Adaptive Quality Control module that dynamically constructs an online rate-perception surrogate model to predict the optimal diffusion step for each GOP. In addition, an Inter-GOP Alignment module establishes frame overlap and performs latent fusion between adjacent groups, thereby mitigating flicker and enhancing temporal coherence. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
https://arxiv.org/abs/2602.09868
Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across $20$ common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for $48.0\%$ of samples, while South American and African countries are severely under-represented with only $1.8\%$ and $3.8\%$ of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data ($\rho = 0.82$). Examining non-English subsets for $4$ languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
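The reported correlation ($\rho = 0.82$) is a Spearman rank correlation; a minimal tie-free implementation:

```python
def spearman_rho(x, y):
    """Spearman rank correlation via rho = 1 - 6 * sum(d^2) / (n(n^2 - 1)).
    Assumes no ties in either list (ties need average ranks instead)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Rank correlation is the natural choice here since GDP and sample counts are both heavy-tailed, so a monotone relationship matters more than a linear one.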
https://arxiv.org/abs/2602.09775
The scarcity of high-quality segmentation masks remains a major bottleneck for medical image analysis, particularly in non-contrast CT (NCCT) neuroimaging, where manual annotation is costly and variable. To address this limitation, we propose an anatomy-preserving generative framework for the unconditional synthesis of multi-class brain segmentation masks, including ischemic infarcts. The proposed approach combines a variational autoencoder trained exclusively on segmentation masks to learn an anatomical latent representation, with a diffusion model operating in this latent space to generate new samples from pure noise. At inference, synthetic masks are obtained by decoding denoised latent vectors through the frozen VAE decoder, with optional coarse control over lesion presence via a binary prompt. Qualitative results show that the generated masks preserve global brain anatomy, discrete tissue semantics, and realistic variability, while avoiding the structural artifacts commonly observed in pixel-space generative models. Overall, the proposed framework offers a simple and scalable solution for anatomy-aware mask generation in data-scarce medical imaging scenarios.
https://arxiv.org/abs/2602.10167
Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline that separates the generation into: 1) Controllable Skeleton Generation: we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton's graph structure into a latent space, where the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE's decoder reconstructing the final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig and SKA-DPO, where we then synthesize a textured mesh conditioned on the generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh model by augmenting its training data with TextuRig: a dataset of textured and rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh alignment score, to further improve geometric fidelity. Together, our framework enables a more intuitive workflow for creating ready-to-animate 3D content. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that Stroke3D produces plausible skeletons and high-quality meshes.
https://arxiv.org/abs/2602.09713
We propose a methodology that combines generative latent diffusion models with physics-informed machine learning to generate solutions of parametric partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid-based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics-informed constraints and measurement conditions during inference, applying Adam-based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier--Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion-based PDE solvers, which are state of the art for sparse observations. Code is available at this https URL.
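The inference-time guidance described above (physics-informed constraints and measurement conditions enforced via Adam-based updates at each diffusion step) can be sketched as follows. This is a minimal sketch under stated assumptions: the function and argument names (`guided_reverse_step`, `pde_residual`, `measure_err`, `n_inner`) are illustrative, and the re-noising step of the sampler is omitted.

```python
# Hypothetical sketch of one guided reverse-diffusion step: the denoiser's
# clean-latent estimate is refined by a few Adam updates that penalize the
# PDE residual and the mismatch with partial observations.
import torch

def guided_reverse_step(z_t, t, denoiser, pde_residual, measure_err,
                        n_inner=3, lr=1e-2):
    """One reverse step with inner physics/measurement-guided Adam updates."""
    # Predicted clean latent, detached so it becomes a leaf we can optimize.
    z0_hat = denoiser(z_t, t).detach().requires_grad_(True)
    opt = torch.optim.Adam([z0_hat], lr=lr)
    for _ in range(n_inner):
        # Physics-informed constraint + measurement-consistency condition.
        loss = pde_residual(z0_hat) + measure_err(z0_hat)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # ...re-noise z0_hat to noise level t-1 according to the chosen sampler (omitted).
    return z0_hat.detach()
```

Because the latent lives in a scaled spectral representation, both loss terms can be evaluated on well-defined functions, which is what makes this per-step correction meaningful.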
https://arxiv.org/abs/2602.09708
Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.
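The decoupling of instruction parsing from video synthesis can be illustrated with a small sketch. All names here (`tele_omni_generate`, `mllm.parse`, `generator.synthesize`, the intent dictionary layout) are assumptions for exposition, not the paper's actual interface.

```python
# Hypothetical sketch of the decoupled Tele-Omni design: a pretrained MLLM
# parses heterogeneous instructions (text, images, reference videos) into a
# structured intent, and a diffusion generator consumes only that intent.
def tele_omni_generate(text, images=None, ref_videos=None, *, mllm, generator):
    # Stage 1: parse the multimodal instruction into a structured, task-aware intent,
    # e.g. {"task": "in_context_video_editing", "conditions": {...}}.
    intent = mllm.parse(text=text, images=images, videos=ref_videos)
    # Stage 2: diffusion-based synthesis conditioned on the structured signal.
    return generator.synthesize(intent)
```

Because the generator sees only the structured intent, new task types (first-last-frame generation, in-context editing, and so on) can be added by extending the parsing and data-processing side without retooling the synthesis backbone.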
https://arxiv.org/abs/2602.09609