Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.
https://arxiv.org/abs/2505.17013
Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. Additionally, we introduce two new metrics for more robustly and affordably evaluating motion and guiding the development of motion learning methods: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.
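The two proposed metrics are simple to compute once motion embeddings are extracted. Below is a minimal sketch of one way to implement them, assuming PyTorch tensors, a closed-form ridge-regression probe, and illustrative embedding/action dimensions; none of these specifics are taken from the paper.

```python
import torch
import torch.nn.functional as F

def linear_probe_mse(motion_emb, actions, reg=1e-4):
    """Fit a ridge-regularized linear probe from motion embeddings to ground-truth
    actions and return its mean-squared error (lower = embeddings carry more
    action-relevant information)."""
    X = torch.cat([motion_emb, torch.ones(len(motion_emb), 1)], dim=1)   # append a bias column
    A = X.T @ X + reg * torch.eye(X.shape[1])
    W = torch.linalg.solve(A, X.T @ actions)                             # closed-form probe weights
    return F.mse_loss(X @ W, actions).item()

def motion_consistency(past_to_current, future_to_current):
    """Cosine similarity between past-to-current and future-to-current motion
    embeddings (the second proposed metric); averaging over clips is an assumption."""
    return F.cosine_similarity(past_to_current, future_to_current, dim=-1).mean().item()

# toy usage with random stand-ins for encoder outputs (dimensions are illustrative)
emb = torch.randn(512, 32)            # 512 clips, 32-d motion embeddings
act = torch.randn(512, 7)             # 7-DoF robot actions
print(linear_probe_mse(emb, act))
print(motion_consistency(torch.randn(512, 32), torch.randn(512, 32)))
```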
https://arxiv.org/abs/2505.17006
We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie's formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% observation, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at this https URL
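The inference-time guidance can be read as a diffusion-posterior-sampling loop: form a Tweedie-style estimate of the clean solution, measure the mismatch at the observed locations, and correct the sample along the gradient. The sketch below is a finite-dimensional caricature of one such step; the `denoiser` interface, the noise parameterization, and the guidance weight `zeta` are assumptions, and the actual FunDPS method operates in function space with neural operators.

```python
import torch

def guided_step(x_t, t, sigma_t, sigma_prev, denoiser, obs_mask, obs_values, zeta=1.0):
    """One plug-and-play guidance step in the DPS spirit: Tweedie-style estimate of the
    clean field, a schematic unconditional update, then a gradient correction that pulls
    the sample toward the sparse observations. All interfaces here are placeholders."""
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)                           # predicted noise (placeholder signature)
    x0_hat = x_t - sigma_t * eps                     # Tweedie-style posterior-mean estimate
    residual = obs_mask * (x0_hat - obs_values)      # mismatch only at observed points
    grad = torch.autograd.grad(residual.pow(2).sum(), x_t)[0]
    x_prev = x0_hat + sigma_prev * eps               # schematic DDIM-like unconditional move
    return (x_prev - zeta * grad).detach()

# toy usage: a 64x64 field with ~3% of points observed
x = torch.randn(1, 1, 64, 64)
mask = (torch.rand_like(x) < 0.03).float()
obs = torch.randn_like(x) * mask
toy_denoiser = lambda x, t: torch.zeros_like(x)      # stand-in for the trained neural operator
x = guided_step(x, t=0, sigma_t=1.0, sigma_prev=0.5, denoiser=toy_denoiser, obs_mask=mask, obs_values=obs)
```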
https://arxiv.org/abs/2505.17004
In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple can be as low as $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique used in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at this https URL.
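Confident decoding, as described, amounts to committing every masked position whose prediction clears a confidence threshold at each step, while always committing at least the most confident masked token per sequence so decoding keeps making progress. The following is a hedged sketch of such a loop; the `model` signature, the threshold value, and the greedy fallback are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def confident_decoding(model, x, mask_id, threshold=0.9, max_steps=256):
    """Schematic confident-decoding loop for a masked-diffusion LM.

    Each step predicts all positions in parallel, commits every masked position
    whose confidence exceeds `threshold`, and always commits at least the single
    most confident masked token per sequence so the loop keeps making progress.
    """
    for _ in range(max_steps):
        masked = x == mask_id
        if not masked.any():
            break
        logits = model(x)                                  # (batch, seq, vocab), placeholder interface
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~masked, -1.0)             # only masked slots compete
        top1 = F.one_hot(conf.argmax(dim=-1), num_classes=x.shape[1]).bool()
        commit = (masked & (conf >= threshold)) | (top1 & masked)
        x = torch.where(commit, pred, x)
    return x

# toy usage: a 16-token response with every position initially masked
vocab, mask_id = 100, 99
fake_model = lambda x: torch.randn(x.shape[0], x.shape[1], vocab)   # stand-in for the real model
print(confident_decoding(fake_model, torch.full((1, 16), mask_id), mask_id, threshold=0.5))
```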
https://arxiv.org/abs/2505.16990
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then carefully designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on VITON-HD, VVT and ViViD datasets demonstrate the superiority of our DPIDM against the baseline methods. Notably, DPIDM achieves a VFID score of 0.506 on the VVT dataset, a 60.5% improvement over the state-of-the-art GPD-VVTO approach.
https://arxiv.org/abs/2505.16980
Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the standard recipe for the VTON task. Nevertheless, it remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of the diffusion model. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as the prior to tame the diffusion process instead of simply feeding the whole garment into the UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to the ones over the target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with the depth/normal map of the target person. The correspondence mimics the way of putting clothing on the human body, and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to fully take advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performances on both VITON-HD and DressCode datasets. Code is publicly available at: this https URL.
https://arxiv.org/abs/2505.16977
Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 $\times$ 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe of tuning-free image upscaling that pivots on global-regional priors derived from the given global prompt and regional prompts estimated via a Multimodal LLM. Technically, the low-frequency component of the low-resolution image is recognized as a global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between the global prompt and each region during regional denoising, leading to a regional attention prior that alleviates the object repetition issue. The estimated regional prompts containing rich descriptive details further act as a regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 $\times$ 4,096 and 8,192 $\times$ 8,192) with higher visual fidelity and more creative regional details.
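One simple way to realize the global structure prior, i.e., the low-frequency component of the low-resolution image, is an FFT low-pass filter; the sketch below illustrates this, with the hard circular mask and cutoff ratio being assumptions rather than the paper's exact choice.

```python
import torch

def low_frequency_prior(image, cutoff_ratio=0.25):
    """Keep only the low-frequency component of a (B, C, H, W) image via an FFT
    low-pass filter, one simple realization of a global structure prior."""
    freq = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    _, _, h, w = image.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
    mask = (dist <= cutoff_ratio * min(h, w) / 2).float()    # hard circular mask (assumption)
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    return low

# toy usage on a 256x256 low-resolution input
print(low_frequency_prior(torch.rand(1, 3, 256, 256)).shape)
```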
https://arxiv.org/abs/2505.16976
Equivariant models have recently been shown to improve the data efficiency of diffusion policy by a significant margin. However, prior work that explored this direction focused primarily on point cloud inputs generated by multiple cameras fixed in the workspace. This type of point cloud input is not compatible with the now-common setting where the primary input modality is an eye-in-hand RGB camera like a GoPro. This paper closes this gap by incorporating into the diffusion policy model a process that projects features from the 2D RGB camera image onto a sphere. This enables us to reason about symmetries in SO(3) without explicitly reconstructing a point cloud. We perform extensive experiments in both simulation and the real world that demonstrate that our method consistently outperforms strong baselines in terms of both performance and sample efficiency. Our work is the first SO(3)-equivariant policy learning framework for robotic manipulation that works using only monocular RGB inputs.
https://arxiv.org/abs/2505.16969
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: this https URL.
https://arxiv.org/abs/2505.16933
Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
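The covert-dynamics measurements are standard quantities over hidden states and attention distributions. The sketch below shows one plausible way to compute per-round cosine, entropy, JS, and Spearman drift signals with NumPy/SciPy; which internal signal each drift is computed over, and the pooling choices, are assumptions.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy, spearmanr

def drift_metrics(h_ref, h_round, attn_ref, attn_round):
    """Per-round internal-state drift signals of the kind described above:
    cosine drift of pooled hidden states, entropy and JS drift of an attention
    distribution, and the Spearman rank correlation of attention weights
    (the paper's exact drift definitions may differ)."""
    cos_drift = 1.0 - float(np.dot(h_ref, h_round) /
                            (np.linalg.norm(h_ref) * np.linalg.norm(h_round)))
    ent_drift = float(abs(entropy(attn_round) - entropy(attn_ref)))
    js_drift = float(jensenshannon(attn_ref, attn_round) ** 2)   # squared distance = JS divergence
    rho, _ = spearmanr(attn_ref, attn_round)
    return {"cosine": cos_drift, "entropy": ent_drift, "js": js_drift, "spearman_rho": float(rho)}

# toy usage: 768-d pooled hidden states and a 32-token attention row per round
h0, h1 = np.random.randn(768), np.random.randn(768)
a0, a1 = np.random.dirichlet(np.ones(32)), np.random.dirichlet(np.ones(32))
print(drift_metrics(h0, h1, a0, a1))
```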
https://arxiv.org/abs/2505.16894
Shared autonomy is an enabling technology that provides users with control authority over robots that would otherwise be difficult if not impossible to directly control. Yet, standard methods make assumptions that limit their adoption in practice-for example, prior knowledge of the user's goals or the objective (i.e., reward) function that they wish to optimize, knowledge of the user's policy, or query-level access to the user during training. Diffusion-based approaches to shared autonomy do not make such assumptions and instead only require access to demonstrations of desired behaviors, while allowing the user to maintain control authority. However, these advantages have come at the expense of high computational complexity, which has made real-time shared autonomy all but impossible. To overcome this limitation, we propose Consistency Shared Autonomy (CSA), a shared autonomy framework that employs a consistency model-based formulation of diffusion. Key to CSA is that it employs the distilled probability flow ordinary differential equation (PF ODE) to generate high-fidelity samples in a single step. This results in inference speeds significantly faster than what is possible with previous diffusion-based approaches to shared autonomy, enabling real-time assistance in complex domains with only a single function evaluation. Further, by intervening on flawed actions at intermediate states of the PF ODE, CSA enables varying levels of assistance. We evaluate CSA on a variety of challenging simulated and real-world robot control problems, demonstrating significant improvements over state-of-the-art methods both in terms of task performance and computational efficiency.
https://arxiv.org/abs/2505.16892
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.
https://arxiv.org/abs/2505.16875
Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: this https URL
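The progressive-resolution half of the pipeline can be pictured as a per-step scale schedule plus an upsampling of the video latent when the stage switches. The sketch below illustrates that idea only (it does not attempt the space-filling-curve attention carving); the two-stage split and the 0.5x first-stage scale are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def progressive_resolution_schedule(num_steps, stages=((0.0, 0.5), (0.5, 1.0)), scales=(0.5, 1.0)):
    """Map each denoising step to a latent scale factor: early steps run at reduced
    resolution, later steps at full resolution. Two stages and a 0.5x first stage
    are illustrative choices, not the paper's exact schedule."""
    scale_per_step = []
    for i in range(num_steps):
        frac = i / max(num_steps - 1, 1)
        for (lo, hi), s in zip(stages, scales):
            if lo <= frac <= hi:
                scale_per_step.append(s)
                break
    return scale_per_step

def maybe_upsample(latent, prev_scale, new_scale):
    """When the schedule switches stages, upsample the video latent (B, C, T, H, W)."""
    if new_scale == prev_scale:
        return latent
    b, c, t, h, w = latent.shape
    factor = new_scale / prev_scale
    return F.interpolate(latent, size=(t, int(h * factor), int(w * factor)),
                         mode="trilinear", align_corners=False)

# toy usage: 50 denoising steps, spatially smaller latent for the first half of generation
schedule = progressive_resolution_schedule(50)
lat = torch.randn(1, 4, 8, 32, 32)
lat = maybe_upsample(lat, schedule[0], schedule[-1])
print(len(schedule), lat.shape)
```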
https://arxiv.org/abs/2505.16864
Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas due to the violation of the identically and independently distributed (i.i.d.) Gaussian noise assumption caused by their spherical mapping. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a cohesive architecture, enabling seamless generation across tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.
https://arxiv.org/abs/2505.16862
Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.
https://arxiv.org/abs/2505.16839
While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at this https URL and this https URL.
https://arxiv.org/abs/2505.16832
A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker embeddings extracted from clean and noisy speech, respectively, via the forward process of a diffusion model, and then reconstructs them to clean embeddings in the reverse process. During inference, all embeddings are regenerated via the diffusion process. Our method needs neither speaker labels nor any modification to the existing speaker recognition pipeline. Experiments on evaluation sets simulating environment mismatch scenarios show that our method can improve recognition accuracy by up to 19.6% over baseline models while retaining performance on conventional scenarios. We publish our code at this https URL
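The training recipe follows a standard DDPM-style forward process applied to speaker embeddings. A minimal sketch of that forward noising step is shown below; the noise schedule, embedding dimensionality, and batch shapes are assumptions.

```python
import torch

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; returns the cumulative alpha_bar_t values."""
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)

def forward_diffuse(emb, t, alpha_bar):
    """Add Gaussian noise to a speaker embedding at timestep t -- the forward
    process applied to both clean and noisy embeddings during training."""
    noise = torch.randn_like(emb)
    a = alpha_bar[t].sqrt().unsqueeze(-1)
    s = (1.0 - alpha_bar[t]).sqrt().unsqueeze(-1)
    return a * emb + s * noise, noise

# toy usage: a batch of 192-d embeddings (dimension is an assumption) at random timesteps
alpha_bar = make_schedule()
emb = torch.randn(8, 192)
t = torch.randint(0, 1000, (8,))
x_t, eps = forward_diffuse(emb, t, alpha_bar)
# The reverse network would be trained to map x_t (from either clean or noisy
# embeddings) back toward the clean embedding; at test time every embedding is
# regenerated through this diffusion process before scoring.
```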
https://arxiv.org/abs/2505.16798
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination that deactivates the alignment loss, once a simple trigger such as a fixed iteration is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256$\times$256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28$\times$ reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, demonstrating that it is a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at this https URL .
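The stage-wise termination itself reduces to gating the alignment terms on an iteration trigger. A minimal sketch of that gating is shown below; the loss names, the weight `lambda_align`, and the trigger value are placeholders, not values from the paper.

```python
import torch

def haste_loss(diffusion_loss, attn_distill_loss, feat_align_loss,
               step, termination_step=100_000, lambda_align=0.5):
    """Stage-wise loss gating in the spirit of the two-phase schedule described above:
    Phase I adds holistic alignment terms (attention-map distillation + feature
    projection alignment) to the denoising loss; Phase II drops them entirely once a
    fixed-iteration trigger is hit."""
    if step < termination_step:                      # Phase I: holistic alignment
        return diffusion_loss + lambda_align * (attn_distill_loss + feat_align_loss)
    return diffusion_loss                            # Phase II: pure denoising

# toy usage with scalar stand-ins for the three loss terms
l = haste_loss(torch.tensor(0.8), torch.tensor(0.2), torch.tensor(0.1), step=10)
print(l.item())
```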
https://arxiv.org/abs/2505.16792
Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore their potential and introduce the surprising result that naively applying standard MDMs severely degrades performance. We identify the critical cause of this issue as a state-clashing problem, where the forward diffusion of distinct molecules collapses into a common state, resulting in a mixture of reconstruction targets that cannot be learned by a typical reverse diffusion process with unimodal predictions. To mitigate this, we propose Masked Element-wise Learnable Diffusion (MELD), which orchestrates per-element corruption trajectories to avoid collision between distinct molecular graphs. This is achieved through a parameterized noise scheduling network that assigns distinct corruption rates to individual graph elements, i.e., atoms and bonds. Extensive experiments on diverse molecular benchmarks reveal that MELD markedly enhances overall generation quality compared to element-agnostic noise scheduling, increasing the chemical validity of vanilla MDMs on ZINC250K from 15% to 93%. Furthermore, it achieves state-of-the-art property alignment in conditional generation tasks.
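The core mechanism is a noise-scheduling network that gives every atom and bond its own corruption rate, so distinct molecular graphs follow distinct masking trajectories. The sketch below illustrates that idea with a toy scheduler; the architecture and the way the per-element rate is combined with the global timestep are assumptions.

```python
import torch
import torch.nn as nn

class ElementwiseNoiseScheduler(nn.Module):
    """Toy parameterized scheduler that assigns each graph element (atom or bond)
    its own corruption rate, modulated by the global timestep t in [0, 1].
    The architecture and the multiplicative combination with t are assumptions."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, element_feats, t):
        base_rate = torch.sigmoid(self.net(element_feats)).squeeze(-1)  # per-element rate in (0, 1)
        return (base_rate * t).clamp(max=1.0)                           # corruption probability at time t

def corrupt(element_tokens, mask_token, rates):
    """Mask each element independently with its own rate, yielding per-element
    corruption trajectories instead of a single shared masking schedule."""
    drop = torch.rand_like(rates) < rates
    return torch.where(drop, torch.full_like(element_tokens, mask_token), element_tokens)

# toy usage: 20 elements (atoms + bonds) with 16-d features
sched = ElementwiseNoiseScheduler(16)
feats = torch.randn(20, 16)
tokens = torch.randint(0, 10, (20,))
rates = sched(feats, t=torch.tensor(0.7))
print(corrupt(tokens, mask_token=99, rates=rates))
```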
https://arxiv.org/abs/2505.16790
Datasets in engineering domains are often small, sparsely labeled, and contain numerical as well as categorical conditions. Additionally, computational resources are typically limited in practical applications, which hinders the adoption of generative models for engineering tasks. We introduce a novel masked-conditioning approach that enables generative models to work with sparse, mixed-type data. We mask conditions during training to simulate sparse conditions at inference time. For this purpose, we explore the use of various sparsity schedules that show different strengths and weaknesses. In addition, we introduce a flexible embedding that deals with categorical as well as numerical conditions. We integrate our method into an efficient variational autoencoder as well as a latent diffusion model and demonstrate the applicability of our approach on two engineering-related datasets of 2D point clouds and images. Finally, we show that small models trained on limited data can be coupled with large pretrained foundation models to improve generation quality while retaining the controllability induced by our conditioning scheme.
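The masked-conditioning idea combines a mixed-type condition embedding with per-condition dropout whose rate can follow a sparsity schedule during training. Below is a minimal sketch under those assumptions; all dimensions and the shared 'unknown' token are illustrative.

```python
import torch
import torch.nn as nn

class MaskedMixedConditioner(nn.Module):
    """Minimal sketch of masked conditioning: numerical and categorical conditions
    get their own embeddings, and each condition is independently dropped (replaced
    by a learned 'unknown' token) with probability p_mask, which can be varied over
    training according to a sparsity schedule. Dimensions are illustrative."""
    def __init__(self, num_numerical, categorical_sizes, dim=64):
        super().__init__()
        self.num_proj = nn.Linear(num_numerical, dim)
        self.cat_embs = nn.ModuleList(nn.Embedding(n, dim) for n in categorical_sizes)
        self.unknown = nn.Parameter(torch.zeros(1, dim))     # shared 'condition missing' token

    def forward(self, numerical, categorical, p_mask=0.5):
        parts = [self.num_proj(numerical)]
        parts += [emb(categorical[:, i]) for i, emb in enumerate(self.cat_embs)]
        out = []
        for p in parts:                                      # mask each condition independently
            drop = (torch.rand(p.shape[0], 1) < p_mask).float()
            out.append(drop * self.unknown + (1.0 - drop) * p)
        return torch.stack(out, dim=1).sum(dim=1)            # pooled conditioning vector

# toy usage: 2 numerical conditions, 2 categorical conditions with 5 and 3 classes
cond = MaskedMixedConditioner(2, [5, 3])
z = cond(torch.randn(4, 2), torch.randint(0, 3, (4, 2)), p_mask=0.7)
print(z.shape)  # (4, 64)
```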
https://arxiv.org/abs/2505.16725