Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
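The adaptive keyframe allocation described above can be sketched with a simple heuristic: budget keyframes in proportion to cumulative camera motion, then spread them evenly along the trajectory. The complexity measure, the weights, and the function names below are illustrative assumptions for this sketch; the paper's keyframe-count predictor is a learned model.

```python
def keyframe_count(poses, min_k=2, max_k=16, per_unit=4.0):
    """Heuristic keyframe budget: trajectories with more cumulative
    camera motion (translation plus rotation change) earn more keyframes."""
    complexity = 0.0
    for a, b in zip(poses, poses[1:]):          # poses: (x, y, z, yaw, pitch, roll)
        trans = sum(abs(b[i] - a[i]) for i in range(3))
        rot = sum(abs(b[i] - a[i]) for i in range(3, 6))
        complexity += trans + 0.5 * rot
    return max(min_k, min(max_k, round(min_k + per_unit * complexity)))

def keyframe_indices(n_frames, k):
    """Spread k keyframes evenly across an n_frames trajectory."""
    if k == 1:
        return [0]
    step = (n_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]
```

A straight dolly move would get the sparse end of the budget, while a trajectory with large rotations would get a denser one, after which the frames between keyframes are rendered from the reconstructed 3D representation.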
https://arxiv.org/abs/2601.09697
Longitudinal brain MRI is essential for lifespan studies, yet high attrition rates often lead to missing data, complicating analysis. Deep generative models have been explored, but most rely solely on image intensity, leading to two key limitations: 1) the fidelity and trustworthiness of the generated brain images are limited, making downstream studies questionable; 2) usage flexibility is restricted by fixed guidance baked into the model structure, limiting applicability across versatile application scenarios. To address these challenges, we introduce DF-DiffCom, a Kolmogorov-Arnold Networks (KAN)-enhanced diffusion model that leverages deformation fields for trustworthy longitudinal brain image completion. Trained on OASIS-3, DF-DiffCom outperforms state-of-the-art methods, improving PSNR by 5.6% and SSIM by 0.12. More importantly, its modality-agnostic nature allows smooth extension to varied MRI modalities, and even to attribute maps such as brain tissue segmentation results.
https://arxiv.org/abs/2601.09572
Egocentric Human-Object Interaction (EHOI) analysis is crucial for industrial safety, yet the development of robust models is hindered by the scarcity of annotated domain-specific data. We address this challenge by introducing a data generation framework that combines synthetic data with a diffusion-based process to augment real-world images with realistic Personal Protective Equipment (PPE). We present GlovEgo-HOI, a new benchmark dataset for industrial EHOI, and GlovEgo-Net, a model integrating Glove-Head and Keypoint-Head modules to leverage hand pose information for enhanced interaction detection. Extensive experiments demonstrate the effectiveness of the proposed data generation framework and GlovEgo-Net. To foster further research, we release the GlovEgo-HOI dataset, augmentation pipeline, and pre-trained models at: GitHub project.
https://arxiv.org/abs/2601.09528
Enabling humanoid robots to physically interact with humans is a critical frontier, but progress is hindered by the scarcity of high-quality Human-Humanoid Interaction (HHoI) data. While leveraging abundant Human-Human Interaction (HHI) data presents a scalable alternative, we first demonstrate that standard retargeting fails by breaking the essential contacts. We address this with PAIR (Physics-Aware Interaction Retargeting), a contact-centric, two-stage pipeline that preserves contact semantics across morphology differences to generate physically consistent HHoI data. This high-quality data, however, exposes a second failure: conventional imitation learning policies merely mimic trajectories and lack interactive understanding. We therefore introduce D-STAR (Decoupled Spatio-Temporal Action Reasoner), a hierarchical policy that disentangles when to act from where to act. In D-STAR, Phase Attention (when) and a Multi-Scale Spatial module (where) are fused by the diffusion head to produce synchronized whole-body behaviors beyond mimicry. By decoupling these reasoning streams, our model learns robust temporal phases without being distracted by spatial noise, leading to responsive, synchronized collaboration. We validate our framework through extensive and rigorous simulations, demonstrating significant performance gains over baseline approaches and a complete, effective pipeline for learning complex whole-body interactions from HHI data.
https://arxiv.org/abs/2601.09518
Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively "dressing" the motion with texture and lighting. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls. Project page: this https URL
https://arxiv.org/abs/2601.09452
Generating safe and reliable trajectories for autonomous vehicles in long-tail scenarios remains a significant challenge, particularly for high-lateral-acceleration maneuvers such as sharp turns, which represent critical safety situations. Existing trajectory planners exhibit systematic failures in these scenarios due to data imbalance. This results in insufficient modelling of vehicle dynamics, road geometry, and environmental constraints in high-risk situations, leading to suboptimal or unsafe trajectory prediction when vehicles operate near their physical limits. In this paper, we introduce ReflexDiffusion, a novel inference-stage framework that enhances diffusion-based trajectory planners through reflective adjustment. Our method introduces a gradient-based adjustment mechanism during the iterative denoising process: after each standard trajectory update, we compute the gradient between the conditional and unconditional noise predictions to explicitly amplify critical conditioning signals, including road curvature and lateral vehicle dynamics. This amplification enforces strict adherence to physical constraints, particularly improving stability during high-lateral-acceleration maneuvers where precise vehicle-road interaction is paramount. Evaluated on the nuPlan Test14-hard benchmark, ReflexDiffusion achieves a 14.1% improvement in driving score for high-lateral-acceleration scenarios over the state-of-the-art (SOTA) methods. This demonstrates that inference-time trajectory optimization can effectively compensate for training data sparsity by dynamically reinforcing safety-critical constraints near handling limits. The framework's architecture-agnostic design enables direct deployment to existing diffusion-based planners, offering a practical solution for improving autonomous vehicle safety in challenging driving conditions.
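The reflective adjustment step can be illustrated as a per-step correction applied after each standard denoising update, along the difference between the conditional and unconditional noise predictions. The sign convention, fixed strength, and function name here are assumptions for a minimal sketch; the actual mechanism amplifies curvature- and dynamics-specific conditioning signals during iterative denoising.

```python
def reflective_adjust(latent, eps_cond, eps_uncond, strength=0.3):
    """Nudge the latent trajectory sample along the direction separating
    conditional from unconditional noise predictions. Since denoising
    subtracts predicted noise, stepping against (eps_cond - eps_uncond)
    pushes the sample toward the conditioned mode (illustrative sketch)."""
    return [z - strength * (ec - eu)
            for z, ec, eu in zip(latent, eps_cond, eps_uncond)]
```

In a full sampler this correction would run once per denoising iteration, with the strength possibly scheduled over timesteps.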
https://arxiv.org/abs/2601.09377
We propose a timbre conversion model based on the Diffusion architecture designed to precisely translate music played by various instruments into piano versions. The model employs a Pitch Encoder and Loudness Encoder to extract pitch and loudness features of the music, which serve as conditional inputs to the Diffusion Model's decoder, generating high-quality piano timbres. Case analysis results show that the model performs excellently in terms of pitch accuracy and timbral similarity, maintaining stable conversion across different musical styles (classical, jazz, pop) and lengths (from short clips to full pieces). Particularly, the model maintains high sound quality and accuracy even when dealing with rapidly changing notes and complex musical structures, demonstrating good generalization capability. Additionally, the model has the potential for real-time musical conversion and is suitable for live performances and digital music creation tools. Future research will focus on enhancing the handling of loudness dynamics and incorporating additional musical features (such as timbral variations and rhythmic complexity) to improve the model's adaptability and expressiveness. We plan to explore the model's application potential in other timbre conversion tasks, such as converting vocals to instrumental sounds or integration with MIDI digital pianos, further expanding the application scope of the Diffusion-based timbre conversion model in the field of music generation.
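As a toy illustration of the kind of conditioning signal a Loudness Encoder produces, a frame-wise RMS loudness envelope can be computed as below; the frame size and function name are illustrative assumptions, and a pitch contour would be extracted analogously for the Pitch Encoder.

```python
import math

def loudness_envelope(samples, frame_size=4):
    """Frame-wise RMS loudness over a mono sample stream: one scalar per
    non-overlapping frame, usable as a conditioning sequence."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]
```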
https://arxiv.org/abs/2601.09333
Magnetic resonance imaging (MRI) plays a vital role in clinical diagnostics, yet it remains hindered by long acquisition times and motion artifacts. Multi-contrast MRI reconstruction has emerged as a promising direction by leveraging complementary information from fully-sampled reference scans. However, existing approaches suffer from three major limitations: (1) superficial reference fusion strategies, such as simple concatenation, (2) insufficient utilization of the complementary information provided by the reference contrast, and (3) fixed under-sampling patterns. We propose an efficient and interpretable frequency error-guided reconstruction framework to tackle these issues. We first employ a conditional diffusion model to learn a Frequency Error Prior (FEP), which is then incorporated into a unified framework for jointly optimizing both the under-sampling pattern and the reconstruction network. The proposed reconstruction model employs a model-driven deep unfolding framework that jointly exploits frequency- and image-domain information. In addition, a spatial alignment module and a reference feature decomposition strategy are incorporated to improve reconstruction quality and bridge model-based optimization with data-driven learning for improved physical interpretability. Comprehensive validation across multiple imaging modalities, acceleration rates (4-30x), and sampling schemes demonstrates consistent superiority over state-of-the-art methods in both quantitative metrics and visual quality. All codes are available at this https URL.
https://arxiv.org/abs/2601.09316
Recent diffusion-based video generation models can synthesize visually plausible videos, yet they often struggle to satisfy physical constraints. A key reason is that most existing approaches remain single-stage: they entangle high-level physical understanding with low-level visual synthesis, making it hard to generate content that requires explicit physical reasoning. To address this limitation, we propose a training-free three-stage pipeline, PhyRPR: PhyReason--PhyPlan--PhyRefine, which decouples physical understanding from visual synthesis. Specifically, PhyReason uses a large multimodal model for physical state reasoning and an image generator for keyframe synthesis; PhyPlan deterministically synthesizes a controllable coarse motion scaffold; and PhyRefine injects this scaffold into diffusion sampling via a latent fusion strategy to refine appearance while preserving the planned dynamics. This staged design enables explicit physical control during generation. Extensive experiments under physics constraints show that our method consistently improves physical plausibility and motion controllability.
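The latent fusion in the refinement stage can be sketched as a time-weighted blend between the current diffusion sample and the planned motion scaffold: noisy early steps lean on the scaffold, late steps hand control back to the sampler. The linear schedule and the alpha_max parameter below are assumptions for this sketch, not the paper's exact fusion rule.

```python
def fuse_latents(sampled, scaffold, t, t_total, alpha_max=0.8):
    """Blend a diffusion latent with a motion-scaffold latent.
    t counts down from t_total (most noise) to 0 (final sample), so the
    scaffold's influence decays as appearance details are refined."""
    alpha = alpha_max * (t / t_total)
    return [alpha * s + (1 - alpha) * x for x, s in zip(sampled, scaffold)]
```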
https://arxiv.org/abs/2601.09255
Substation meters play a critical role in monitoring and ensuring the stable operation of power grids, yet their detection of cracks and other physical defects is often hampered by a severe scarcity of annotated samples. To address this few-shot generation challenge, we propose a novel framework that integrates Knowledge Embedding and Hypernetwork-Guided Conditional Control into a Stable Diffusion pipeline, enabling realistic and controllable synthesis of defect images from limited data. First, we bridge the substantial domain gap between natural-image pre-trained models and industrial equipment by fine-tuning a Stable Diffusion backbone using DreamBooth-style knowledge embedding. This process encodes the unique structural and textural priors of substation meters, ensuring generated images retain authentic meter characteristics. Second, we introduce a geometric crack modeling module that parameterizes defect attributes--such as location, length, curvature, and branching pattern--to produce spatially constrained control maps. These maps provide precise, pixel-level guidance during generation. Third, we design a lightweight hypernetwork that dynamically modulates the denoising process of the diffusion model in response to the control maps and high-level defect descriptors, achieving a flexible balance between generation fidelity and controllability. Extensive experiments on a real-world substation meter dataset demonstrate that our method substantially outperforms existing augmentation and generation baselines. It reduces Frechet Inception Distance (FID) by 32.7%, increases diversity metrics, and--most importantly--boosts the mAP of a downstream defect detector by 15.3% when trained on augmented data. The framework offers a practical, high-quality data synthesis solution for industrial inspection systems where defect samples are rare.
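The geometric crack modeling module can be approximated by rasterizing a constant-curvature polyline, parameterized by location, length, heading, and curvature, into a binary pixel-level control map. Branching and thickness handling are simplified in this sketch, and all names and parameters are illustrative assumptions.

```python
import math

def crack_control_map(h, w, x0, y0, length, angle, curvature, thickness=1):
    """Trace a parametric crack across an h-by-w grid and mark the pixels
    it visits, producing a spatially constrained control map."""
    grid = [[0] * w for _ in range(h)]
    x, y, theta = float(x0), float(y0), angle
    for _ in range(int(length)):
        xi, yi = int(round(x)), int(round(y))
        for dy in range(-thickness + 1, thickness):
            for dx in range(-thickness + 1, thickness):
                if 0 <= yi + dy < h and 0 <= xi + dx < w:
                    grid[yi + dy][xi + dx] = 1
        x += math.cos(theta)
        y += math.sin(theta)
        theta += curvature          # constant-curvature bend per step
    return grid
```

A downstream conditioning mechanism (here, the hypernetwork-guided control) would consume such maps alongside high-level defect descriptors.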
https://arxiv.org/abs/2601.09238
Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision. We present SpikeVAEDiff, a novel two-stage framework that combines a Very Deep Variational Autoencoder (VDVAE) and the Versatile Diffusion model to generate high-resolution and semantically meaningful image reconstructions from neural spike data. In the first stage, VDVAE produces low-resolution preliminary reconstructions by mapping neural spike signals to latent representations. In the second stage, regression models map neural spike signals to CLIP-Vision and CLIP-Text features, enabling Versatile Diffusion to refine the images via image-to-image generation. We evaluate our approach on the Allen Visual Coding-Neuropixels dataset and analyze different brain regions. Our results show that the VISI region exhibits the most prominent activation and plays a key role in reconstruction quality. We present both successful and unsuccessful reconstruction examples, reflecting the challenges of decoding neural activity. Compared with fMRI-based approaches, spike data provides superior temporal and spatial resolution. We further validate the effectiveness of the VDVAE model and conduct ablation studies demonstrating that data from specific brain regions significantly enhances reconstruction performance.
https://arxiv.org/abs/2601.09213
Medical imaging datasets often suffer from class imbalance and limited availability of pathology-rich cases, which constrains the performance of machine learning models for segmentation, classification, and vision-language tasks. To address this challenge, we propose POWDR, a pathology-preserving outpainting framework for 3D MRI based on a conditioned wavelet diffusion model. Unlike conventional augmentation or unconditional synthesis, POWDR retains real pathological regions while generating anatomically plausible surrounding tissue, enabling diversity without fabricating lesions. Our approach leverages wavelet-domain conditioning to enhance high-frequency detail and mitigate blurring common in latent diffusion models. We introduce a random connected mask training strategy to overcome conditioning-induced collapse and improve diversity outside the lesion. POWDR is evaluated on brain MRI using BraTS datasets and extended to knee MRI to demonstrate tissue-agnostic applicability. Quantitative metrics (FID, SSIM, LPIPS) confirm image realism, while diversity analysis shows significant improvement with random-mask training (cosine similarity reduced from 0.9947 to 0.9580; KL divergence increased from 0.00026 to 0.01494). Clinically relevant assessments reveal gains in tumor segmentation performance using nnU-Net, with Dice scores improving from 0.6992 to 0.7137 when adding 50 synthetic cases. Tissue volume analysis indicates no significant differences for CSF and GM compared to real images. These findings highlight POWDR as a practical solution for addressing data scarcity and class imbalance in medical imaging. The method is extensible to multiple anatomies and offers a controllable framework for generating diverse, pathology-preserving synthetic data to support robust model development.
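A random connected mask of the kind used in the training strategy can be produced by randomized region growing from a seed pixel, which guarantees connectivity by construction. This is an illustrative stand-in under that assumption, not the paper's implementation.

```python
import random

def random_connected_mask(h, w, n_pixels, seed=None):
    """Grow a 4-connected region of n_pixels cells on an h-by-w grid,
    starting from a random seed pixel; returns the set of (row, col)."""
    rng = random.Random(seed)
    start = (rng.randrange(h), rng.randrange(w))
    mask = {start}
    frontier = [start]
    while len(mask) < n_pixels and frontier:
        y, x = rng.choice(frontier)
        nbrs = [(y + dy, x + dx)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= y + dy < h and 0 <= x + dx < w
                and (y + dy, x + dx) not in mask]
        if not nbrs:
            frontier.remove((y, x))   # fully surrounded; stop expanding here
            continue
        nxt = rng.choice(nbrs)
        mask.add(nxt)
        frontier.append(nxt)
    return mask
```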
https://arxiv.org/abs/2601.09044
Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative view of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods--outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.
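A minimal sketch of a latent-space viewpoint transformation is an integer translation of a 2D latent grid, conveying only the core idea that re-rendering content under a geometric shift perturbs any embedded signal; the actual method applies controlled geometric transformations with view-guided correspondence attention inside a diffusion model, so everything below is a simplifying assumption.

```python
def shift_latent(latent, dx, dy, fill=0.0):
    """Translate a 2D latent grid by (dx, dy), filling exposed cells.
    A stand-in for the controlled geometric transformation step."""
    h, w = len(latent), len(latent[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = latent[sy][sx]
    return out
```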
https://arxiv.org/abs/2601.08832
Accurate delineation of acute ischemic stroke lesions in MRI is a key component of stroke diagnosis and management. In recent years, deep learning models have been successfully applied to the automatic segmentation of such lesions. While most proposed architectures are based on the U-Net framework, they primarily differ in their choice of loss functions and in the use of deep supervision, residual connections, and attention mechanisms. Moreover, many implementations are not publicly available, and the optimal configuration for acute ischemic stroke (AIS) lesion segmentation remains unclear. In this work, we introduce ISLA (Ischemic Stroke Lesion Analyzer), a new deep learning model for AIS lesion segmentation from diffusion MRI, trained on three multicenter databases totaling more than 1500 AIS participants. Through systematic optimization of the loss function, convolutional architecture, deep supervision, and attention mechanisms, we developed a robust segmentation framework. We further investigated unsupervised domain adaptation to improve generalization to an external clinical dataset. ISLA outperformed two state-of-the-art approaches for AIS lesion segmentation on an external test set. Codes and trained models will be made publicly available to facilitate reuse and reproducibility.
https://arxiv.org/abs/2601.08732
Accurate and generalisable segmentation of stroke lesions from magnetic resonance imaging (MRI) is essential for advancing clinical research, prognostic modelling, and personalised interventions. Although deep learning has improved automated lesion delineation, many existing models are optimised for narrow imaging contexts and generalise poorly to independent datasets, modalities, and stroke stages. Here, we systematically evaluated stroke lesion segmentation using the nnU-Net framework across multiple heterogeneous, publicly available MRI datasets spanning acute and chronic stroke. Models were trained and tested on diffusion-weighted imaging (DWI), fluid-attenuated inversion recovery (FLAIR), and T1-weighted MRI, and evaluated on independent datasets. Across stroke stages, models showed robust generalisation, with segmentation accuracy approaching reported inter-rater reliability. Performance varied with imaging modality and training data characteristics. In acute stroke, DWI-trained models consistently outperformed FLAIR-based models, with only modest gains from multimodal combinations. In chronic stroke, increasing training set size improved performance, with diminishing returns beyond several hundred cases. Lesion volume was a key determinant of accuracy: smaller lesions were harder to segment, and models trained on restricted volume ranges generalised poorly. MRI image quality further constrained generalisability: models trained on lower-quality scans transferred poorly, whereas those trained on higher-quality data generalised well to noisier images. Discrepancies between predictions and reference masks were often attributable to limitations in manual annotations. Together, these findings show that automated lesion segmentation can approach human-level performance while identifying key factors governing generalisability and informing the development of lesion segmentation tools.
https://arxiv.org/abs/2601.08701
Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, which exhibit the limitations of requiring costly retraining, degrading the quality of benign generations, or failing to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at this https URL.
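The token-level intervention can be sketched as adding a predicted delta only at masked (unsafe) token positions in the prompt embedding, modulated by an adaptive scale. The flat per-sequence scale and the function signature below are simplifying assumptions for illustration.

```python
def redirect_embeddings(embs, deltas, token_mask, scale):
    """Token-level redirection: shift only the embeddings flagged by the
    mask, leaving benign tokens untouched so safe prompts pass through."""
    out = []
    for emb, d, flagged in zip(embs, deltas, token_mask):
        if flagged:
            out.append([e + scale * di for e, di in zip(emb, d)])
        else:
            out.append(list(emb))
    return out
```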
https://arxiv.org/abs/2601.08623
For automated assessment of knee MRI scans, both accuracy and interpretability are essential for clinical use and adoption. Traditional radiomics rely on predefined features chosen at the population level; while more interpretable, they are often too restrictive to capture patient-specific variability and can underperform end-to-end deep learning (DL). To address this, we propose two complementary strategies that bring individuality and interpretability: radiomic fingerprints and healthy personas. First, a radiomic fingerprint is a dynamically constructed, patient-specific feature set derived from MRI. Instead of applying a uniform population-level signature, our model predicts feature relevance from a pool of candidate features and selects only those most predictive for each patient, while maintaining feature-level interpretability. This fingerprint can be viewed as a latent-variable model of feature usage, where an image-conditioned predictor estimates usage probabilities and a transparent logistic regression with global coefficients performs classification. Second, a healthy persona synthesises a pathology-free baseline for each patient using a diffusion model trained to reconstruct healthy knee MRIs. Comparing features extracted from pathological images against their personas highlights deviations from normal anatomy, enabling intuitive, case-specific explanations of disease manifestations. We systematically compare fingerprints, personas, and their combination across three clinical tasks. Experimental results show that both approaches yield performance comparable to or surpassing state-of-the-art DL models, while supporting interpretability at multiple levels. Case studies further illustrate how these perspectives facilitate human-explainable biomarker discovery and pathology localisation.
https://arxiv.org/abs/2601.08604
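The fingerprint idea (per-patient usage probabilities gating a pool of candidate features, followed by a single logistic regression with global coefficients) reduces to a short formula. A minimal NumPy sketch, with random values standing in for the image-conditioned usage predictor and the learned coefficients:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fingerprint_predict(features, usage_prob, beta, bias):
    """Gate each candidate radiomic feature by its patient-specific
    usage probability, then classify with one global logistic
    regression (beta, bias). Names are illustrative, not the paper's API."""
    gated = usage_prob * features   # the patient-specific fingerprint
    return sigmoid(gated @ beta + bias)

rng = np.random.default_rng(1)
F = 10                              # size of the candidate feature pool
features = rng.normal(size=F)       # features extracted from one patient's MRI
usage_prob = rng.uniform(size=F)    # predicted per-patient feature relevance
beta = rng.normal(size=F)           # global, interpretable coefficients
p = fingerprint_predict(features, usage_prob, beta, bias=0.0)
```

When every usage probability is zero the prediction collapses to the bias term, which is what makes the gating interpretable: a feature contributes only when the model judges it relevant for this patient.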
Facade renovation offers a more sustainable alternative to full demolition, yet producing design proposals that preserve existing structures while expressing new intent remains challenging. Current workflows typically require detailed as-built modelling before design, which is time-consuming, labour-intensive, and often involves repeated revisions. To address this issue, we propose a three-stage framework combining generative artificial intelligence (AI) and vision-language models (VLMs) that directly processes a rough structural sketch and textual descriptions to produce consistent renovation proposals. First, a fine-tuned VLM takes the input sketch and predicts bounding boxes specifying where modifications are needed and which components should be added. Next, a Stable Diffusion model generates detailed sketches of new elements, which are merged with the original outline through a generative inpainting pipeline. Finally, ControlNet is employed to refine the result into a photorealistic image. Experiments on datasets and real industrial buildings indicate that the proposed framework can generate renovation proposals that preserve the original structure while improving facade detail quality. This approach effectively bypasses the need for detailed as-built modelling, enabling architects to rapidly explore design alternatives, iterate on early-stage concepts, and communicate renovation intentions with greater clarity.
https://arxiv.org/abs/2601.08531
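The three-stage workflow (VLM box prediction, diffusion inpainting, ControlNet refinement) is essentially a fixed pipeline of pluggable models. A skeleton of that orchestration, with trivial lambdas standing in for the fine-tuned VLM, the Stable Diffusion inpainter, and ControlNet; none of this is the authors' code:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # x, y, w, h in sketch coordinates

@dataclass
class RenovationPipeline:
    """Skeleton of the three-stage facade workflow; each stage is a
    pluggable callable (hypothetical stand-ins for the real models)."""
    predict_boxes: Callable[[str, str], List[Box]]  # sketch + text -> regions to modify
    inpaint: Callable[[str, List[Box]], str]        # sketch + regions -> merged sketch
    refine: Callable[[str], str]                    # merged sketch -> photorealistic image

    def run(self, sketch: str, text: str) -> str:
        boxes = self.predict_boxes(sketch, text)    # stage 1: fine-tuned VLM
        merged = self.inpaint(sketch, boxes)        # stage 2: diffusion inpainting
        return self.refine(merged)                  # stage 3: ControlNet refinement

# Dummy stand-ins so the skeleton runs end to end.
pipeline = RenovationPipeline(
    predict_boxes=lambda sketch, text: [(10, 20, 64, 64)],
    inpaint=lambda sketch, boxes: sketch + f"+{len(boxes)}boxes",
    refine=lambda merged: "render(" + merged + ")",
)
result = pipeline.run("facade_sketch", "add balconies")
```

Keeping the stages behind plain callables mirrors the paper's modularity: any of the three models can be swapped without touching the rest of the workflow.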
We introduce Spectral Generative Flow Models (SGFMs), a physics-inspired alternative to transformer-based large language models. Instead of representing text or video as sequences of discrete tokens processed by attention, SGFMs treat generation as the evolution of a continuous field governed by constrained stochastic dynamics in a multiscale wavelet basis. This formulation replaces global attention with local operators, spectral projections, and Navier-Stokes-like transport, yielding a generative mechanism grounded in continuity, geometry, and physical structure. Our framework provides three key innovations: (i) a field-theoretic ontology in which text and video are unified as trajectories of a stochastic partial differential equation; (ii) a wavelet-domain representation that induces sparsity, scale separation, and computational efficiency; and (iii) a constrained stochastic flow that enforces stability, coherence, and uncertainty propagation. Together, these components define a generative architecture that departs fundamentally from autoregressive modeling and diffusion-based approaches. SGFMs offer a principled path toward long-range coherence, multimodal generality, and physically structured inductive bias in next-generation generative models.
https://arxiv.org/abs/2601.08893
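The dynamics the abstract describes (a constrained stochastic flow on wavelet coefficients, with local drift replacing attention) can be caricatured in a single Euler-Maruyama step. A toy 1-D sketch with a hand-rolled single-level Haar transform; the damping, noise, and soft-threshold terms are illustrative choices, not the paper's operators:

```python
import numpy as np

def haar_fwd(x):
    """Single-level 1-D Haar transform: approximation and detail coefficients."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def haar_inv(a, d):
    """Exact inverse of haar_fwd."""
    x = np.empty(2 * a.size)
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def sgfm_step(x, dt=0.05, nu=0.5, sigma=0.1, thresh=0.05, rng=None):
    """One toy Euler-Maruyama step of a constrained stochastic flow in the
    wavelet domain: viscosity-like damping of the fine-scale coefficients,
    additive noise, and a soft-threshold projection enforcing sparsity."""
    if rng is None:
        rng = np.random.default_rng(0)
    a, d = haar_fwd(x)
    d = d - dt * nu * d                                  # local drift (damping)
    d = d + np.sqrt(dt) * sigma * rng.normal(size=d.size)  # stochastic forcing
    d = np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0)   # sparsity constraint
    return haar_inv(a, d)

x = np.sin(np.linspace(0.0, np.pi, 8))  # a toy continuous "field"
y = sgfm_step(x)
```

Note the division of labor: the drift and noise act only on local wavelet coefficients (no global attention anywhere), while the thresholding plays the role of the constraint projection.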
Map matching for sparse trajectories is a fundamental problem for many trajectory-based applications, e.g., traffic scheduling and traffic flow analysis. Existing methods for map matching are generally based on Hidden Markov Model (HMM) or encoder-decoder frameworks. However, these methods continue to face significant challenges when handling noisy or sparsely sampled GPS trajectories. To address these limitations, we propose DiffMM, an encoder-diffusion-based map matching framework that produces effective yet efficient matching results through a one-step diffusion process. We first introduce a road-segment-aware trajectory encoder that jointly embeds the input trajectory and its surrounding candidate road segments into a shared latent space through an attention mechanism. Next, we propose a one-step diffusion method that realizes map matching through a shortcut model, leveraging the joint embedding of the trajectory and candidate road segments as conditioning context. We conduct extensive experiments on large-scale trajectory datasets, demonstrating that our approach consistently outperforms state-of-the-art map matching methods in terms of both accuracy and efficiency, particularly for sparse trajectories and complex road network topologies.
https://arxiv.org/abs/2601.08482
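The two ingredients of the abstract (an attention-based joint embedding of the trajectory and its candidate road segments, used as the condition for a one-step shortcut denoiser) can be sketched with toy linear maps. Everything below, including the shapes, the softmax attention, and the single linear "denoiser", is an illustrative stand-in rather than DiffMM's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 16, 5  # latent dimension, number of candidate road segments

def encode(traj_emb, cand_emb, Wq, Wk):
    """Toy attention: embed the trajectory jointly with its candidate
    road segments and pool them into one condition vector."""
    q = traj_emb @ Wq                      # query from the trajectory
    k = cand_emb @ Wk                      # keys from the candidates
    w = np.exp(q @ k.T / np.sqrt(d))
    w /= w.sum()                           # softmax attention weights
    return w @ cand_emb                    # condition vector

def one_step_match(x_t, cond, W):
    """One-step shortcut 'denoiser': map the noisy latent and the
    condition directly to candidate scores, with no iterative sampling."""
    h = np.concatenate([x_t, cond])
    return h @ W                           # scores over the K candidates

traj_emb = rng.normal(size=d)
cand_emb = rng.normal(size=(K, d))
Wq, Wk = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))
cond = encode(traj_emb, cand_emb, Wq, Wk)
x_t = rng.normal(size=d)                   # noisy matching latent
W = rng.normal(size=(2 * d, K))
scores = one_step_match(x_t, cond, W)
matched = int(np.argmax(scores))           # index of the matched segment
```

The efficiency claim of the abstract comes from exactly this shape: a shortcut model collapses the iterative denoising chain into one conditioned forward pass per matching step.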