Facade renovation offers a more sustainable alternative to full demolition, yet producing design proposals that preserve existing structures while expressing new intent remains challenging. Current workflows typically require detailed as-built modelling before design, which is time-consuming, labour-intensive, and often involves repeated revisions. To address this issue, we propose a three-stage framework combining generative artificial intelligence (AI) and vision-language models (VLMs) that directly processes rough structural sketches and textual descriptions to produce consistent renovation proposals. First, a fine-tuned VLM takes the input sketch and predicts bounding boxes specifying where modifications are needed and which components should be added. Next, a Stable Diffusion model generates detailed sketches of the new elements, which are merged with the original outline through a generative inpainting pipeline. Finally, ControlNet is employed to refine the result into a photorealistic image. Experiments on datasets and real industrial buildings indicate that the proposed framework can generate renovation proposals that preserve the original structure while improving facade detail quality. This approach effectively bypasses the need for detailed as-built modelling, enabling architects to rapidly explore design alternatives, iterate on early-stage concepts, and communicate renovation intentions with greater clarity.
https://arxiv.org/abs/2601.08531
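To make the three-stage flow above concrete, here is a minimal Python orchestration sketch. Every model call (vlm_predict_boxes, sd_generate_element, controlnet_refine) is a hypothetical stub standing in for the fine-tuned VLM, Stable Diffusion, and ControlNet stages described in the abstract; only the data flow and the bounding-box merge step are meant literally.

```python
"""Orchestration sketch of the sketch-to-renovation pipeline (stubs, not the real models)."""
from dataclasses import dataclass
import numpy as np

@dataclass
class Box:
    x0: int; y0: int; x1: int; y1: int
    component: str  # e.g. "window" (hypothetical component label)

def vlm_predict_boxes(sketch: np.ndarray, prompt: str) -> list[Box]:
    """Stage 1 stub: a fine-tuned VLM would return edit regions and components to add."""
    h, w = sketch.shape[:2]
    return [Box(w // 4, h // 4, w // 2, h // 2, "window")]

def sd_generate_element(component: str, size: tuple[int, int]) -> np.ndarray:
    """Stage 2a stub: Stable Diffusion would draw a detailed sketch of the new element."""
    return np.random.rand(*size)

def inpaint_merge(outline: np.ndarray, patch: np.ndarray, box: Box) -> np.ndarray:
    """Stage 2b: merge the generated element into the original outline (real merge step)."""
    merged = outline.copy()
    merged[box.y0:box.y1, box.x0:box.x1] = patch
    return merged

def controlnet_refine(sketch: np.ndarray, prompt: str) -> np.ndarray:
    """Stage 3 stub: ControlNet would render the merged sketch into a photorealistic image."""
    return sketch

def renovate(sketch: np.ndarray, prompt: str) -> np.ndarray:
    for box in vlm_predict_boxes(sketch, prompt):
        patch = sd_generate_element(box.component, (box.y1 - box.y0, box.x1 - box.x0))
        sketch = inpaint_merge(sketch, patch, box)
    return controlnet_refine(sketch, prompt)

proposal = renovate(np.zeros((256, 256)), "add large glazed windows to the facade")
print(proposal.shape)
```

In practice each stub would wrap the corresponding pretrained model, and the simple paste would be replaced by the paper's generative inpainting pipeline.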
The development of robust artificial intelligence models for histopathology diagnosis is severely constrained by the scarcity of expert-annotated lesion data, particularly for rare pathologies and underrepresented disease subtypes. While data augmentation offers a potential solution, existing methods fail to generate sufficiently realistic lesion morphologies that preserve the complex spatial relationships and cellular architectures characteristic of histopathological tissues. Here we present PathoGen, a diffusion-based generative model that enables controllable, high-fidelity inpainting of lesions into benign histopathology images. Unlike conventional augmentation techniques, PathoGen leverages the iterative refinement process of diffusion models to synthesize lesions with natural tissue boundaries, preserved cellular structures, and authentic staining characteristics. We validate PathoGen across four diverse datasets representing distinct diagnostic challenges: kidney, skin, breast, and prostate pathology. Quantitative assessment confirms that PathoGen outperforms state-of-the-art generative baselines, including conditional GAN and Stable Diffusion, in image fidelity and distributional similarity. Crucially, we show that augmenting training sets with PathoGen-synthesized lesions enhances downstream segmentation performance compared to traditional geometric augmentations, particularly in data-scarce regimes. Moreover, by simultaneously generating realistic morphology and pixel-level ground truth, PathoGen effectively overcomes the manual annotation bottleneck. This approach offers a scalable pathway for developing generalizable medical AI systems despite limited expert-labeled data.
https://arxiv.org/abs/2601.08127
In this paper, we present an automated pipeline for generating domain-specific synthetic datasets with diffusion models, addressing the distribution shift between pre-trained models and real-world deployment environments. Our three-stage framework first synthesizes target objects within domain-specific backgrounds through controlled inpainting. The generated outputs are then validated via a multi-modal assessment that integrates object detection, aesthetic scoring, and vision-language alignment. Finally, a user-preference classifier is employed to capture subjective selection criteria. This pipeline enables the efficient construction of high-quality, deployable datasets while reducing reliance on extensive real-world data collection.
https://arxiv.org/abs/2601.08095
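A minimal sketch of the validation stage described above, assuming each generated sample is scored by four independent assessors (object detection, aesthetics, vision-language alignment, user preference) and kept only if all pass a threshold. The scorer stubs, field names, and thresholds are illustrative, not taken from the paper.

```python
"""Sketch of a multi-modal acceptance gate for synthetic samples."""
from dataclasses import dataclass
import random

@dataclass
class Scores:
    detection: float   # does a detector find the target object in the generated image?
    aesthetic: float   # aesthetic-quality score
    alignment: float   # vision-language (e.g. CLIP-style) prompt alignment
    preference: float  # user-preference classifier probability

def score_sample(sample_id: int) -> Scores:
    random.seed(sample_id)  # stub scorers: replace with the real assessment models
    return Scores(*(random.random() for _ in range(4)))

def accept(s: Scores, thr=(0.5, 0.5, 0.5, 0.5)) -> bool:
    """Keep a generated image only if every assessment passes its threshold."""
    return (s.detection >= thr[0] and s.aesthetic >= thr[1]
            and s.alignment >= thr[2] and s.preference >= thr[3])

kept = [i for i in range(100) if accept(score_sample(i))]
print(f"kept {len(kept)} / 100 synthetic samples")
```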
The transformative potential of 3D content creation has been progressively unlocked through advancements in generative models. Recently, intuitive drag editing with geometric changes has attracted significant attention in 2D editing yet remains challenging for 3D scenes. In this paper, we introduce 3DGS-Drag -- a point-based 3D editing framework that provides efficient, intuitive drag manipulation of real 3D scenes. Our approach bridges the gap between deformation-based and 2D-editing-based 3D editing methods, addressing their limitations to geometry-related content editing. We leverage two key innovations: deformation guidance utilizing 3D Gaussian Splatting for consistent geometric modifications and diffusion guidance for content correction and visual quality enhancement. A progressive editing strategy further supports aggressive 3D drag edits. Our method enables a wide range of edits, including motion change, shape adjustment, inpainting, and content extension. Experimental results demonstrate the effectiveness of 3DGS-Drag in various scenes, achieving state-of-the-art performance in geometry-related 3D content editing. Notably, the editing is efficient, taking 10 to 20 minutes on a single RTX 4090 GPU.
https://arxiv.org/abs/2601.07963
LiDAR scene synthesis is an emerging solution to scarcity in 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing with solely an unconditional model. On the KITTI-360 benchmark, R3DPA achieves state-of-the-art performance. Code and pretrained models are available at this https URL.
https://arxiv.org/abs/2601.07692
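A hedged sketch of point (i), feature alignment: intermediate generator features are projected and pulled toward frozen self-supervised features with a cosine objective, in the spirit of representation-alignment losses. The linear projector, dimensions, and exact loss form are assumptions; the paper only states that intermediate features are aligned with self-supervised 3D features.

```python
"""Sketch of aligning intermediate generator features with frozen SSL features."""
import torch
import torch.nn.functional as F

proj = torch.nn.Linear(256, 128)  # maps generator features into the SSL feature space

def alignment_loss(gen_feats: torch.Tensor, ssl_feats: torch.Tensor) -> torch.Tensor:
    """gen_feats: (N, 256) intermediate features; ssl_feats: (N, 128) frozen SSL targets."""
    pred = F.normalize(proj(gen_feats), dim=-1)
    target = F.normalize(ssl_feats.detach(), dim=-1)  # the SSL encoder is not updated
    return (1.0 - (pred * target).sum(dim=-1)).mean() # 1 - cosine similarity

# total training loss would be: generative (diffusion / flow-matching) loss + lambda * alignment
loss = alignment_loss(torch.randn(1024, 256), torch.randn(1024, 128))
print(loss.item())
```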
Text-guided image generation has advanced rapidly with large-scale diffusion models, yet achieving precise stylization with visual exemplars remains difficult. Existing approaches often depend on task-specific retraining or expensive inversion procedures, which can compromise content integrity, reduce style fidelity, and lead to an unsatisfactory trade-off between semantic prompt adherence and style alignment. In this work, we introduce a training-free framework that reformulates style-guided synthesis as an in-context learning task. Guided by textual semantic prompts, our method concatenates a reference style image with a masked target image, leveraging a pretrained ReFlow-based inpainting model to seamlessly integrate semantic content with the desired style through multimodal attention fusion. We further analyze the imbalance and noise sensitivity inherent in multimodal attention fusion and propose a Dynamic Semantic-Style Integration (DSSI) mechanism that reweights attention between textual semantic and style visual tokens, effectively resolving guidance conflicts and enhancing output coherence. Experiments show that our approach achieves high-fidelity stylization with superior semantic-style balance and visual quality, offering a simple yet powerful alternative to complex, artifact-prone prior methods.
https://arxiv.org/abs/2601.06605
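A toy sketch of the attention-reweighting idea behind DSSI: queries attend jointly over text-semantic and style-visual keys, and a per-group gain rebalances the two token groups before the softmax. The scalar gains and tensor shapes are illustrative assumptions; the paper's actual dynamic reweighting rule is not reproduced here.

```python
"""Sketch of reweighted multimodal attention over text and style tokens."""
import torch
import torch.nn.functional as F

def dssi_attention(q, k_text, k_style, v_text, v_style, w_text=1.2, w_style=0.8):
    """q: (N, d); k_/v_text: (T, d); k_/v_style: (S, d). Returns (N, d).
    Adding log(w) to the logits multiplies that group's attention mass by w."""
    d = q.shape[-1]
    logits_text = q @ k_text.T / d**0.5 + torch.log(torch.tensor(w_text))
    logits_style = q @ k_style.T / d**0.5 + torch.log(torch.tensor(w_style))
    attn = F.softmax(torch.cat([logits_text, logits_style], dim=-1), dim=-1)
    return attn @ torch.cat([v_text, v_style], dim=0)

out = dssi_attention(torch.randn(16, 64), torch.randn(8, 64), torch.randn(32, 64),
                     torch.randn(8, 64), torch.randn(32, 64))
print(out.shape)  # (16, 64)
```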
Video outpainting extends a video beyond its original boundaries by synthesizing missing border content. Compared with image outpainting, it requires not only per-frame spatial plausibility but also long-range temporal coherence, especially when outpainted content becomes visible across time under camera or object motion. We propose GlobalPaint, a diffusion-based framework for spatiotemporal coherent video outpainting. Our approach adopts a hierarchical pipeline that first outpaints key frames and then completes intermediate frames via an interpolation model conditioned on the completed boundaries, reducing error accumulation in sequential processing. At the model level, we augment a pretrained image inpainting backbone with (i) an Enhanced Spatial-Temporal module featuring 3D windowed attention for stronger spatiotemporal interaction, and (ii) global feature guidance that distills OpenCLIP features from observed regions across all frames into compact global tokens using a dedicated extractor. Comprehensive evaluations on benchmark datasets demonstrate improved reconstruction quality and more natural motion compared to prior methods. Our demo page is this https URL
https://arxiv.org/abs/2601.06413
In this paper, we introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos, and inpainting them with semantically consistent and temporally coherent content. Our approach leverages a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens describing the target object and its effects, we localize relevant visual tokens via visual-text cross-attention and visual self-attention. This produces an intermediate effect mask that we fuse with the user mask to obtain a final foreground token mask to replace. We first invert the video through the DiT to obtain structured noise, then reinitialize the masked tokens with Gaussian noise while preserving background tokens. During denoising, we copy values for the background tokens saved during inversion to maintain scene fidelity. To address the lack of suitable evaluation, we introduce a new object removal metric that rewards temporal consistency among foreground tokens across consecutive frames, coherence between foreground and background tokens within each frame, and dissimilarity between the input and output foreground tokens. Experiments on DAVIS and a newly curated real-world associated effect benchmark (WIPER-Bench) show that Object-WIPER surpasses both training-based and training-free baselines in terms of the metric, achieving clean removal and temporally stable reconstruction without any retraining. Our new benchmark, source code, and pre-trained models will be publicly available.
https://arxiv.org/abs/2601.06391
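The token bookkeeping in the removal loop can be sketched as follows, assuming a DiT whose inversion and reverse steps are abstracted into stubs (invert, denoise_step). Only the masking logic, reinitialising foreground tokens with Gaussian noise and copying back the saved background latents at each step, follows the abstract; the tensor shapes and step counts are illustrative.

```python
"""Sketch of masked denoising with background-token preservation."""
import torch

def invert(video_tokens: torch.Tensor, steps: int) -> list[torch.Tensor]:
    """Stub: would run inversion through the DiT and return latents for each timestep."""
    return [video_tokens + 0.1 * t * torch.randn_like(video_tokens) for t in range(steps + 1)]

def denoise_step(tokens: torch.Tensor, t: int) -> torch.Tensor:
    """Stub: one reverse-diffusion step of the pretrained DiT."""
    return tokens - 0.1 * torch.randn_like(tokens)

def remove_object(video_tokens: torch.Tensor, fg_mask: torch.Tensor, steps: int = 10):
    saved = invert(video_tokens, steps)           # structured noise / latents per step
    x = saved[-1].clone()
    x[fg_mask] = torch.randn_like(x[fg_mask])     # reinitialise foreground tokens
    for t in range(steps, 0, -1):
        x = denoise_step(x, t)
        x[~fg_mask] = saved[t - 1][~fg_mask]      # copy saved background tokens back
    return x

tokens = torch.randn(1, 256, 64)                  # (batch, tokens, dim)
mask = torch.zeros(1, 256, 64, dtype=torch.bool)
mask[:, :64] = True                               # hypothetical fused foreground token mask
print(remove_object(tokens, mask).shape)
```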
We propose a novel framework for decomposing arbitrarily posed humans into animatable multi-layered 3D human avatars, separating the body and garments. Conventional single-layer reconstruction methods lock clothing to one identity, while prior multi-layer approaches struggle with occluded regions. We overcome both limitations by encoding each layer as a set of 2D Gaussians for accurate geometry and photorealistic rendering, and inpainting hidden regions with a pretrained 2D diffusion model via score-distillation sampling (SDS). Our three-stage training strategy first reconstructs the coarse canonical garment via single-layer reconstruction, followed by multi-layer training to jointly recover the inner-layer body and outer-layer garment details. Experiments on two 3D human benchmark datasets (4D-Dress, Thuman2.0) show that our approach achieves better rendering quality and layer decomposition and recomposition than the previous state-of-the-art, enabling realistic virtual try-on under novel viewpoints and poses, and advancing practical creation of high-fidelity 3D human assets for immersive applications. Our code is available at this https URL
https://arxiv.org/abs/2601.05853
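For readers unfamiliar with score-distillation sampling (SDS), the hidden-region supervision can be sketched as below: a rendered view is noised, a pretrained 2D diffusion model predicts the noise, and the weighted noise residual is injected as a gradient through a surrogate loss. The noise-predictor stub, the weighting w(t) = 1 - alpha_t, and the shapes are assumptions; only the SDS gradient construction itself is standard.

```python
"""Sketch of one score-distillation (SDS) update on a rendered image."""
import torch

def unet_eps(noisy: torch.Tensor, t: int, prompt: str) -> torch.Tensor:
    """Stub: a real text-conditioned diffusion U-Net would predict the added noise."""
    return torch.randn_like(noisy)

def sds_loss(rendered: torch.Tensor, prompt: str, alphas: torch.Tensor) -> torch.Tensor:
    t = torch.randint(1, len(alphas), (1,)).item()
    a = alphas[t]
    eps = torch.randn_like(rendered)
    noisy = a.sqrt() * rendered + (1 - a).sqrt() * eps   # forward diffusion of the render
    eps_pred = unet_eps(noisy, t, prompt)
    grad = (1.0 - a) * (eps_pred - eps)                  # SDS gradient with w(t) = 1 - alpha_t
    return (grad.detach() * rendered).sum()              # surrogate loss: d/d rendered == grad

rendered = torch.rand(1, 3, 64, 64, requires_grad=True)  # e.g. a rendered hidden-garment view
alphas = torch.linspace(0.999, 0.01, 50)
sds_loss(rendered, "inner garment layer, plain fabric", alphas).backward()
print(rendered.grad.abs().mean().item())
```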
Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still significantly struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. This paradigm is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. Further, we also construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
https://arxiv.org/abs/2601.03741
Solving inverse problems in imaging requires models that support efficient inference, uncertainty quantification, and principled probabilistic reasoning. Energy-Based Models (EBMs), with their interpretable energy landscapes and compositional structure, are well-suited for this task but have historically suffered from high computational costs and training instability. To overcome the historical shortcomings of EBMs, we introduce a fast distillation strategy to transfer the strengths of pre-trained diffusion models into multi-scale EBMs. These distilled EBMs enable efficient sampling and preserve the interpretability and compositionality inherent to potential-based frameworks. Leveraging EBM compositionality, we propose the Annealed Langevin Posterior Sampling (ALPS) algorithm for Maximum-A-Posteriori (MAP), Minimum Mean Square Error (MMSE), and uncertainty estimates for inverse problems in imaging. Unlike diffusion models that use complex guidance strategies for latent variables, we perform annealing on static posterior distributions that are well-defined and composable. Experiments on image inpainting and MRI reconstruction demonstrate that our method matches or surpasses diffusion-based baselines in both accuracy and efficiency, while also supporting MAP recovery. Overall, our framework offers a scalable and principled solution for inverse problems in imaging, with potential for practical deployment in scientific and clinical settings. The ALPS code is available in the ALPS GitHub repository at this https URL.
https://arxiv.org/abs/2601.02594
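A toy numpy sketch of annealed Langevin sampling on a composed posterior energy, with an analytic Gaussian prior standing in for the distilled multi-scale EBM and a linear measurement model for the inverse problem. The annealing schedule, step sizes, and noise level are illustrative, not the paper's settings.

```python
"""Toy annealed Langevin sampling on a static, composed posterior energy."""
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 10))             # linear measurement operator (toy)
x_true = rng.standard_normal(10)
sigma = 0.5                                  # measurement noise level
y = A @ x_true + sigma * rng.standard_normal(5)

def grad_energy(x, temp):
    """Gradient of E(x) = ||x||^2 / (2*temp) + ||Ax - y||^2 / (2*sigma^2)."""
    return x / temp + A.T @ (A @ x - y) / sigma**2

x = rng.standard_normal(10)
for temp in np.geomspace(10.0, 1.0, 20):     # anneal the composed posterior
    step = 5e-4 * temp
    for _ in range(100):                     # Langevin dynamics at this annealing level
        x = x - step * grad_energy(x, temp) + np.sqrt(2 * step) * rng.standard_normal(10)

print("data residual of posterior sample:", float(np.linalg.norm(A @ x - y)))
```

Collecting several such samples would give MMSE and uncertainty estimates; dropping the injected noise turns the final stage into a MAP-style descent on the same energy.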
Video recognition models remain vulnerable to adversarial attacks, while existing diffusion-based purification methods suffer from inefficient sampling and curved trajectories. Directly regressing clean videos from adversarial inputs often fails to recover faithful content due to the subtle nature of perturbations; this necessitates physically shattering the adversarial structure. Therefore, we propose Flow Matching for Adversarial Video Purification (FMVP). FMVP physically shatters global adversarial structures via a masking strategy and reconstructs clean video dynamics using Conditional Flow Matching (CFM) with an inpainting objective. To further decouple semantic content from adversarial noise, we design a Frequency-Gated Loss (FGL) that explicitly suppresses high-frequency adversarial residuals while preserving low-frequency fidelity. We design Attack-Aware and Generalist training paradigms to handle known and unknown threats, respectively. Extensive experiments on UCF-101 and HMDB-51 demonstrate that FMVP outperforms state-of-the-art methods (DiffPure, Defense Patterns (DP), Temporal Shuffling (TS) and FlowPure), achieving robust accuracy exceeding 87% against PGD and 89% against CW attacks. Furthermore, FMVP demonstrates superior robustness against adaptive attacks (DiffHammer) and functions as a zero-shot adversarial detector, attaining detection accuracies of 98% for PGD and 79% for highly imperceptible CW attacks.
https://arxiv.org/abs/2601.02228
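One plausible reading of the Frequency-Gated Loss is a Fourier-domain split that penalises high-frequency reconstruction residuals more strongly than low-frequency ones; the radial cutoff and weights below are assumptions, shown only to make the idea concrete.

```python
"""Sketch of a frequency-gated reconstruction loss."""
import torch

def frequency_gated_loss(pred: torch.Tensor, target: torch.Tensor,
                         cutoff: float = 0.25, hf_weight: float = 2.0) -> torch.Tensor:
    """pred/target: (B, C, H, W). Penalise high-frequency residuals more strongly."""
    B, C, H, W = pred.shape
    fy = torch.fft.fftfreq(H).view(H, 1)
    fx = torch.fft.fftfreq(W).view(1, W)
    radius = (fy**2 + fx**2).sqrt()
    low = (radius <= cutoff).float()                # low-pass gate in the Fourier domain
    resid = torch.fft.fft2(pred - target)
    low_err = (resid.abs()**2 * low).mean()         # preserve low-frequency fidelity
    high_err = (resid.abs()**2 * (1 - low)).mean()  # adversarial residuals concentrate here
    return low_err + hf_weight * high_err

loss = frequency_gated_loss(torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32))
print(loss.item())
```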
Marine oil spills are urgent environmental hazards that demand rapid and reliable detection to minimise ecological and economic damage. While Synthetic Aperture Radar (SAR) imagery has become a key tool for large-scale oil spill monitoring, most existing detection methods rely on deep learning-based segmentation applied to single SAR images. These static approaches struggle to distinguish true oil spills from visually similar oceanic features (e.g., biogenic slicks or low-wind zones), leading to high false positive rates and limited generalizability, especially under data-scarce conditions. To overcome these limitations, we introduce Oil Spill Change Detection (OSCD), a new bi-temporal task that focuses on identifying changes between pre- and post-spill SAR images. As real co-registered pre-spill imagery is not always available, we propose the Temporal-Aware Hybrid Inpainting (TAHI) framework, which generates synthetic pre-spill images from post-spill SAR data. TAHI integrates two key components: High-Fidelity Hybrid Inpainting for oil-free reconstruction, and Temporal Realism Enhancement for radiometric and sea-state consistency. Using TAHI, we construct the first OSCD dataset and benchmark several state-of-the-art change detection models. Results show that OSCD significantly reduces false positives and improves detection accuracy compared to conventional segmentation, demonstrating the value of temporally-aware methods for reliable, scalable oil spill monitoring in real-world scenarios.
https://arxiv.org/abs/2601.02139
Reconstructing complete and animatable 3D human avatars from monocular videos remains challenging, particularly under severe occlusions. While 3D Gaussian Splatting has enabled photorealistic human rendering, existing methods struggle with incomplete observations, often producing corrupted geometry and temporal inconsistencies. We present InpaintHuman, a novel method for generating high-fidelity, complete, and animatable avatars from occluded monocular videos. Our approach introduces two key innovations: (i) a multi-scale UV-parameterized representation with hierarchical coarse-to-fine feature interpolation, enabling robust reconstruction of occluded regions while preserving geometric details; and (ii) an identity-preserving diffusion inpainting module that integrates textual inversion with semantic-conditioned guidance for subject-specific, temporally coherent completion. Unlike SDS-based methods, our approach employs direct pixel-level supervision to ensure identity fidelity. Experiments on synthetic benchmarks (PeopleSnapshot, ZJU-MoCap) and real-world scenarios (OcMotion) demonstrate competitive performance with consistent improvements in reconstruction quality across diverse poses and viewpoints.
https://arxiv.org/abs/2601.02098
Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, faces, text, and similar fine-grained regions. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.
https://arxiv.org/abs/2601.02046
We study timbre transfer as an inference-time editing problem for music audio. Starting from a strong pre-trained latent diffusion model, we introduce a lightweight procedure that requires no additional training: (i) a dimension-wise noise injection that targets latent channels most informative of instrument identity, and (ii) an early-step clamping mechanism that re-imposes the input's melodic and rhythmic structure during reverse diffusion. The method operates directly on audio latents and is compatible with text/audio conditioning (e.g., CLAP). We discuss design choices, analyze trade-offs between timbral change and structural preservation, and show that simple inference-time controls can meaningfully steer pre-trained models for style-transfer use cases.
https://arxiv.org/abs/2601.01294
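A minimal sketch of the two inference-time controls, assuming the audio latent is a (channels x time) tensor: noise is injected only on channels presumed to carry instrument identity, and during the earliest reverse-diffusion steps the remaining channels are clamped back to the source latent to preserve melody and rhythm. The channel indices, clamp fraction, and stand-in reverse step are all illustrative.

```python
"""Sketch of dimension-wise noise injection plus early-step structure clamping."""
import torch

def inject_noise(latent: torch.Tensor, timbre_channels: list[int], strength: float = 0.8):
    """latent: (C, T). Add noise only on channels assumed to carry instrument identity."""
    noisy = latent.clone()
    noisy[timbre_channels] += strength * torch.randn_like(noisy[timbre_channels])
    return noisy

def clamp_structure(x_t: torch.Tensor, source: torch.Tensor, step: int, total: int,
                    clamp_frac: float = 0.3, structure_channels=None):
    """During the first clamp_frac of reverse steps, re-impose the source's structure."""
    if step < clamp_frac * total and structure_channels is not None:
        x_t = x_t.clone()
        x_t[structure_channels] = source[structure_channels]
    return x_t

latent = torch.randn(64, 512)                    # (latent channels, time frames)
x = inject_noise(latent, timbre_channels=list(range(0, 16)))
for step in range(50):
    x = x - 0.01 * x                             # stand-in for one reverse-diffusion step
    x = clamp_structure(x, latent, step, 50, structure_channels=list(range(16, 64)))
print(x.shape)
```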
We present a lightweight two-stage framework for joint geometry and color inpainting of damaged 3D objects, motivated by the digital restoration of cultural heritage artifacts. The pipeline separates damage localization from reconstruction. In the first stage, a 2D convolutional network predicts damage masks on RGB slices extracted from a voxelized object, and these predictions are aggregated into a volumetric mask. In the second stage, a diffusion-based 3D U-Net performs mask-conditioned inpainting directly on voxel grids, reconstructing geometry and color while preserving observed regions. The model jointly predicts occupancy and color using a composite objective that combines occupancy reconstruction with masked color reconstruction and perceptual regularization. We evaluate the approach on a curated set of textured artifacts with synthetically generated damage using standard geometric and color metrics. Compared to symmetry-based baselines, our method produces more complete geometry and more coherent color reconstructions at a fixed 32^3 resolution. Overall, the results indicate that explicit mask conditioning is a practical way to guide volumetric diffusion models for joint 3D geometry and color inpainting.
https://arxiv.org/abs/2601.00368
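A hedged sketch of the composite objective: binary cross-entropy for occupancy, an L1 colour term restricted to the damaged mask, and a perceptual regulariser, here replaced by a simple total-variation stand-in since the paper's perceptual term is not specified. Weights and shapes are illustrative.

```python
"""Sketch of the occupancy + masked colour + perceptual composite loss."""
import torch
import torch.nn.functional as F

def composite_loss(pred_occ, pred_rgb, gt_occ, gt_rgb, mask,
                   w_col: float = 1.0, w_perc: float = 0.1):
    """pred_occ/gt_occ: (B,1,D,H,W) logits / {0,1}; pred_rgb/gt_rgb: (B,3,D,H,W);
    mask: (B,1,D,H,W) with 1 marking the damaged region to inpaint."""
    occ_loss = F.binary_cross_entropy_with_logits(pred_occ, gt_occ)
    col_loss = (mask * (pred_rgb - gt_rgb).abs()).sum() / mask.sum().clamp(min=1)
    tv = (pred_rgb[..., 1:] - pred_rgb[..., :-1]).abs().mean()  # perceptual stand-in
    return occ_loss + w_col * col_loss + w_perc * tv

B = 2
loss = composite_loss(torch.randn(B, 1, 32, 32, 32), torch.rand(B, 3, 32, 32, 32),
                      torch.randint(0, 2, (B, 1, 32, 32, 32)).float(),
                      torch.rand(B, 3, 32, 32, 32),
                      torch.randint(0, 2, (B, 1, 32, 32, 32)).float())
print(loss.item())
```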
Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.
https://arxiv.org/abs/2512.25066
Detecting surface changes from satellite imagery is critical for rapid disaster response and environmental monitoring, yet remains challenging due to the complex interplay between atmospheric noise, seasonal variations, and sensor artifacts. Here we show that deep learning can leverage the temporal redundancy of satellite time series to detect anomalies at unprecedented sensitivity, by learning to predict what the surface should look like in the absence of change. We train an inpainting model built upon the SATLAS foundation model to reconstruct the last frame of a Sentinel-2 time series from preceding acquisitions, using globally distributed training data spanning diverse climate zones and land cover types. When applied to regions affected by sudden surface changes, the discrepancy between prediction and observation reveals anomalies that traditional change detection methods miss. We validate our approach on earthquake-triggered surface ruptures from the 2023 Turkey-Syria earthquake sequence, demonstrating detection of a rift feature in Tepehan with higher sensitivity and specificity than temporal median or Reed-Xiaoli anomaly detectors. Our method reaches detection thresholds approximately three times lower than baseline approaches, providing a path towards automated, global-scale monitoring of surface changes from freely available multi-spectral satellite data.
https://arxiv.org/abs/2512.23986
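The detection principle, predict the change-free last frame and flag large residuals, can be sketched on synthetic data as below. A temporal median stands in for the SATLAS-based inpainting predictor, and the absolute threshold is illustrative.

```python
"""Sketch of residual-based change detection on a short image time series."""
import numpy as np

rng = np.random.default_rng(1)
T, H, W = 8, 64, 64
series = rng.normal(0.3, 0.02, (T, H, W))       # stable surface plus sensor noise
series[-1, 20:28, 30:50] += 0.15                # sudden surface change in the last frame

def predict_last_frame(frames: np.ndarray) -> np.ndarray:
    """Stand-in predictor: the learned model would forecast the change-free frame."""
    return np.median(frames[:-1], axis=0)

residual = np.abs(series[-1] - predict_last_frame(series))
anomaly = residual > 0.1                        # illustrative absolute detection threshold
print("flagged pixels:", int(anomaly.sum()))
```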
Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that, if a video inpainting model can be finetuned to remove the foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact: an Effect Expert, where LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and associated effects, and a fully LoRA-finetuned Quality Expert learns to refine the alpha matte. During sampling, Effect Expert is used for denoising at early, high-noise steps, while Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
https://arxiv.org/abs/2512.21865
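The dual-expert schedule amounts to routing each denoising step to one of two specialisations of the same backbone, as sketched below; the expert stubs, switch fraction, and step count are illustrative stand-ins for the LoRA-finetuned Effect and Quality Experts.

```python
"""Sketch of the dual-expert sampling schedule (single pass per step)."""
import torch

def effect_expert(x: torch.Tensor, t: int) -> torch.Tensor:
    """Stub: inpainting DiT with LoRA on effect-sensitive blocks only."""
    return x * 0.95

def quality_expert(x: torch.Tensor, t: int) -> torch.Tensor:
    """Stub: inpainting DiT with LoRA on all blocks, refining the alpha matte."""
    return x * 0.98

def sample(shape=(1, 4, 32, 32), steps: int = 30, switch_frac: float = 0.5):
    x = torch.randn(shape)
    for i, t in enumerate(reversed(range(steps))):   # high-noise steps first
        expert = effect_expert if i < switch_frac * steps else quality_expert
        x = expert(x, t)                             # one pass per step, no second full run
    return x

print(sample().abs().mean().item())
```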