Personalized image synthesis has emerged as a pivotal application in text-to-image generation, enabling the creation of images featuring specific subjects in diverse contexts. While diffusion models have dominated this domain, auto-regressive models, with their unified architecture for text and image modeling, remain underexplored for personalized image generation. This paper investigates the potential of optimizing auto-regressive models for personalized image synthesis, leveraging their inherent multimodal capabilities to perform this task. We propose a two-stage training strategy that combines optimization of text embeddings and fine-tuning of transformer layers. Our experiments on an auto-regressive model demonstrate that this method achieves subject fidelity and prompt following comparable to leading diffusion-based personalization methods. The results highlight the effectiveness of auto-regressive models in personalized image generation, offering a new direction for future research in this area.
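A minimal PyTorch sketch of the two-stage recipe described above, using a toy decoder-only transformer as a stand-in for the actual auto-regressive backbone. The interpretation of "text embedding optimization" as learning a few injected subject tokens, as well as the model, dimensions, and loss, are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an auto-regressive text-to-image backbone:
# a transformer over a shared subject-token + image-token stream.
class ToyARModel(nn.Module):
    def __init__(self, vocab=1024, dim=128, layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, subject_emb, tokens):
        # Prepend the learnable subject embedding to the token embeddings.
        x = torch.cat([subject_emb.expand(tokens.size(0), -1, -1),
                       self.tok_emb(tokens)], dim=1)
        h = self.blocks(x)
        return self.head(h[:, subject_emb.size(1):])  # predict image tokens

model = ToyARModel()
subject_emb = nn.Parameter(torch.randn(1, 4, 128) * 0.02)  # new "subject" tokens
tokens = torch.randint(0, 1024, (2, 32))                    # fake tokenized images
targets = torch.randint(0, 1024, (2, 32))
loss_fn = nn.CrossEntropyLoss()

# Stage 1: optimize only the injected subject embedding, backbone frozen.
for p in model.parameters():
    p.requires_grad_(False)
opt1 = torch.optim.AdamW([subject_emb], lr=1e-3)
for _ in range(10):
    loss = loss_fn(model(subject_emb, tokens).flatten(0, 1), targets.flatten())
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: unfreeze the transformer layers and fine-tune jointly.
for p in model.blocks.parameters():
    p.requires_grad_(True)
opt2 = torch.optim.AdamW([subject_emb, *model.blocks.parameters()], lr=1e-5)
for _ in range(10):
    loss = loss_fn(model(subject_emb, tokens).flatten(0, 1), targets.flatten())
    opt2.zero_grad(); loss.backward(); opt2.step()
```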
https://arxiv.org/abs/2504.13162
This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating the reliance on model ensembles, redundant weights, and other computationally expensive components seen in previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. The challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at this https URL ChallengeCVPR-NTIRE2025.
https://arxiv.org/abs/2504.13131
Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings. Project page: this https URL
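To make the inversion setting concrete, here is a self-contained sketch of ODE inversion in a flow-matching model using a toy velocity field. The predictor-corrector loop shown (an Euler predictor refined by fixed-point iteration) is a generic illustration under my own assumptions, not the Uni-Inv algorithm itself.

```python
import torch

torch.manual_seed(0)

def velocity(x, t):
    # Toy stand-in for a learned flow-matching velocity field v_theta(x, t).
    return -x * (1.0 - t) + 0.1 * torch.sin(3.0 * x) * t

def generate(x0, steps=50):
    # Forward sampling: integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data).
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

def invert(x1, steps=50, corrector_iters=3):
    # Inversion: walk the trajectory backwards. The predictor step uses the
    # velocity at the current point; the corrector refines the previous point
    # by fixed-point iteration so that re-applying the forward Euler step
    # reproduces the current point more accurately.
    x, dt = x1.clone(), 1.0 / steps
    for i in reversed(range(steps)):
        t = i * dt
        x_prev = x - dt * velocity(x, t)            # predictor (naive reversal)
        for _ in range(corrector_iters):            # corrector (refinement)
            x_prev = x - dt * velocity(x_prev, t)
        x = x_prev
    return x

noise = torch.randn(4)
data = generate(noise)
recovered = invert(data)
print("reconstruction error:", (recovered - noise).abs().max().item())
```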
https://arxiv.org/abs/2504.13109
Computer vision is transforming fashion through Virtual Try-On (VTON) and Virtual Try-Off (VTOFF). VTON generates images of a person in a specified garment using a target photo and a standardized garment image, while a more challenging variant, Person-to-Person Virtual Try-On (p2p-VTON), uses a photo of another person wearing the garment. VTOFF, on the other hand, extracts standardized garment images from clothed individuals. We introduce TryOffDiff, a diffusion-based VTOFF model. Built on a latent diffusion framework with SigLIP image conditioning, it effectively captures garment properties like texture, shape, and patterns. TryOffDiff achieves state-of-the-art results on VITON-HD and strong performance on the DressCode dataset, covering upper-body garments, lower-body garments, and dresses. Enhanced with class-specific embeddings, it pioneers multi-garment VTOFF. When paired with VTON models, it improves p2p-VTON by minimizing unwanted attribute transfer, such as skin color. Code is available at: this https URL
https://arxiv.org/abs/2504.13078
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: motion dynamics are compromised to enhance temporal visual quality, video duration is constrained (5-10 seconds) to prioritize resolution, and shot-aware generation is inadequate because general-purpose MLLMs cannot interpret cinematic grammar such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address them, we propose SkyReels-V2, an infinite-length film generative model that synergizes a Multi-modal Large Language Model (MLLM), multi-stage pretraining, reinforcement learning, and a diffusion forcing framework. First, we design a comprehensive structural representation of video that combines general descriptions from the multi-modal LLM with detailed shot language from sub-expert models. Aided by human annotation, we then train a unified video captioner, SkyCaptioner-V1, to efficiently label the video data. Second, we establish progressive-resolution pretraining for fundamental video generation, followed by a four-stage post-training enhancement: initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; and a final high-quality SFT stage refines visual fidelity. All code and models are available at this https URL.
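A small PyTorch sketch of the diffusion-forcing idea referenced above: each frame in a window carries its own noise level, constrained here to be non-decreasing along time so that earlier frames are cleaner than later ones. The shapes, noise schedule, denoiser, and loss weighting are illustrative assumptions, not SkyReels-V2 internals.

```python
import torch

torch.manual_seed(0)

frames = torch.randn(2, 16, 8)  # (batch, num_frames, latent_dim) toy latents
B, T, D = frames.shape

# Sample per-frame noise levels in [0, 1) and sort them along time so the
# schedule is non-decreasing: earlier frames are less noisy than later ones.
sigma = torch.sort(torch.rand(B, T), dim=1).values          # (B, T)

noise = torch.randn_like(frames)
noisy = (1.0 - sigma)[..., None] * frames + sigma[..., None] * noise

def toy_denoiser(x, sigma):
    # Stand-in for the video model: it sees the noisy frames together with
    # their per-frame noise levels and predicts the clean latents.
    return x * (1.0 - sigma[..., None])  # placeholder computation

pred = toy_denoiser(noisy, sigma)
loss = ((pred - frames) ** 2 * sigma[..., None]).mean()  # weight noisier frames more
print("diffusion-forcing style loss:", loss.item())
```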
https://arxiv.org/abs/2504.13074
Scene-level 3D generation represents a critical frontier in multimedia and computer graphics, yet existing approaches either suffer from limited object categories or lack editing flexibility for interactive applications. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical "objects" under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.
https://arxiv.org/abs/2504.13072
Text-to-image models based on diffusion processes, such as DALL-E, Stable Diffusion, and Midjourney, are capable of transforming texts into detailed images and have widespread applications in art and design. As such, amateur users can easily imitate professional-level paintings by collecting an artist's work and fine-tuning the model, raising concerns about copyright infringement of artworks. To tackle these issues, previous studies either add visually imperceptible perturbations to the artwork to change its underlying style (perturbation-based methods) or embed post-training detectable watermarks in the artwork (watermark-based methods). However, when the artwork or the model has already been published online, i.e., modification of the original artwork or model retraining is not feasible, these strategies may not be viable. To this end, we propose ArtistAuditor, a novel method for data-use auditing in text-to-image generation models. The general idea of ArtistAuditor is to identify whether a suspicious model has been fine-tuned using the artworks of specific artists by analyzing style-related features. Concretely, ArtistAuditor employs a style extractor to obtain multi-granularity style representations and treats artworks as samplings of an artist's style. Then, ArtistAuditor queries a trained discriminator to obtain the auditing decisions. Experimental results on six combinations of models and datasets show that ArtistAuditor achieves high AUC values (> 0.937). By studying ArtistAuditor's transferability and core modules, we provide valuable insights into its practical implementation. Finally, we demonstrate the effectiveness of ArtistAuditor in real-world cases on the online platform Scenario. ArtistAuditor is open-sourced at this https URL.
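A hedged sketch of the style-auditing pipeline described above: Gram-matrix style statistics pulled from several depths of a small, untrained stand-in CNN serve as the multi-granularity style representation, and a logistic-regression discriminator turns them into an audit score evaluated with AUC. All architecture and data choices below are placeholders, not ArtistAuditor's actual components.

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

torch.manual_seed(0)

# Stand-in style extractor: a tiny CNN; Gram matrices of intermediate feature
# maps act as multi-granularity style statistics (coarse to fine).
convs = nn.ModuleList([
    nn.Conv2d(3, 8, 3, stride=2, padding=1),
    nn.Conv2d(8, 16, 3, stride=2, padding=1),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),
])

def style_features(img):
    feats, x = [], img
    for conv in convs:
        x = torch.relu(conv(x))
        b, c, h, w = x.shape
        f = x.reshape(b, c, h * w)
        gram = f @ f.transpose(1, 2) / (c * h * w)     # (b, c, c) Gram matrix
        feats.append(gram.reshape(b, -1))
    return torch.cat(feats, dim=1)

# Fake data: "member" artworks (style A) vs. "non-member" artworks (style B).
member = torch.rand(40, 3, 64, 64) ** 2.0       # darker, low-key style
non_member = torch.rand(40, 3, 64, 64) ** 0.5   # brighter, high-key style

with torch.no_grad():
    X = torch.cat([style_features(member), style_features(non_member)]).numpy()
y = [1] * 40 + [0] * 40

# Discriminator: did the suspicious model see this artist's style?
clf = LogisticRegression(max_iter=2000).fit(X[::2], y[::2])      # train split
auc = roc_auc_score(y[1::2], clf.predict_proba(X[1::2])[:, 1])   # held-out split
print("audit AUC on held-out artworks:", round(auc, 3))
```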
https://arxiv.org/abs/2504.13061
Remote Sensing Image Super-Resolution (RSISR) reconstructs high-resolution (HR) remote sensing images from low-resolution inputs to support fine-grained ground object interpretation. Existing methods face three key challenges: (1) Difficulty in extracting multi-scale features from spatially heterogeneous RS scenes, (2) Limited prior information causing semantic inconsistency in reconstructions, and (3) Trade-off imbalance between geometric accuracy and visual quality. To address these issues, we propose the Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) with three innovations: First, a Multi-scale Feature Aggregation Block (MFAB) employing parallel heterogeneous convolutional kernels for multi-scale feature extraction. Second, a Sparse Texture Transfer Guidance (STTG) module that transfers HR texture priors from reference images of similar scenes. Third, a Residual Denoising Dual Diffusion Model (RDDM) framework combining residual diffusion for deterministic reconstruction and noise diffusion for diverse generation. Experiments on multi-source RS datasets demonstrate TTRD3's superiority over state-of-the-art methods, achieving 1.43% LPIPS improvement and 3.67% FID enhancement compared to best-performing baselines. Code/model: this https URL.
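A minimal PyTorch sketch of what a multi-scale feature aggregation block with parallel heterogeneous kernels could look like; the kernel sizes, channel counts, and fusion step are my assumptions for illustration, not the exact MFAB design.

```python
import torch
import torch.nn as nn

class MultiScaleAggregationBlock(nn.Module):
    """Parallel branches with different receptive fields, fused by a 1x1 conv."""

    def __init__(self, channels=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in (1, 3, 5, 7)                       # heterogeneous kernels
        ])
        self.fuse = nn.Conv2d(4 * channels, channels, 1)
        self.act = nn.GELU()

    def forward(self, x):
        multi_scale = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.act(self.fuse(multi_scale))     # residual aggregation

block = MultiScaleAggregationBlock(64)
feat = torch.randn(1, 64, 32, 32)
print(block(feat).shape)  # torch.Size([1, 64, 32, 32])
```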
https://arxiv.org/abs/2504.13026
We present a novel approach to training specialized instruction-based image-editing diffusion models, addressing key challenges in structural preservation with input images and semantic alignment with user prompts. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves the realism and alignment with instructions in two ways. First, the proposed models achieve precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. Second, they capture fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. This approach simplifies users' efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that our models can perform intricate edits in complex scenes, after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where enhancing the visual realism of simulated environments through targeted sim-to-real image edits improves their utility as proxies for real-world settings.
https://arxiv.org/abs/2504.12833
As digital content becomes increasingly ubiquitous, the need for robust watermark removal techniques has grown, as existing embedding techniques often lack robustness. This paper introduces a novel Saliency-Aware Diffusion Reconstruction (SADRE) framework for watermark elimination on the web, combining adaptive noise injection, region-specific perturbations, and advanced diffusion-based reconstruction. SADRE disrupts embedded watermarks by injecting targeted noise into latent representations guided by saliency masks, while preserving essential image features. A reverse diffusion process ensures high-fidelity image restoration, leveraging adaptive noise levels determined by watermark strength. Our framework is theoretically grounded with stability guarantees and achieves robust watermark removal across diverse scenarios. Empirical evaluations on state-of-the-art (SOTA) watermarking techniques demonstrate SADRE's superiority in balancing watermark disruption and image quality. SADRE sets a new benchmark for watermark elimination, offering a flexible and reliable solution for real-world web content. Code is available at this https URL.
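A toy sketch of the core mechanism described above: noise is injected into a latent only where a saliency mask is active, with the noise level scaled by an estimated watermark strength. The latent, mask, and strength estimate are synthetic placeholders; the real pipeline would apply this inside a diffusion model's latent space and then run reverse diffusion.

```python
import torch

torch.manual_seed(0)

latent = torch.randn(1, 4, 32, 32)          # stand-in for an encoded image latent

# Saliency mask: 1 where the watermark is believed to live, 0 elsewhere.
mask = torch.zeros(1, 1, 32, 32)
mask[..., 20:30, 20:30] = 1.0

# Adaptive noise level: stronger watermarks get more aggressive perturbation.
watermark_strength = 0.7                     # assumed detector output in [0, 1]
sigma = 0.2 + 0.8 * watermark_strength

# Region-specific perturbation: only salient regions are pushed toward noise,
# everything else keeps its original latent content.
noise = torch.randn_like(latent)
perturbed = latent * (1 - mask) + mask * ((1 - sigma) * latent + sigma * noise)

# A reverse diffusion pass (omitted here) would then restore a clean image
# from `perturbed`, guided by the unmasked regions.
print("changed fraction:", (perturbed != latent).float().mean().item())
```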
https://arxiv.org/abs/2504.12809
Drag-driven editing has become popular among designers for its ability to modify complex geometric structures through simple and intuitive manipulation, allowing users to adjust and reshape content with minimal technical skill. This drag operation has been incorporated into numerous methods to facilitate the editing of 2D images and 3D meshes in design. However, few studies have explored drag-driven editing for the widely-used 3D Gaussian Splatting (3DGS) representation, as deforming 3DGS while preserving shape coherence and visual continuity remains challenging. In this paper, we introduce ARAP-GS, a drag-driven 3DGS editing framework based on As-Rigid-As-Possible (ARAP) deformation. Unlike previous 3DGS editing methods, we are the first to apply ARAP deformation directly to 3D Gaussians, enabling flexible, drag-driven geometric transformations. To preserve scene appearance after deformation, we incorporate an advanced diffusion prior for image super-resolution within our iterative optimization process. This approach enhances visual quality while maintaining multi-view consistency in the edited results. Experiments show that ARAP-GS outperforms current methods across diverse 3D scenes, demonstrating its effectiveness and superiority for drag-driven 3DGS editing. Additionally, our method is highly efficient, requiring only 10 to 20 minutes to edit a scene on a single RTX 3090 GPU.
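To make the ARAP idea concrete, here is a small NumPy sketch of the local step of as-rigid-as-possible deformation: for each point, the best-fitting rotation between its rest-pose and deformed neighborhoods is recovered from an SVD of the local covariance. Applying this to 3D Gaussian centers (and rotating each Gaussian's covariance accordingly) is the rough intuition; the neighborhood construction and uniform weights below are simplifying assumptions, not the ARAP-GS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

rest = rng.normal(size=(200, 3))                 # rest-pose points (e.g. Gaussian centers)
# A known rigid motion plus mild noise stands in for a user drag deformation.
angle = 0.4
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
deformed = rest @ R_true.T + 0.01 * rng.normal(size=rest.shape)

def knn(points, k=8):
    d = np.linalg.norm(points[:, None] - points[None], axis=-1)
    return np.argsort(d, axis=1)[:, 1:k + 1]      # exclude the point itself

def local_rotations(rest, deformed, neighbors):
    """ARAP local step: per-point rotation via SVD of the neighborhood covariance."""
    rotations = np.zeros((len(rest), 3, 3))
    for i, nbrs in enumerate(neighbors):
        e_rest = rest[nbrs] - rest[i]              # rest-pose edges
        e_def = deformed[nbrs] - deformed[i]       # deformed edges
        S = e_rest.T @ e_def                       # 3x3 covariance
        U, _, Vt = np.linalg.svd(S)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                   # fix reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        rotations[i] = R
    return rotations

R_est = local_rotations(rest, deformed, knn(rest))
print("max deviation from true rotation:", np.abs(R_est - R_true).max())
```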
https://arxiv.org/abs/2504.12788
The rapid advancement of diffusion models and personalization techniques has made it possible to recreate individual portraits from just a few publicly available images. While such capabilities empower various creative applications, they also introduce serious privacy concerns, as adversaries can exploit them to generate highly realistic impersonations. To counter these threats, anti-personalization methods have been proposed, which add adversarial perturbations to published images to disrupt the training of personalization models. However, existing approaches largely overlook the intrinsic multi-image nature of personalization and instead adopt a naive strategy of applying perturbations independently, as commonly done in single-image settings. This neglects the opportunity to leverage inter-image relationships for stronger privacy protection. Therefore, we advocate for a group-level perspective on privacy protection against personalization. Specifically, we introduce Cross-image Anti-Personalization (CAP), a novel framework that enhances resistance to personalization by enforcing style consistency across perturbed images. Furthermore, we develop a dynamic ratio adjustment strategy that adaptively balances the impact of the consistency loss throughout the attack iterations. Extensive experiments on the classical CelebHQ and VGGFace2 benchmarks show that CAP substantially improves existing methods.
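A toy PyTorch sketch of the group-level idea: adversarial perturbations for a batch of images are optimized jointly, with a Gram-matrix style-consistency term that ties the perturbed images together and a weight that is re-balanced across iterations. The surrogate "personalization loss", feature extractor, and schedule are placeholders of my own, not CAP's actual objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

images = torch.rand(4, 3, 64, 64)                  # the user's published photos
delta = torch.zeros_like(images, requires_grad=True)
feat = nn.Conv2d(3, 8, 3, padding=1)               # stand-in feature extractor
opt = torch.optim.Adam([delta], lr=1e-2)
eps = 8 / 255                                      # L_inf perturbation budget

def gram(x):
    b, c, h, w = x.shape
    f = x.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

for step in range(50):
    adv = (images + delta).clamp(0, 1)
    f = feat(adv)

    # Surrogate "anti-personalization" term: push features away from clean ones.
    personalization_loss = -((f - feat(images).detach()) ** 2).mean()

    # Cross-image style consistency: all perturbed images should share one style.
    g = gram(f)
    consistency_loss = ((g - g.mean(dim=0, keepdim=True)) ** 2).mean()

    # Dynamic ratio: gradually shift weight from the attack term to consistency.
    ratio = step / 50
    loss = (1 - ratio) * personalization_loss + ratio * consistency_loss

    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                    # keep perturbation imperceptible

print("final perturbation range:", delta.abs().max().item())
```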
https://arxiv.org/abs/2504.12747
Robotic manipulation faces critical challenges in understanding spatial affordances--the "where" and "how" of object interactions--essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A0 leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A0 is pre-trained on 1 million contact points data and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The model's output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman, and Dobot) demonstrate A0's superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
https://arxiv.org/abs/2504.12636
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames so that the transformer context length is a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to those of image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be fine-tuned with FramePack, and their visual quality may be improved because next-frame prediction supports more balanced diffusion schedulers with less extreme flow-shift timesteps.
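A small PyTorch sketch of the compression idea: older frames are pooled to progressively fewer tokens, so the total context handed to the transformer stays approximately constant no matter how many frames of history exist. The geometric pooling schedule and token sizes are my own illustrative assumptions, not the actual FramePack kernel layout.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def pack_history(frames, base_tokens=256):
    """Compress a list of past frame latents (newest last) into a bounded context.

    The newest frame keeps `base_tokens` tokens; each step further into the past
    is pooled to roughly half as many tokens, so the total stays bounded
    regardless of history length (geometric series).
    """
    packed = []
    for age, frame in enumerate(reversed(frames)):        # age 0 = newest
        c, h, w = frame.shape
        tokens = max(base_tokens // (2 ** age), 1)
        side = max(int(tokens ** 0.5), 1)
        pooled = F.adaptive_avg_pool2d(frame.unsqueeze(0), side)   # (1, c, side, side)
        packed.append(pooled.flatten(2).transpose(1, 2))           # (1, side*side, c)
    return torch.cat(packed, dim=1)                                # (1, total_tokens, c)

history = [torch.randn(16, 32, 32) for _ in range(12)]   # 12 past frame latents
context = pack_history(history)
print("frames:", len(history), "-> context tokens:", context.shape[1])
# Doubling the history barely changes the context length:
print("24 frames ->", pack_history(history * 2).shape[1], "tokens")
```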
https://arxiv.org/abs/2504.12626
Restoring images afflicted by complex real-world degradations remains challenging, as conventional methods often fail to adapt to the unique mixture and severity of artifacts present. This stems from a reliance on indirect cues which poorly capture the true perceptual quality deficit. To address this fundamental limitation, we introduce AdaQual-Diff, a diffusion-based framework that integrates perceptual quality assessment directly into the generative restoration process. Our approach establishes a mathematical relationship between regional quality scores from DeQAScore and optimal guidance complexity, implemented through an Adaptive Quality Prompting mechanism. This mechanism systematically modulates prompt structure according to measured degradation severity: regions with lower perceptual quality receive computationally intensive, structurally complex prompts with precise restoration directives, while higher quality regions receive minimal prompts focused on preservation rather than intervention. The technical core of our method lies in the dynamic allocation of computational resources proportional to degradation severity, creating a spatially-varying guidance field that directs the diffusion process with mathematical precision. By combining this quality-guided approach with content-specific conditioning, our framework achieves fine-grained control over regional restoration intensity without requiring additional parameters or inference iterations. Experimental results demonstrate that AdaQual-Diff achieves visually superior restorations across diverse synthetic and real-world datasets.
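A hedged sketch of the mapping from regional quality scores to a spatially varying guidance field: lower-quality regions get a larger guidance scale (and would get richer restoration prompts), higher-quality regions get a near-neutral one. The inverse-linear mapping and score range are assumptions for illustration, not AdaQual-Diff's actual formula.

```python
import torch

torch.manual_seed(0)

# Per-region perceptual quality scores (e.g. from a DeQAScore-like assessor),
# here on an 8x8 grid of image regions with scores in [1, 5].
quality = 1.0 + 4.0 * torch.rand(8, 8)

def guidance_field(quality, q_min=1.0, q_max=5.0, g_min=1.0, g_max=9.0):
    """Map quality scores to guidance scales: worse quality -> stronger guidance."""
    deficit = (q_max - quality) / (q_max - q_min)          # 0 = pristine, 1 = worst
    return g_min + (g_max - g_min) * deficit

guidance = guidance_field(quality)

# Upsample the region-level field to pixel resolution so it can modulate
# the diffusion guidance spatially during restoration.
pixel_guidance = torch.nn.functional.interpolate(
    guidance[None, None], size=(256, 256), mode="bilinear", align_corners=False
)[0, 0]

print("worst region guidance:", guidance.max().item())
print("best region guidance:", guidance.min().item())
print("pixel field shape:", tuple(pixel_guidance.shape))
```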
https://arxiv.org/abs/2504.12605
The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning. However, due to the high-dimensional nature and complex feature representations of diffusion models, achieving selective unlearning remains challenging, as existing methods struggle to remove sensitive information while preserving the consistency of non-sensitive regions. To address this, we propose an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, constructing the ForgetMe dataset and introducing the Entangled evaluation metric. The Entangled metric quantifies unlearning effectiveness by assessing the similarity and consistency between the target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe dataset encompasses a diverse set of real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on this dataset and validate the effectiveness of both the ForgetMe dataset and the Entangled metric, establishing them as benchmarks for selective unlearning. Our work provides a scalable and adaptable solution for advancing privacy-preserving generative AI.
https://arxiv.org/abs/2504.12574
Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.
https://arxiv.org/abs/2504.12540
How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train diffusion models is the score function of the training distribution; and the networks usually used to learn the score function are expressive enough to learn this score to high accuracy. We claim that a certain feature of the DSM objective -- the fact that its target is not the training distribution's score, but a noisy quantity only equal to it in expectation -- strongly impacts whether and to what extent diffusion models generalize. In this paper, we develop a mathematical theory that partly explains this 'generalization through variance' phenomenon. Our theoretical analysis exploits a physics-inspired path integral approach to compute the distributions typically learned by a few paradigmatic under- and overparameterized diffusion models. We find that the distributions diffusion models effectively learn to sample from resemble their training distributions, but with 'gaps' filled in, and that this inductive bias is due to the covariance structure of the noisy target used during training. We also characterize how this inductive bias interacts with feature-related inductive biases.
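For concreteness, here is a standard way to write the point the authors build on, in common denoising score matching notation (the paper's exact conventions may differ): the regression target in DSM is the conditional score, a noisy quantity whose conditional expectation is the marginal (training-distribution) score, which is therefore the DSM optimum.

```latex
% Denoising score matching with x_t = \alpha_t x_0 + \sigma_t \varepsilon:
\mathcal{L}_{\mathrm{DSM}}(\theta)
  = \mathbb{E}_{x_0 \sim p_{\mathrm{data}},\, \varepsilon \sim \mathcal{N}(0, I),\, t}
    \Big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \Big\|^2,
\qquad
\nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{x_t - \alpha_t x_0}{\sigma_t^2}
  = -\frac{\varepsilon}{\sigma_t}.

% The per-sample target is noisy, but its conditional expectation is the
% marginal score of the training distribution, which is also the DSM optimum:
\mathbb{E}_{x_0 \sim p(x_0 \mid x_t)}\!\left[ \nabla_{x_t} \log p_t(x_t \mid x_0) \right]
  = \nabla_{x_t} \log p_t(x_t)
  \;\Longrightarrow\;
s_\theta^{\star}(x_t, t) = \nabla_{x_t} \log p_t(x_t).
```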
https://arxiv.org/abs/2504.12532
Mobile robots on construction sites require accurate pose estimation to perform autonomous surveying and inspection missions. Localization in construction sites is a particularly challenging problem due to the presence of repetitive features such as flat plastered walls, and perceptual aliasing caused by apartments with similar layouts both within and across floors. In this paper, we focus on the global re-positioning of a robot with respect to an accurate scanned mesh of the building, using LiDAR data alone. In our approach, a neural network is trained on synthetic LiDAR point clouds generated by simulating a LiDAR sensor in an accurate, real-life, large-scale mesh. We train a diffusion model with a PointNet++ backbone, which allows us to model multiple position candidates from a single LiDAR point cloud. The resulting model can successfully predict the global position of the LiDAR in confined and complex sites despite the adverse effects of perceptual aliasing. The learned distribution over potential global positions provides a multi-modal position estimate. We evaluate our approach across five real-world datasets and show an average place recognition accuracy of 77% within +/- 2 m, while outperforming baselines by a factor of 2 in mean error.
https://arxiv.org/abs/2504.12412
Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character customization built upon a foundation diffusion transformer. InstantCharacter demonstrates three fundamental advantages: first, it achieves open-domain personalization across diverse character appearances, poses, and styles while maintaining high-fidelity results. Second, the framework introduces a scalable adapter with stacked transformer encoders, which effectively processes open-domain character features and seamlessly interacts with the latent space of modern diffusion transformers. Third, to effectively train the framework, we construct a large-scale character dataset containing on the order of ten million samples. The dataset is systematically organized into paired (multi-view character) and unpaired (text-image combinations) subsets. This dual-data structure enables simultaneous optimization of identity consistency and textual editability through distinct learning pathways. Qualitative experiments demonstrate the advanced capabilities of InstantCharacter in generating high-fidelity, text-controllable, and character-consistent images, setting a new benchmark for character-driven image generation. Our source code is available at this https URL.
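A minimal PyTorch sketch of the kind of adapter the abstract describes: character features are refined by a stack of transformer encoders and then injected into the diffusion transformer's latent tokens through cross-attention. The dimensions, depth, gating, and injection point are my own assumptions, not InstantCharacter's architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class CharacterAdapter(nn.Module):
    """Stacked encoders over character features + cross-attention injection."""

    def __init__(self, char_dim=256, latent_dim=512, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(char_dim, nhead=heads,
                                           dim_feedforward=4 * char_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(char_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))     # starts as an identity mapping

    def forward(self, latent_tokens, char_features):
        # Refine open-domain character features with the stacked encoders.
        refined = self.proj(self.encoder(char_features))
        # Inject them into the diffusion transformer's latent tokens.
        injected, _ = self.cross_attn(latent_tokens, refined, refined)
        return latent_tokens + torch.tanh(self.gate) * injected

adapter = CharacterAdapter()
latents = torch.randn(2, 1024, 512)      # diffusion-transformer latent tokens
char_feats = torch.randn(2, 77, 256)     # features of the reference character
print(adapter(latents, char_feats).shape)  # torch.Size([2, 1024, 512])
```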
https://arxiv.org/abs/2504.12395