Anomaly detection is a crucial task in computer vision, yet collecting real-world defect images is inherently difficult due to the rarity and unpredictability of anomalies. Consequently, researchers have turned to synthetic methods for training data augmentation. However, existing synthetic strategies (e.g., naive cut-and-paste or inpainting) overlook the underlying physical causes of defects, leading to inconsistent, low-fidelity anomalies that hamper model generalization to real-world complexities. In this thesis, we introduce a novel pipeline that generates synthetic anomalies through Math-Physics model guidance, refines them via a Coarse-to-Fine approach, and employs a bi-level optimization strategy with a Synthesis Quality Estimator (SQE). By incorporating physical modeling of cracks, corrosion, and deformation, our method produces realistic defect masks, which are subsequently enhanced in two stages. The first stage (npcF) enforces PDE-based consistency to achieve a globally coherent anomaly structure, while the second stage (npcF++) further improves local fidelity using wavelet transforms and boundary synergy blocks. Additionally, we leverage SQE-driven weighting, ensuring that high-quality synthetic samples receive greater emphasis during training. To validate our approach, we conduct comprehensive experiments on three widely adopted industrial anomaly detection benchmarks: MVTec AD, VisA, and BTAD. Across these datasets, the proposed pipeline achieves state-of-the-art (SOTA) results in both image-AUROC and pixel-AUROC, confirming the effectiveness of our MaPhC2F and BiSQAD.
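The SQE-driven weighting is only sketched in the abstract; the snippet below shows one plausible way such quality-aware sample weighting could look in PyTorch. The softmax-over-scores mapping and the temperature are assumptions for illustration, not the paper's exact BiSQAD formulation.

```python
import torch

def sqe_weighted_loss(per_sample_loss: torch.Tensor,
                      sqe_scores: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Re-weight a batch of per-sample losses by synthesis-quality scores.

    per_sample_loss: shape (B,), unreduced loss per synthetic sample.
    sqe_scores:      shape (B,), quality estimates in [0, 1] from an SQE model.
    Higher-quality samples receive larger weights via a softmax over scores.
    """
    weights = torch.softmax(sqe_scores / temperature, dim=0)  # sums to 1
    return (weights * per_sample_loss).sum()

# Usage idea: loss = sqe_weighted_loss(criterion(pred, target).mean(dim=(1, 2, 3)),
#                                      sqe_model(pred).squeeze(-1))
```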
https://arxiv.org/abs/2504.12970
Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image texture using its unmasked content. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate realistic content for missing regions. Despite their excellence, they ignore a hard constraint that the unmasked regions in the input and the output should be the same, resulting in a gap between GAN inversion and image inpainting and thus degrading performance. Besides, existing GAN inversion approaches often consider only a single modality of the input image, neglecting other auxiliary cues in images that could drive improvements. Addressing these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill primarily consists of a multimodal guided encoder with pre-modulation and a GAN generator with an F&W+ latent space. Specifically, the multimodal encoder aims to enhance the multi-scale structures with additional semantic segmentation and edge texture modalities through a gated mask-aware attention module. Afterwards, a pre-modulation step is presented to encode these structures into style vectors. To mitigate issues of conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, in order to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module to capture more diversified in-domain patterns for generating high-fidelity textures for massive corruptions. In our extensive experiments on six challenging datasets, we show that our MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and effectively supports the completion of out-of-domain images.
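The abstract does not spell out the Soft-update Mean Latent module; a common reading of a "soft update" is an exponential moving average of the generator's mean latent code, sketched below under that assumption (the momentum value is illustrative).

```python
import torch

@torch.no_grad()
def soft_update_mean_latent(mean_latent: torch.Tensor,
                            batch_latents: torch.Tensor,
                            momentum: float = 0.999) -> torch.Tensor:
    """EMA-style soft update of a running mean latent code.

    mean_latent:   shape (D,), current running mean in the latent space.
    batch_latents: shape (B, D), latent codes obtained in the current step.
    """
    batch_mean = batch_latents.mean(dim=0)
    return momentum * mean_latent + (1.0 - momentum) * batch_mean
```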
https://arxiv.org/abs/2504.12844
Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include: (1) the introduction of region-aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes; (2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions; and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of purely local image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at this https URL.
https://arxiv.org/abs/2504.12704
Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views, synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.
https://arxiv.org/abs/2504.11092
Implicit neural representation (INR) has emerged as a powerful paradigm for visual data representation. However, classical INR methods represent data in the original space with different frequency components mixed together, and several feature encoding parameters (e.g., the frequency parameter $\omega$ or the rank $R$) require manual configuration. In this work, we propose a self-evolving cross-frequency INR using the Haar wavelet transform (termed CF-INR), which decouples data into four frequency components and employs INRs in the wavelet space. CF-INR allows different frequency components to be characterized separately, thus enabling higher accuracy for data representation. To more precisely characterize cross-frequency components, we propose a cross-frequency tensor decomposition paradigm for CF-INR with self-evolving parameters, which automatically updates the rank parameter $R$ and the frequency parameter $\omega$ for each frequency component through self-evolving optimization. This self-evolution paradigm eliminates the laborious manual tuning of these parameters and learns a customized cross-frequency feature encoding configuration for each dataset. We evaluate CF-INR on a variety of visual data representation and recovery tasks, including image regression, inpainting, denoising, and cloud removal. Extensive experiments demonstrate that CF-INR outperforms state-of-the-art methods in each case.
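For concreteness, a single-level Haar decomposition into the four sub-bands that CF-INR models separately can be written with PyWavelets; this illustrates the transform only, not the paper's self-evolving tensor decomposition.

```python
import numpy as np
import pywt  # PyWavelets

def haar_decompose(image: np.ndarray) -> dict:
    """Split a 2D array into four Haar sub-bands: the low-frequency
    approximation plus horizontal, vertical, and diagonal details."""
    cA, (cH, cV, cD) = pywt.dwt2(image, 'haar')
    return {'approx': cA, 'horiz': cH, 'vert': cV, 'diag': cD}

def haar_reconstruct(b: dict) -> np.ndarray:
    """Inverse transform: merge the four sub-bands back into one array."""
    return pywt.idwt2((b['approx'], (b['horiz'], b['vert'], b['diag'])), 'haar')

img = np.random.rand(256, 256).astype(np.float32)
bands = haar_decompose(img)                      # each band is 128 x 128
assert np.allclose(haar_reconstruct(bands), img, atol=1e-5)
```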
https://arxiv.org/abs/2504.10929
Inpainting has recently emerged as a valuable and interesting technology to employ in the analysis of medical imaging data, in particular brain MRI. A wide variety of methodologies for inpainting MRI have been proposed and demonstrated on tasks including anomaly detection. In this work we investigate the statistical relationship between inpainted brain structures and the amount of subject-specific conditioning information, i.e., how much of the rest of the image is masked. In particular, we analyse the distribution of inpainting results when masking additional regions of the image, specifically the contra-lateral structure. This allows us to elucidate where in the brain the model draws information from and, in particular, how important hemispherical symmetry is. Our experiments interrogate a diffusion inpainting model by analysing the inpainting of subcortical brain structures based on intensity and estimated area change. We demonstrate that some structures show a strong influence of symmetry in the conditioning of the inpainting process.
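A minimal sketch of the contralateral-masking setup, assuming a roughly mid-sagittally aligned 2D slice so that the contralateral counterpart can be approximated by a left-right mirror of the structure mask (the real pipeline would of course work on registered 3D volumes).

```python
import numpy as np

def add_contralateral_mask(structure_mask: np.ndarray) -> np.ndarray:
    """Union a binary structure mask with its left-right mirror.

    structure_mask: 2D boolean array (H, W) marking one subcortical structure.
    Returns a mask covering the structure and its approximate contralateral
    counterpart, assuming the slice is symmetric about its vertical midline.
    """
    mirrored = np.flip(structure_mask, axis=1)  # flip left-right
    return structure_mask | mirrored
```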
https://arxiv.org/abs/2504.10039
Single-image 3D scene reconstruction presents significant challenges due to its inherently ill-posed nature and limited input constraints. Recent advances have explored two promising directions: multiview generative models that train on 3D consistent datasets but struggle with out-of-distribution generalization, and 3D scene inpainting and completion frameworks that suffer from cross-view inconsistency and suboptimal error handling, as they depend exclusively on depth data or 3D smoothness, which ultimately degrades output quality and computational performance. Building upon these approaches, we present GaussVideoDreamer, which advances generative multimedia approaches by bridging the gap between image, video, and 3D generation, integrating their strengths through two key innovations: (1) A progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence. (2) A 3D Gaussian Splatting consistency mask to guide the video diffusion with 3D consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and a progressive video inpainting strategy. Experimental results demonstrate that our approach achieves 32% higher LLaVA-IQA scores and at least 2x speedup compared to existing methods while maintaining robust performance across diverse scenes.
https://arxiv.org/abs/2504.10001
We propose an adaptation to the training of Vision Transformers (ViTs) that allows for an explicit modeling of objects during the attention computation. This is achieved by adding a new branch to selected attention layers that computes an auxiliary loss which we call the object-focused attention (OFA) loss. We restrict the attention to image patches that belong to the same object class, which allows ViTs to gain a better understanding of configural (or holistic) object shapes by focusing on intra-object patches instead of other patches such as those in the background. Our proposed inductive bias fits easily into the attention framework of transformers since it only adds an auxiliary loss over selected attention layers. Furthermore, our approach has no additional overhead during inference. We also experiment with multiscale masking to further improve the performance of our OFA model and give a path forward for self-supervised learning with our method. Our experimental results demonstrate that ViTs with OFA achieve better classification results than their base models, exhibit a stronger generalization ability to out-of-distribution (OOD) and adversarially corrupted images, and learn representations based on object shapes rather than spurious correlations via general textures. For our OOD setting, we generate a novel dataset using the COCO dataset and Stable Diffusion inpainting which we plan to share with the community.
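A hedged sketch of an object-focused attention loss in PyTorch: it penalizes the attention mass that each patch assigns to patches of a different object class. The exact aggregation in the paper may differ, and the per-patch class labels are assumed to come from a segmentation map downsampled to the patch grid.

```python
import torch

def object_focused_attention_loss(attn: torch.Tensor,
                                  patch_labels: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss penalising attention that leaves the query patch's object.

    attn:         shape (B, H, N, N), row-normalised attention weights.
    patch_labels: shape (B, N), per-patch object-class ids.
    The loss is the average attention mass each patch assigns to patches of a
    different class, so minimising it concentrates attention within objects.
    """
    same_class = patch_labels.unsqueeze(2) == patch_labels.unsqueeze(1)  # (B, N, N)
    off_object = (~same_class).unsqueeze(1).float()                      # (B, 1, N, N)
    return (attn * off_object).sum(dim=-1).mean()
```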
https://arxiv.org/abs/2504.08166
Deep neural networks (DNNs) have demonstrated remarkable success, yet their wide adoption is often hindered by their opaque decision-making. To address this, attribution methods have been proposed to assign relevance values to each part of the input. However, different methods often produce entirely different relevance maps, necessitating the development of standardized metrics to evaluate them. Typically, such evaluation is performed through perturbation, wherein high- or low-relevance regions of the input image are manipulated to examine the change in prediction. In this work, we introduce a novel approach, which harnesses image generation models to perform targeted perturbation. Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model's predictions while preserving image fidelity. This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results. Through extensive experiments, we demonstrate the effectiveness of our approach in generating meaningful rankings across a wide range of models and attribution methods. Crucially, we establish that the ranking produced by our metric exhibits significantly higher correlation with human preferences compared to existing approaches, underscoring its potential for enhancing interpretability in DNNs.
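The core of the metric, inpainting only the highest-relevance pixels and then measuring the prediction change, can be sketched as follows; `inpaint_fn` is a placeholder for whichever generative inpainting model is plugged in, and the top-fraction thresholding is an assumption about how "high-relevance" is operationalized.

```python
import numpy as np

def perturb_high_relevance(image: np.ndarray,
                           relevance: np.ndarray,
                           inpaint_fn,
                           top_fraction: float = 0.1) -> np.ndarray:
    """Inpaint only the most relevant pixels of an image.

    image:        (H, W, C) input image.
    relevance:    (H, W) attribution map for the model's predicted class.
    inpaint_fn:   callable(image, mask) -> image; any generative inpainter
                  (hypothetical placeholder, not a specific library API).
    top_fraction: fraction of pixels (highest relevance) to mask and inpaint.
    """
    threshold = np.quantile(relevance, 1.0 - top_fraction)
    mask = relevance >= threshold            # True where pixels will be replaced
    return inpaint_fn(image, mask)

# The metric then compares the classifier's score on the original vs. the
# inpainted image; a larger drop suggests a more faithful attribution map.
```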
https://arxiv.org/abs/2504.06800
Neural representations for video (NeRV) have gained considerable attention for their strong performance across various video tasks. However, existing NeRV methods often struggle to capture fine spatial details, resulting in vague reconstructions. In this paper, we present a Frequency Separation and Augmentation based Neural Representation for video (FANeRV), which addresses these limitations with its core Wavelet Frequency Upgrade block. This block explicitly separates input frames into high- and low-frequency components using the discrete wavelet transform, followed by targeted enhancement using specialized modules. Finally, a specially designed gated network effectively fuses these frequency components for optimal reconstruction. Additionally, convolutional residual enhancement blocks are integrated into the later stages of the network to balance parameter distribution and improve the restoration of high-frequency details. Experimental results demonstrate that FANeRV significantly improves reconstruction performance and excels in multiple tasks, including video compression, inpainting, and interpolation, outperforming existing NeRV methods.
https://arxiv.org/abs/2504.06755
Product posters, which integrate subject, scene, and text, are crucial promotional tools for attracting customers. Creating such posters using modern image generation methods is valuable, while the main challenge lies in accurately rendering text, especially for complex writing systems like Chinese, which contains over 10,000 individual characters. In this work, we identify that the key to precise text rendering is constructing a character-discriminative visual feature as a control signal. Based on this insight, we propose a robust character-wise representation as control and develop TextRenderNet, which achieves a high text rendering accuracy of over 90%. Another challenge in poster generation is maintaining the fidelity of user-specific products. We address this by introducing SceneGenNet, an inpainting-based model, and propose subject fidelity feedback learning to further enhance fidelity. Based on TextRenderNet and SceneGenNet, we present PosterMaker, an end-to-end generation framework. To optimize PosterMaker efficiently, we implement a two-stage training strategy that decouples text rendering and background generation learning. Experimental results show that PosterMaker outperforms existing baselines by a remarkable margin, which demonstrates its effectiveness.
https://arxiv.org/abs/2504.06632
Image forgery detection and localization (IFDL) is of vital importance as forged images can spread misinformation that poses potential threats to our daily lives. However, previous methods still struggle to effectively handle forged images processed with diverse forgery operations in real-world scenarios. In this paper, we propose a novel Reinforced Multi-teacher Knowledge Distillation (Re-MTKD) framework for the IFDL task, structured around an encoder-decoder **C**onvNeXt-**U**perNet along with an **E**dge-Aware Module, named Cue-Net. First, three Cue-Net models are separately trained for the three main types of image forgeries, i.e., copy-move, splicing, and inpainting, which then serve as the multi-teacher models to train the target Cue-Net student model through self-knowledge distillation. A Reinforced Dynamic Teacher Selection (Re-DTS) strategy is developed to dynamically assign weights to the involved teacher models, which facilitates specific knowledge transfer and enables the student model to effectively learn both the common and specific natures of diverse tampering traces. Extensive experiments demonstrate that, compared with other state-of-the-art methods, the proposed method achieves superior performance on several recently emerged datasets comprising various kinds of image forgeries.
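A simplified sketch of weighted multi-teacher distillation in PyTorch. Here the dynamic weights come from teacher confidence purely for illustration; the paper's Re-DTS strategy uses a reinforced selection mechanism that is not detailed in the abstract.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits: torch.Tensor,
                          teacher_logits: list,
                          temperature: float = 2.0) -> torch.Tensor:
    """Distil a student from several specialised teachers with dynamic weights.

    student_logits: (B, C) logits from the student model.
    teacher_logits: list of (B, C) logits, one per forgery-specific teacher.
    Per-teacher weights here come from a softmax over mean teacher confidence
    (an illustrative stand-in for the Re-DTS weighting).
    """
    confidences = torch.stack([t.softmax(dim=-1).amax(dim=-1).mean()
                               for t in teacher_logits])          # (T,)
    weights = confidences.softmax(dim=0)                          # (T,), sums to 1
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss = 0.0
    for w, t in zip(weights, teacher_logits):
        p_teacher = F.softmax(t / temperature, dim=-1)
        loss = loss + w * F.kl_div(log_p_student, p_teacher,
                                   reduction='batchmean') * temperature ** 2
    return loss
```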
https://arxiv.org/abs/2504.05224
Visible watermark removal, which involves watermark cleaning and background content restoration, is pivotal for evaluating the resilience of watermarks. Existing deep neural network (DNN)-based models still struggle with large-area watermarks and are overly dependent on the quality of watermark mask prediction. To overcome these challenges, we introduce a novel feature adapting framework that leverages the representation modeling capacity of a pre-trained image inpainting model. Our approach bridges the knowledge gap between image inpainting and watermark removal by fusing information about the residual background content beneath watermarks into the inpainting backbone model. We establish a dual-branch system to capture and embed features from the residual background content, which are merged into intermediate features of the inpainting backbone model via gated feature fusion modules. Moreover, to relieve the dependence on high-quality watermark masks, we introduce a new training paradigm that utilizes coarse watermark masks to guide the inference process. This yields a visible watermark removal model that is insensitive to the quality of the watermark mask during testing. Extensive experiments on both a large-scale synthesized dataset and a real-world dataset demonstrate that our approach significantly outperforms existing state-of-the-art methods. The source code is available in the supplementary materials.
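The gated feature fusion mentioned above can be illustrated with a generic per-pixel gating module; the kernel size and the exact way the two feature streams are mixed are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Fuse inpainting-backbone features with residual-background features
    through a learned per-pixel gate (a generic sketch of gated fusion)."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, backbone_feat: torch.Tensor,
                background_feat: torch.Tensor) -> torch.Tensor:
        # Gate decides, per position, how much background evidence to inject.
        g = self.gate(torch.cat([backbone_feat, background_feat], dim=1))
        return g * background_feat + (1.0 - g) * backbone_feat
```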
https://arxiv.org/abs/2504.04687
Inpainting for real-world human and pedestrian removal in high-resolution video clips presents significant challenges, particularly in achieving high-quality outcomes, ensuring temporal consistency, and managing complex object interactions that involve humans, their belongings, and their shadows. In this paper, we introduce VIP (Video Inpainting Pipeline), a novel promptless video inpainting framework for real-world human removal applications. VIP enhances a state-of-the-art text-to-video model with a motion module and employs a Variational Autoencoder (VAE) for progressive denoising in the latent space. Additionally, we implement an efficient human-and-belongings segmentation for precise mask generation. Sufficient experimental results demonstrate that VIP achieves superior temporal consistency and visual fidelity across diverse real-world scenarios, surpassing state-of-the-art methods on challenging datasets. Our key contributions include the development of the VIP pipeline, a reference frame integration technique, and the Dual-Fusion Latent Segment Refinement method, all of which address the complexities of inpainting in long, high-resolution video sequences.
https://arxiv.org/abs/2504.03041
Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework, capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a two-step diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a remarkable 15x speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.
https://arxiv.org/abs/2504.02261
Image inpainting aims to fill the missing region of an image. Recently, there has been a surge of interest in foreground-conditioned background inpainting, a sub-task that fills the background of an image while the foreground subject and associated text prompt are provided. Existing background inpainting methods typically strictly preserve the subject's original position from the source image, resulting in inconsistencies between the subject and the generated background. To address this challenge, we propose a new task, the "Text-Guided Subject-Position Variable Background Inpainting", which aims to dynamically adjust the subject position to achieve a harmonious relationship between the subject and the inpainted background, and propose the Adaptive Transformation Agent (A$^\text{T}$A) for this task. Firstly, we design a PosAgent Block that adaptively predicts an appropriate displacement based on given features to achieve variable subject-position. Secondly, we design the Reverse Displacement Transform (RDT) module, which arranges multiple PosAgent blocks in a reverse structure, to transform hierarchical feature maps from deep to shallow based on semantic information. Thirdly, we equip A$^\text{T}$A with a Position Switch Embedding to control whether the subject's position in the generated image is adaptively predicted or fixed. Extensive comparative experiments validate the effectiveness of our A$^\text{T}$A approach, which not only demonstrates superior inpainting capabilities in subject-position variable inpainting, but also ensures good performance on subject-position fixed inpainting.
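A rough sketch of the PosAgent idea: predict a displacement from pooled features and apply it to a feature map via a translation-only affine grid. The pooling, the tanh range, and the sign convention are illustrative choices, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PosAgentBlock(nn.Module):
    """Predict a normalised (dx, dy) displacement from features and apply it
    to a feature map (a simplified stand-in for a subject-position agent)."""

    def __init__(self, channels: int):
        super().__init__()
        self.head = nn.Linear(channels, 2)  # -> (dx, dy) in [-1, 1] coordinates

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        disp = torch.tanh(self.head(feat.mean(dim=(2, 3))))      # (B, 2)
        theta = torch.zeros(b, 2, 3, device=feat.device, dtype=feat.dtype)
        theta[:, 0, 0] = 1.0
        theta[:, 1, 1] = 1.0
        theta[:, :, 2] = -disp   # move the sampling grid opposite to the shift
        grid = F.affine_grid(theta, feat.shape, align_corners=False)
        return F.grid_sample(feat, grid, align_corners=False)
```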
https://arxiv.org/abs/2504.01603
Nuclear instance segmentation plays a vital role in disease diagnosis within digital pathology. However, limited labeled data in pathological images restricts the overall performance of nuclear instance segmentation. To tackle this challenge, we propose a novel data augmentation framework, the Instance Migration Diffusion Model (IM-Diffusion), designed to generate more varied pathological images by constructing diverse nuclear layouts and internuclear spatial relationships. In detail, we introduce a Nuclear Migration Module (NMM) which constructs diverse nuclear layouts by simulating the process of nuclear migration. Building on this, we further present an Internuclear-regions Inpainting Module (IIM) to generate diverse internuclear spatial relationships by structure-aware inpainting. On this basis, IM-Diffusion generates more diverse pathological images with different layouts and internuclear spatial relationships, thereby facilitating downstream tasks. Evaluations on the CoNSeP and GLySAC datasets demonstrate that the images generated by IM-Diffusion effectively enhance overall instance segmentation performance. Code will be made public later.
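A toy version of the nuclear-migration idea: translate each instance mask by a random offset to synthesize a new nuclear layout. Overlap handling and the migration dynamics in the actual NMM are certainly more sophisticated; this only illustrates the layout-perturbation step.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def migrate_instances(instance_map: np.ndarray,
                      max_shift: int = 10,
                      rng=None) -> np.ndarray:
    """Randomly translate each nuclear instance to build a new layout.

    instance_map: (H, W) integer map, 0 = background, k > 0 = nucleus id.
    Each instance mask is shifted by an independent random offset; overlaps
    simply overwrite, which is acceptable for a toy layout generator.
    """
    if rng is None:
        rng = np.random.default_rng()
    migrated = np.zeros_like(instance_map)
    for inst_id in np.unique(instance_map):
        if inst_id == 0:
            continue
        mask = (instance_map == inst_id).astype(np.float32)
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        moved = nd_shift(mask, (dy, dx), order=0, mode='constant', cval=0.0)
        migrated[moved > 0.5] = inst_id
    return migrated
```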
https://arxiv.org/abs/2504.01577
We present Pro-DG, a framework for procedurally controllable photo-realistic facade generation that combines a procedural shape grammar with diffusion-based image synthesis. Starting from a single input image, we reconstruct its facade layout using grammar rules, then edit that structure through user-defined transformations. As facades are inherently multi-hierarchical structures, we introduce a hierarchical matching procedure that aligns facade structures at different levels and is used to derive control maps that guide a generative diffusion pipeline. This approach retains local appearance fidelity while accommodating large-scale edits such as floor duplication or window rearrangement. We provide a thorough evaluation, comparing Pro-DG against inpainting-based baselines and synthetic ground truths. Our user study and quantitative measurements indicate improved preservation of architectural identity and higher edit accuracy. Our novel method is the first to integrate neuro-symbolically derived shape grammars for modeling with a modern generative model, and it highlights the broader potential of such approaches for precise and controllable image manipulation.
https://arxiv.org/abs/2504.01571
This paper introduces TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model with an inpainting adapter for high-quality and efficient inpainting. While standard diffusion models generate high-quality results, they incur high computational costs. We overcome this by training an inpainting adapter on a few-step distilled text-to-image model, DMD2, using a novel 3-step adversarial training scheme to ensure realistic, structurally consistent, and visually harmonious inpainted regions. To evaluate TurboFill, we propose two benchmarks: DilationBench, which tests performance across mask sizes, and HumanBench, based on human feedback for complex prompts. Experiments show that TurboFill outperforms both multi-step BrushNet and few-step inpainting methods, setting a new benchmark for high-performance inpainting tasks. Our project page: this https URL
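DilationBench is described only by name; a natural way to test performance across mask sizes is to grow a seed mask by binary dilation, as in the sketch below (the step counts are arbitrary and not the benchmark's actual settings).

```python
import numpy as np
from scipy.ndimage import binary_dilation

def dilated_masks(base_mask: np.ndarray, steps=(0, 5, 10, 20)) -> dict:
    """Generate progressively larger inpainting masks from a seed mask.

    base_mask: (H, W) boolean array marking the initial hole.
    steps:     numbers of dilation iterations; larger values mean bigger holes.
    """
    return {s: binary_dilation(base_mask, iterations=s) if s > 0 else base_mask
            for s in steps}

# Usage idea: run the inpainter on each mask size and track quality metrics
# as a function of hole area.
```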
https://arxiv.org/abs/2504.00996
Deep neural networks for aerial image segmentation require large amounts of labeled data, but high-quality aerial datasets with precise annotations are scarce and costly to produce. To address this limitation, we propose a self-supervised pretraining method that improves segmentation performance while reducing reliance on labeled data. Our approach uses inpainting-based pretraining, where the model learns to reconstruct missing regions in aerial images, capturing their inherent structure before being fine-tuned for road extraction. This method improves generalization, enhances robustness to domain shifts, and is invariant to model architecture and dataset choice. Experiments show that our pretraining significantly boosts segmentation accuracy, especially in low-data regimes, making it a scalable solution for aerial image analysis.
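A minimal sketch of inpainting-based self-supervised pretraining: random blocks of each aerial tile are hidden and the network is trained to reconstruct them before fine-tuning for road extraction. The block size, masking ratio, and L1 reconstruction loss are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def random_block_mask(images: torch.Tensor, block: int = 32, p: float = 0.4) -> torch.Tensor:
    """Mask out random square blocks of an image batch (1 = keep, 0 = hole).
    Assumes H and W are divisible by `block`."""
    b, _, h, w = images.shape
    grid = (torch.rand(b, 1, h // block, w // block, device=images.device) > p).float()
    return F.interpolate(grid, size=(h, w), mode='nearest')

def pretrain_step(model, images, optimizer):
    """One self-supervised step: reconstruct the hidden regions of aerial tiles."""
    mask = random_block_mask(images)
    recon = model(images * mask)                                # model sees visible pixels only
    loss = F.l1_loss(recon * (1 - mask), images * (1 - mask))   # loss on the holes only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```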
https://arxiv.org/abs/2503.24326