The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and the visual mismatch between the human hand and a robot end-effector. In this sense, the human hand presents a nuisance. However, often hands also provide a valuable signal, e.g. the hand pose may suggest what kind of object is being held. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM) that leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.
https://arxiv.org/abs/2305.16301
Audio inpainting aims to reconstruct missing segments in corrupted recordings. Previous methods produce plausible reconstructions when the gap length is shorter than about 100 ms, but the quality decreases for longer gaps. This paper explores recent advancements in deep learning and, particularly, diffusion models, for the task of audio inpainting. The proposed method uses an unconditionally trained generative model, which can be conditioned in a zero-shot fashion for audio inpainting, offering high flexibility to regenerate gaps of arbitrary length. An improved deep neural network architecture based on the constant-Q transform, which allows the model to exploit pitch-equivariant symmetries in audio, is also presented. The performance of the proposed algorithm is evaluated through objective and subjective metrics for the task of reconstructing short to mid-sized gaps. The results of a formal listening test show that the proposed method delivers performance comparable to the state of the art for short gaps, while retaining good audio quality and outperforming the baselines for the longest gap lengths tested, 150 ms and 200 ms. This work helps improve the restoration of sound recordings containing fairly long local disturbances or dropouts that must be reconstructed.
https://arxiv.org/abs/2305.15266
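As a rough illustration of the zero-shot conditioning idea in the abstract above (an unconditionally trained diffusion model steered only at sampling time so that just the gap is regenerated), here is a minimal sketch under assumed interfaces. The `model` callable is a hypothetical noise-prediction network, and the data-consistency scheme is a generic RePaint-style one, not necessarily the paper's exact procedure.

```python
import torch

def zero_shot_inpaint(model, x_known, mask, alphas_cumprod, steps):
    """Zero-shot inpainting with an unconditionally trained diffusion model.

    model(x_t, t) is assumed to predict the noise (epsilon) at step t;
    mask is 1 inside the gap to regenerate and 0 where the signal is known.
    At every reverse step the known part is overwritten with a re-noised
    copy of the observation, so only the gap is actually generated.
    """
    x_t = torch.randn_like(x_known)                      # start from pure noise
    for t in reversed(range(steps)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)

        eps = model(x_t, t)                                       # predicted noise
        x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # estimated clean signal
        x_t = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps  # deterministic DDIM step

        noise = torch.randn_like(x_known)
        x_known_t = a_prev.sqrt() * x_known + (1 - a_prev).sqrt() * noise
        x_t = mask * x_t + (1 - mask) * x_known_t         # paste back the known region
    return x_t

# toy usage with a dummy model that predicts zero noise
model = lambda x, t: torch.zeros_like(x)
alphas = torch.linspace(0.999, 0.01, 50)                  # decreasing cumulative alphas
signal = torch.randn(1, 1, 1024)                          # a 1-D audio-like signal
gap = torch.zeros_like(signal); gap[..., 400:600] = 1.0   # region to reconstruct
restored = zero_shot_inpaint(model, signal, gap, alphas, steps=50)
```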
Neural Radiance Fields (NeRF) can generate highly realistic novel views. However, editing 3D scenes represented by NeRF across 360-degree views, particularly removing objects while preserving geometric and photometric consistency, remains a challenging problem due to NeRF's implicit scene representation. In this paper, we propose InpaintNeRF360, a unified framework that utilizes natural language instructions as guidance for inpainting NeRF-based 3D scenes. Our approach employs a promptable segmentation model by generating multi-modal prompts from the encoded text for multiview segmentation. We apply depth-space warping to enforce viewing consistency in the segmentations, and further refine the inpainted NeRF model using perceptual priors to ensure visual plausibility. InpaintNeRF360 is capable of simultaneously removing multiple objects or modifying object appearance based on text instructions while synthesizing 3D viewing-consistent and photo-realistic inpainting. Through extensive experiments on both unbounded and frontal-facing scenes trained through NeRF, we demonstrate the effectiveness of our approach and showcase its potential to enhance the editability of implicit radiance fields.
https://arxiv.org/abs/2305.15094
Image alignment and image restoration are classical computer vision tasks. However, there is still a lack of datasets that provide enough data to train and evaluate end-to-end deep learning models. Obtaining ground-truth data for image alignment requires sophisticated structure-from-motion methods or optical flow systems that often do not provide enough data variance, i.e., typically providing a high number of image correspondences, while only introducing few changes of scenery within the underlying image sequences. Alternative approaches utilize random perspective distortions on existing image data. However, this only provides trivial distortions, lacking the complexity and variance of real-world scenarios. Instead, our proposed data augmentation helps to overcome the issue of data scarcity by using 3D rendering: images are added as textures onto a plane, then varying lighting conditions, shadows, and occlusions are added to the scene. The scene is rendered from multiple viewpoints, generating perspective distortions more consistent with real-world scenarios, with homographies closely resembling those of camera projections rather than randomized homographies. For each scene, we provide a sequence of distorted images with corresponding occlusion masks, homographies, and ground-truth labels. The resulting dataset can serve as a training and evaluation set for a multitude of tasks involving image alignment and artifact removal, such as deep homography estimation, dense image matching, 2D bundle adjustment, inpainting, shadow removal, denoising, content retrieval, and background subtraction. Our data generation pipeline is customizable and can be applied to any existing dataset, serving as a data augmentation to further improve the feature learning of any existing method.
https://arxiv.org/abs/2305.12036
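For reference, the "camera-projection-like" homographies the abstract above contrasts with randomized ones have a closed form: for a textured plane with normal $n$ and distance $d$ in the first camera's frame, relative pose $(R, t)$ between the two views, and intrinsics $K$, $K'$, the induced homography is (one standard formulation; sign conventions for the plane vary):

$$ H \;=\; K' \left( R - \frac{t\, n^{\top}}{d} \right) K^{-1} . $$

Rendering the textured plane from multiple viewpoints therefore yields ground-truth homographies of exactly this structured form, rather than arbitrary randomized 8-parameter perspective warps.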
Text-driven 3D scene generation is widely applicable to video gaming, film industry, and metaverse applications that have a large demand for 3D scenes. However, existing text-to-3D generation methods are limited to producing 3D objects with simple geometries and dreamlike styles that lack realism. In this work, we present Text2NeRF, which is able to generate a wide range of 3D scenes with complicated geometric structures and high-fidelity textures purely from a text prompt. To this end, we adopt NeRF as the 3D representation and leverage a pre-trained text-to-image diffusion model to constrain the 3D reconstruction of the NeRF to reflect the scene description. Specifically, we employ the diffusion model to infer the text-related image as the content prior and use a monocular depth estimation method to offer the geometric prior. Both content and geometric priors are utilized to update the NeRF model. To guarantee textured and geometric consistency between different views, we introduce a progressive scene inpainting and updating strategy for novel view synthesis of the scene. Our method requires no additional training data but only a natural language description of the scene as the input. Extensive experiments demonstrate that our Text2NeRF outperforms existing methods in producing photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of natural language prompts.
https://arxiv.org/abs/2305.11588
Recent advancements in Text-to-Image (T2I) generative models have yielded impressive results in generating high-fidelity images based on consistent text prompts. However, there is a growing interest in exploring the potential of these models for more diverse reference-based image manipulation tasks that require spatial understanding and visual context. Previous approaches have achieved this by incorporating additional control modules or fine-tuning the generative models specifically for each task until convergence. In this paper, we propose a different perspective. We conjecture that current large-scale T2I generative models already possess the capability to perform these tasks, but that this capability is not fully activated within the standard generation process. To unlock these capabilities, we introduce a unified Prompt-Guided In-Context inpainting (PGIC) framework, which leverages large-scale T2I models to re-formulate and solve reference-guided image manipulations. In the PGIC framework, the reference and the masked target are stitched together as a new input for the generative model, so that filling the masked regions produces the final results. Furthermore, we demonstrate that the self-attention modules in T2I models are well-suited for establishing spatial correlations and efficiently addressing challenging reference-guided manipulations. These large T2I models can be effectively driven by task-specific prompts with minimal training cost or even with frozen backbones. We synthetically evaluate the effectiveness of the proposed PGIC framework across various tasks, including reference-guided image inpainting, faithful inpainting, outpainting, local super-resolution, and novel view synthesis. Our results show that PGIC achieves significantly better performance while requiring less computation compared to other fine-tuning based approaches.
https://arxiv.org/abs/2305.11577
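A minimal sketch of the stitched-input construction described above, assuming a simple side-by-side layout; `stitch_for_inpainting` is a hypothetical helper, and the paper's exact stitching and mask conventions may differ.

```python
import torch

def stitch_for_inpainting(reference, target, target_mask):
    """Sketch of a 'reference + masked target' stitched input.

    reference, target : (3, H, W) images; target_mask : (1, H, W), 1 = fill.
    The two images are concatenated side by side so a pre-trained
    inpainting model can borrow appearance from the reference half via
    self-attention while completing the masked half.
    """
    masked_target = target * (1 - target_mask)
    canvas = torch.cat([reference, masked_target], dim=-1)            # stitch along width
    mask = torch.cat([torch.zeros_like(target_mask), target_mask], dim=-1)
    return canvas, mask
```

The point of the construction is that the inpainting model sees the reference pixels and the masked target in a single canvas, so its existing self-attention can establish the spatial correspondences needed to copy appearance into the masked region.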
Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce \textbf{TextDiffuser}, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text images dataset with OCR annotations, \textbf{MARIO-10M}, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the \textbf{MARIO-Eval} benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text. The code, model, and dataset will be available at \url{this https URL}.
https://arxiv.org/abs/2305.10855
We present software that predicts non-cleft facial images for patients with cleft lip, thereby facilitating the understanding, awareness, and discussion of cleft lip surgeries. To protect patients' privacy, we design a software framework using image inpainting, which does not require cleft lip images for training, thereby mitigating the risk of model leakage. We implement a novel multi-task architecture that predicts both the non-cleft facial image and facial landmarks, resulting in better performance as evaluated by surgeons. The software is implemented with PyTorch and works on consumer-level color images with a fast prediction speed, enabling effective deployment.
https://arxiv.org/abs/2305.10589
The emergence of Neural Radiance Fields (NeRF) for novel view synthesis has led to increased interest in 3D scene editing. One important task in editing is removing objects from a scene while ensuring visual reasonability and multiview consistency. However, current methods face challenges such as time-consuming object labelling, limited capability to remove specific targets, and compromised rendering quality after removal. This paper proposes a novel object-removing pipeline, named OR-NeRF, that can remove objects from 3D scenes with either point or text prompts on a single view, achieving better performance in less time than previous works. Our method uses a points projection strategy to rapidly spread user annotations to all views, significantly reducing the processing burden. This algorithm allows us to leverage the recent 2D segmentation model Segment-Anything (SAM) to predict masks with improved precision and efficiency. Additionally, we obtain colour and depth priors through 2D inpainting methods. Finally, our algorithm employs depth supervision and perceptual loss for scene reconstruction to maintain consistency in geometry and appearance after object removal. Experimental results demonstrate that our method achieves better editing quality with less time than previous works, considering both quality and quantity.
https://arxiv.org/abs/2305.10503
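A small sketch of the reprojection step behind the points-projection strategy described above: once the user-annotated points are lifted to 3D (e.g., using rendered depth, which is omitted here), they can be projected into every other training view to serve as point prompts for SAM. The helper below is a hypothetical illustration using a standard pinhole camera model, not the paper's code.

```python
import numpy as np

def spread_point_prompts(points_3d, intrinsics, world_to_cam):
    """Project user-annotated 3D points into another training view.

    points_3d    : (n, 3) points lifted from the annotated view.
    intrinsics   : (3, 3) camera matrix K of the target view.
    world_to_cam : (4, 4) world-to-camera transform of the target view.
    Returns (n, 2) pixel coordinates usable as point prompts for a 2D
    segmentation model such as SAM in that view.
    """
    pts_h = np.concatenate([points_3d, np.ones((len(points_3d), 1))], axis=1)
    cam = (world_to_cam @ pts_h.T)[:3]             # points in camera coordinates
    px = intrinsics @ cam
    return (px[:2] / px[2]).T                      # perspective divide -> pixels
```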
Uncertainty quantification for inverse problems in imaging has drawn much attention lately. Existing approaches towards this task define uncertainty regions based on probable values per pixel, while ignoring spatial correlations within the image, resulting in an exaggerated volume of uncertainty. In this paper, we propose PUQ (Principal Uncertainty Quantification) -- a novel definition and corresponding analysis of uncertainty regions that takes into account spatial relationships within the image, thus providing reduced volume regions. Using recent advancements in stochastic generative models, we derive uncertainty intervals around principal components of the empirical posterior distribution, forming an ambiguity region that guarantees the inclusion of true unseen values with a user confidence probability. To improve computational efficiency and interpretability, we also guarantee the recovery of true unseen values using only a few principal directions, resulting in ultimately more informative uncertainty regions. Our approach is verified through experiments on image colorization, super-resolution, and inpainting; its effectiveness is shown through comparison to baseline methods, demonstrating significantly tighter uncertainty regions.
https://arxiv.org/abs/2305.10124
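A rough sketch of the core construction above (uncertainty intervals along a few principal directions of posterior samples rather than per-pixel intervals), assuming access to samples from a stochastic restoration model. `principal_uncertainty_intervals` is an illustrative helper; the calibration procedure that yields formal coverage guarantees is omitted.

```python
import numpy as np

def principal_uncertainty_intervals(samples, k=5, alpha=0.1):
    """Sketch of PCA-based uncertainty regions.

    samples: (n, d) array of flattened posterior samples for one image.
    Returns the sample mean, the top-k principal directions, and, per
    direction, an empirical (1 - alpha) interval of the projections.
    """
    mean = samples.mean(axis=0)
    centered = samples - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dirs = vt[:k]                              # (k, d) principal directions
    proj = centered @ dirs.T                   # (n, k) projection coefficients
    lo = np.quantile(proj, alpha / 2, axis=0)
    hi = np.quantile(proj, 1 - alpha / 2, axis=0)
    return mean, dirs, lo, hi

# toy usage: 200 fake posterior samples of a 32x32 image
samples = np.random.randn(200, 32 * 32)
mean, dirs, lo, hi = principal_uncertainty_intervals(samples)
```

Because the region is described by only k directions and k interval widths, its volume can be far smaller than a per-pixel bounding box while still covering the plausible reconstructions.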
Plug-and-play Image Restoration (IR) has been widely recognized as a flexible and interpretable method for solving various inverse problems by utilizing any off-the-shelf denoiser as the implicit image prior. However, most existing methods focus on discriminative Gaussian denoisers. Although diffusion models have shown impressive performance for high-quality image synthesis, their potential to serve as a generative denoiser prior for plug-and-play IR methods remains to be further explored. While several other attempts have been made to adopt diffusion models for image restoration, they either fail to achieve satisfactory results or typically require an unacceptable number of Neural Function Evaluations (NFEs) during inference. This paper proposes DiffPIR, which integrates the traditional plug-and-play method into the diffusion sampling framework. Compared to plug-and-play IR methods that rely on discriminative Gaussian denoisers, DiffPIR is expected to inherit the generative ability of diffusion models. Experimental results on three representative IR tasks, including super-resolution, image deblurring, and inpainting, demonstrate that DiffPIR achieves state-of-the-art performance on both the FFHQ and ImageNet datasets in terms of reconstruction faithfulness and perceptual quality with no more than 100 NFEs. The source code is available at {\url{this https URL}}
https://arxiv.org/abs/2305.08995
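A schematic of the general plug-and-play-within-diffusion-sampling idea (alternating a diffusion denoising estimate with a data-consistency step), under assumed interfaces: `denoiser` and `degrade` are hypothetical callables, and this is an illustrative sketch rather than DiffPIR's exact update rule.

```python
import torch

def pnp_diffusion_restore(denoiser, degrade, y, alphas_cumprod, steps, rho=0.5):
    """Schematic plug-and-play diffusion sampling (not DiffPIR's exact rule).

    denoiser(x_t, t) -> estimate of the clean image x0 at noise level t.
    degrade(x)       -> differentiable forward operator A(x) (blur, mask, ...).
    y                -> degraded observation (assumed to live in image space).
    Each step alternates the diffusion denoising estimate with a gradient
    step toward data consistency on ||A(x0) - y||^2, then re-noises.
    """
    x_t = torch.randn_like(y)
    for t in reversed(range(steps)):
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)

        x0 = denoiser(x_t, t).detach().requires_grad_(True)
        loss = ((degrade(x0) - y) ** 2).sum()        # data-fidelity term
        grad, = torch.autograd.grad(loss, x0)
        x0 = (x0 - rho * grad).detach()              # data-consistency step

        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * noise   # re-noise to level t-1
    return x_t
```

The appeal of this family of methods is that the data-consistency step is the only task-specific component, so the same pre-trained diffusion prior serves super-resolution, deblurring, and inpainting alike.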
A myriad of algorithms for the automatic analysis of brain MR images is available to support clinicians in their decision-making. For brain tumor patients, the image acquisition time series typically starts with a scan that is already pathological. This poses problems, as many algorithms are designed to analyze healthy brains and provide no guarantees for images featuring lesions. Examples include but are not limited to algorithms for brain anatomy parcellation, tissue segmentation, and brain extraction. To solve this dilemma, we introduce the BraTS 2023 inpainting challenge. Here, the participants' task is to explore inpainting techniques to synthesize healthy brain scans from lesioned ones. The following manuscript contains the task formulation, dataset, and submission procedure. Later it will be updated to summarize the findings of the challenge. The challenge is organized as part of the BraTS 2023 challenge hosted at the MICCAI 2023 conference in Vancouver, Canada.
https://arxiv.org/abs/2305.08992
Recent approaches to the tensor completion problem have often overlooked the nonnegative structure of the data. We consider the problem of learning a nonnegative low-rank tensor, and using duality theory, we propose a novel factorization of such tensors. The factorization decouples the nonnegative constraints from the low-rank constraints. The resulting problem is an optimization problem on manifolds, and we propose a variant of Riemannian conjugate gradients to solve it. We test the proposed algorithm across various tasks such as colour image inpainting, video completion, and hyperspectral image completion. Experimental results show that the proposed method outperforms many state-of-the-art tensor completion algorithms.
https://arxiv.org/abs/2305.07976
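For context, the underlying problem can be written in its generic form below, where $P_\Omega$ projects onto the observed entries of the data tensor $\mathcal{T}$ and "rank" denotes the chosen notion of tensor rank (e.g., multilinear rank). The paper's contribution is a duality-based factorization that decouples the two constraint sets; that specific factorization is not reproduced here.

$$ \min_{\mathcal{W}} \;\; \tfrac{1}{2}\,\big\| P_\Omega(\mathcal{W}) - P_\Omega(\mathcal{T}) \big\|_F^2 \quad \text{subject to} \quad \operatorname{rank}(\mathcal{W}) \le r, \;\; \mathcal{W} \ge 0 . $$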
The objective of the image inpainting task is to fill missing regions of an image in a visually plausible way. Recently, deep-learning-based image inpainting networks have generated outstanding results, and some utilize their models as object removers by masking unwanted objects in an image. However, while trying to better remove objects with their networks, previous works pay less attention to the importance of the input mask. In this paper, we focus on generating the input mask to better remove objects using an off-the-shelf image inpainting network. We propose an automatic mask generator inspired by the explainable AI (XAI) method, whose output can better remove objects than a semantic segmentation mask. The proposed method generates an importance map using randomly sampled input masks and quantitatively estimated scores of the completed images obtained from the random masks. The output mask is selected by a judge module from among the candidate masks generated from the importance map. We design the judge module to quantitatively estimate the quality of the object removal results. In addition, we empirically find that the evaluation methods used in previous works reporting object removal results are not appropriate for estimating the performance of an object remover. Therefore, we propose new evaluation metrics (FID$^*$ and U-IDS$^*$) to properly evaluate the quality of object removers. Experiments confirm that our method shows better performance in removing target-class objects than the masks generated from semantic segmentation maps, and the two proposed metrics make judgments consistent with humans.
https://arxiv.org/abs/2305.07857
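A rough sketch of the importance-map aggregation described above: coarse random masks are upsampled, each removal result is scored, and the masks are averaged weighted by their scores. Here `inpaint` and `score` are placeholder callables standing in for the off-the-shelf inpainting network and the judge-style quality estimate, and the subsequent candidate-mask selection step is omitted.

```python
import numpy as np

def importance_map(inpaint, score, image, n_masks=100, h=8, w=8, p=0.5):
    """Aggregate randomly sampled masks into an importance map.

    inpaint(image, mask) -> image with the masked region filled.
    score(completed)     -> scalar quality estimate of the removal result.
    Coarse random binary masks are upsampled to image size (height/width
    assumed divisible by h, w for brevity), each completion is scored, and
    the masks are averaged weighted by their scores.
    """
    H, W = image.shape[:2]
    acc = np.zeros((H, W))
    for _ in range(n_masks):
        coarse = (np.random.rand(h, w) < p).astype(float)
        mask = np.kron(coarse, np.ones((H // h, W // w)))   # upsample to image size
        acc += score(inpaint(image, mask)) * mask
    return acc / n_masks
```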
Benefiting from powerful convolutional neural networks (CNNs), learning-based image inpainting methods have made significant breakthroughs over the years. However, certain properties of CNNs (e.g., local priors and spatially shared parameters) limit their performance in the face of broken images with diverse and complex forms. Recently, a class of attention-based network architectures, called transformers, has shown strong performance in natural language processing and high-level vision tasks. Compared with CNNs, attention operators are better at long-range modeling and have dynamic weights, but their computational complexity is quadratic in the spatial resolution, making them less suitable for applications involving higher-resolution images, such as image inpainting. In this paper, we design a novel attention mechanism whose complexity is linear in the resolution, derived via a Taylor expansion. Based on this attention, a network called $T$-former is designed for image inpainting. Experiments on several benchmark datasets demonstrate that our proposed method achieves state-of-the-art accuracy while maintaining a relatively low number of parameters and computational complexity. The code can be found at \href{this https URL}{this http URL\_image\_inpainting}
https://arxiv.org/abs/2305.07239
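Below is a minimal sketch of the general idea of linearizing attention with a first-order Taylor expansion, which the abstract alludes to: with L2-normalized queries and keys, exp(q·k) ≈ 1 + q·k, so the N×N attention map is never materialized. This is a generic formulation for illustration, not necessarily the exact operator used in $T$-former.

```python
import torch
import torch.nn.functional as F

def taylor_linear_attention(q, k, v):
    """First-order Taylor approximation of softmax attention, linear in N.

    exp(q.k) ~= 1 + q.k (with q, k L2-normalized), so softmax(QK^T)V can be
    rearranged as sums plus Q(K^T V), avoiding the N x N attention map.
    Shapes: q, k, v are (batch, n_tokens, dim).
    """
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    n = k.shape[1]

    kv = torch.einsum("bnd,bne->bde", k, v)          # K^T V summary, linear in n
    num = v.sum(dim=1, keepdim=True) + torch.einsum("bnd,bde->bne", q, kv)
    den = n + torch.einsum("bnd,bd->bn", q, k.sum(dim=1)).unsqueeze(-1)
    return num / den
```

Because Q(KᵀV) is computed instead of (QKᵀ)V, the cost grows linearly with the number of tokens (pixels), which is what makes attention of this kind practical at inpainting resolutions.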
Dunhuang murals suffer from fading, breakage, surface brittleness, and extensive peeling caused by prolonged environmental erosion. Image inpainting techniques are widely used in digital mural restoration. In general, mural inpainting tasks with large-area damage are challenging for any image inpainting method. In this paper, we design a multi-stage progressive reasoning network (MPR-Net) with global-to-local receptive fields for mural inpainting. This network recursively infers the damage boundary and progressively tightens the regional texture constraints. Moreover, to adaptively fuse the plentiful information present at the various scales of a mural, a multi-scale feature aggregation module (MFA) is designed to strengthen the ability to select significant features. The model operates much like a mural restorer, first inpainting the structure of the damaged mural globally and then adding the local texture details. Our method has been evaluated through both qualitative and quantitative experiments, and the results demonstrate that it outperforms state-of-the-art image inpainting methods.
https://arxiv.org/abs/2305.05902
Diffusion models have emerged as a key pillar of foundation models in visual domains. One of their critical applications is to universally solve different downstream inverse tasks via a single diffusion prior without re-training for each task. Most inverse tasks can be formulated as inferring a posterior distribution over data (e.g., a full image) given a measurement (e.g., a masked image). This is however challenging in diffusion models since the nonlinear and iterative nature of the diffusion process renders the posterior intractable. To cope with this challenge, we propose a variational approach that by design seeks to approximate the true posterior distribution. We show that our approach naturally leads to regularization by denoising diffusion process (RED-Diff) where denoisers at different timesteps concurrently impose different structural constraints over the image. To gauge the contribution of denoisers from different timesteps, we propose a weighting mechanism based on signal-to-noise-ratio (SNR). Our approach provides a new variational perspective for solving inverse problems with diffusion models, allowing us to formulate sampling as stochastic optimization, where one can simply apply off-the-shelf solvers with lightweight iterates. Our experiments for image restoration tasks such as inpainting and superresolution demonstrate the strengths of our method compared with state-of-the-art sampling-based diffusion models.
https://arxiv.org/abs/2305.04391
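A hedged sketch of the "sampling as stochastic optimization" view described above: the restored image is treated as a variational parameter and updated by an off-the-shelf optimizer with a data-fidelity term plus a denoising-diffusion regularizer whose per-timestep weight depends on the SNR. The interfaces (`eps_model`, `degrade`) and the exact weighting are assumptions for illustration, not the paper's verbatim algorithm.

```python
import torch

def red_diff_restore(eps_model, degrade, y, alphas_cumprod, n_iters=200, lam=0.25, lr=0.1):
    """RED-Diff-style restoration sketch: sampling as stochastic optimization.

    eps_model(x_t, t) -> predicted noise at timestep t (pre-trained, frozen).
    degrade(x)        -> differentiable forward operator A(x).
    The restored image mu is a variational parameter: each iteration adds a
    data-fidelity term to a denoising regularizer evaluated at a random
    timestep, weighted by an SNR-dependent factor (assumed form).
    """
    mu = torch.zeros_like(y, requires_grad=True)
    opt = torch.optim.Adam([mu], lr=lr)
    T = len(alphas_cumprod)
    for _ in range(n_iters):
        t = torch.randint(0, T, (1,)).item()
        a = alphas_cumprod[t]
        noise = torch.randn_like(mu)
        x_t = a.sqrt() * mu + (1 - a).sqrt() * noise      # noised copy of the estimate

        with torch.no_grad():
            eps_hat = eps_model(x_t, t)                   # stop-gradient on the denoiser
        weight = lam * ((1 - a) / a).sqrt()               # assumed SNR-based weight
        reg = ((eps_hat - noise) * mu).sum()              # RED-style regularizer
        loss = ((degrade(mu) - y) ** 2).sum() + weight * reg

        opt.zero_grad()
        loss.backward()
        opt.step()
    return mu.detach()
```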
We introduce DreamPaint, a framework to intelligently inpaint any e-commerce product onto any user-provided context image. The context image can be, for example, the user's own image for virtual try-on of clothes from the e-commerce catalog, or the user's room image for virtual try-on of a piece of furniture from the catalog in their room. As opposed to previous augmented-reality (AR)-based virtual try-on methods, DreamPaint neither uses nor requires 3D modeling of either the e-commerce product or the user context. Instead, it directly uses 2D images of the product as available in the product catalog database, and a 2D picture of the context, for example taken with the user's phone camera. The method relies on few-shot fine-tuning of a pre-trained diffusion model with the masked latents (e.g., Masked DreamBooth) of the catalog images per item, whose weights are then loaded into a pre-trained inpainting module that is capable of preserving the characteristics of the context image. DreamPaint preserves both the product image and the context (environment/user) image without requiring text guidance to describe the missing part (product/context). DreamPaint can also intelligently infer the best 3D angle of the product to place at the desired location in the user context, even if that angle was previously unseen in the product's reference 2D images. We compare our results against both text-guided and image-guided inpainting modules and show that DreamPaint yields superior performance in both a subjective human study and quantitative metrics.
https://arxiv.org/abs/2305.01257
This study presents a self-prior-based mesh inpainting framework that requires only an incomplete mesh as input, without the need for any training datasets. Additionally, our method maintains the polygonal mesh format throughout the inpainting process without converting the shape format to an intermediate, such as a voxel grid, a point cloud, or an implicit function, which are typically considered easier for deep neural networks to process. To achieve this goal, we introduce two graph convolutional networks (GCNs): single-resolution GCN (SGCN) and multi-resolution GCN (MGCN), both trained in a self-supervised manner. Our approach refines a watertight mesh obtained from the initial hole filling to generate a completed output mesh. Specifically, we train the GCNs to deform an oversmoothed version of the input mesh into the expected completed shape. To supervise the GCNs for accurate vertex displacements, despite the unknown correct displacements at real holes, we utilize multiple sets of meshes with several connected regions marked as fake holes. The correct displacements are known for vertices in these fake holes, enabling network training with loss functions that assess the accuracy of displacement vectors estimated by the GCNs. We demonstrate that our method outperforms traditional dataset-independent approaches and exhibits greater robustness compared to other deep-learning-based methods for shapes that less frequently appear in shape datasets.
https://arxiv.org/abs/2305.00635
Remote photoplethysmography (rPPG) offers a state-of-the-art, non-contact methodology for estimating human pulse by analyzing facial videos. Despite its potential, rPPG methods can be susceptible to various artifacts, such as noise, occlusions, and other obstructions caused by sunglasses, masks, or even involuntary facial contact, such as individuals inadvertently touching their faces. In this study, we apply image processing transformations to intentionally degrade video quality, mimicking these challenging conditions, and subsequently evaluate the performance of both non-learning and learning-based rPPG methods on the deteriorated data. Our results reveal a significant decrease in accuracy in the presence of these artifacts, prompting us to propose the application of restoration techniques, such as denoising and inpainting, to improve heart-rate estimation outcomes. By addressing these challenging conditions and occlusion artifacts, our approach aims to make rPPG methods more robust and adaptable to real-world situations. To assess the effectiveness of our proposed methods, we undertake comprehensive experiments on three publicly available datasets, encompassing a wide range of scenarios and artifact types. Our findings underscore the potential to construct a robust rPPG system by employing an optimal combination of restoration algorithms and rPPG techniques. Moreover, our study contributes to the advancement of privacy-conscious rPPG methodologies, thereby bolstering the overall utility and impact of this innovative technology in the field of remote heart-rate estimation under realistic and diverse conditions.
https://arxiv.org/abs/2304.14789