Video-based remote photoplethysmography (rPPG) has emerged as a promising technology for non-contact vital sign monitoring, especially under controlled conditions. However, the accurate measurement of vital signs in real-world scenarios faces several challenges, including artifacts induced by video codecs, low-light noise, degradation, low dynamic range, occlusions, and hardware and network constraints. In this article, we systematically and comprehensively investigate these issues, measuring their detrimental effects on the quality of rPPG measurements. Additionally, we propose practical strategies for mitigating these challenges to improve the dependability and resilience of video-based rPPG systems. We detail methods for effective biosignal recovery in the presence of network limitations and present denoising and inpainting techniques aimed at preserving video frame integrity. Through extensive evaluations and direct comparisons, we demonstrate the effectiveness of the proposed approaches in enhancing rPPG measurements under challenging environments, contributing to the development of more reliable and effective remote vital sign monitoring technologies.
https://arxiv.org/abs/2405.01230
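The article above does not tie its analysis to a single extraction algorithm, so as a concrete point of reference the sketch below implements a generic green-channel rPPG baseline: spatial averaging over a face ROI, a Butterworth band-pass over typical heart-rate frequencies, and a spectral peak for the heart rate. It is a minimal illustration of the kind of signal the listed degradations corrupt, not the authors' pipeline; face detection and ROI tracking are assumed to happen upstream.

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch

def rppg_green_baseline(roi_frames, fps, lo_hz=0.7, hi_hz=3.0):
    """Estimate a pulse signal and heart rate from a stack of face-ROI frames.

    roi_frames: float array of shape (T, H, W, 3), RGB in [0, 1].
    fps: video frame rate in Hz.
    Returns (band-passed pulse signal, heart rate in beats per minute).
    """
    # Spatially average the green channel over the ROI for each frame.
    raw = roi_frames[..., 1].reshape(roi_frames.shape[0], -1).mean(axis=1)
    raw = raw - raw.mean()
    # Band-pass to the plausible heart-rate band (roughly 42-180 bpm).
    b, a = butter(N=3, Wn=[lo_hz, hi_hz], btype="bandpass", fs=fps)
    pulse = filtfilt(b, a, raw)
    # Heart rate from the dominant spectral peak.
    freqs, power = welch(pulse, fs=fps, nperseg=min(len(pulse), 256))
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    hr_bpm = 60.0 * freqs[band][np.argmax(power[band])]
    return pulse, hr_bpm

# Synthetic check: a 1.2 Hz (72 bpm) intensity oscillation plus noise.
fps, secs = 30, 10
t = np.arange(fps * secs) / fps
frames = 0.5 + 0.01 * np.sin(2 * np.pi * 1.2 * t)[:, None, None, None]
frames = np.repeat(frames, 3, axis=-1) + 0.005 * np.random.randn(len(t), 8, 8, 3)
_, hr = rppg_green_baseline(frames, fps)
print(f"estimated heart rate: {hr:.1f} bpm")
```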
In Virtual Product Placement (VPP) applications, the discreet integration of specific brand products into images or videos has emerged as a challenging yet important task. This paper introduces a novel three-stage fully automated VPP system. In the first stage, a language-guided image segmentation model identifies optimal regions within images for product inpainting. In the second stage, Stable Diffusion (SD), fine-tuned with a few example product images, is used to inpaint the product into the previously identified candidate regions. The final stage introduces an "Alignment Module", which is designed to effectively sieve out low-quality images. Comprehensive experiments demonstrate that the Alignment Module ensures the presence of the intended product in every generated image and enhances the average quality of images by 35%. The results presented in this paper demonstrate the effectiveness of the proposed VPP system, which holds significant potential for transforming the landscape of virtual advertising and marketing strategies.
https://arxiv.org/abs/2405.01130
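The Alignment Module itself is not described beyond the abstract; as a loose sketch of the same filtering idea (score how well each inpainted candidate actually shows the intended product, then discard low scorers), the snippet below ranks candidates with an off-the-shelf CLIP model. The checkpoint, threshold, and file names are placeholder assumptions, not the paper's settings.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def filter_candidates(image_paths, product_description, threshold=22.0):
    """Rank inpainted candidates by CLIP image-text agreement and drop low scorers.

    A generic stand-in for an alignment/quality filter, not the paper's module;
    `threshold` is an arbitrary placeholder and should be tuned per product.
    """
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[product_description], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(1)  # one score per candidate
    ranked = sorted(zip(image_paths, scores.tolist()), key=lambda x: x[1], reverse=True)
    return [path for path, score in ranked if score >= threshold]

# Hypothetical usage:
# keep = filter_candidates(["cand_0.png", "cand_1.png"], "a can of AcmeCola on a kitchen table")
```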
Denoising diffusion models have recently gained prominence as powerful tools for a variety of image generation and manipulation tasks. Building on this, we propose a novel tool for real-time editing of images that provides users with fine-grained region-targeted supervision in addition to existing prompt-based controls. Our novel editing technique, termed Layered Diffusion Brushes, leverages prompt-guided and region-targeted alteration of intermediate denoising steps, enabling precise modifications while maintaining the integrity and context of the input image. We provide an editor built on Layered Diffusion Brushes that incorporates well-known image editing concepts such as layer masks, visibility toggles, and independent manipulation of layers, regardless of their order. Our system renders a single edit on a 512x512 image within 140 ms using a high-end consumer GPU, enabling real-time feedback and rapid exploration of candidate edits. We validated our method and editing system through a user study involving both natural images (using inversion) and generated images, showcasing its usability and effectiveness compared to existing techniques such as InstructPix2Pix and Stable Diffusion Inpainting for refining images. Our approach demonstrates efficacy across a range of tasks, including object attribute adjustments, error correction, and sequential prompt-based object placement and manipulation, showing its versatility and potential for enhancing creative workflows.
https://arxiv.org/abs/2405.00313
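The editor and timing details are the paper's own; purely as a schematic of the region-targeted mechanism the abstract describes (compositing per-layer latents into the base latent under layer masks at intermediate denoising steps), one could write something like the sketch below, where `denoise_step` is a hypothetical stand-in for a real sampler update and the step at which editing begins is an arbitrary choice.

```python
import torch

def blend_layers(base_latent, layer_latents, layer_masks, denoise_step, timesteps,
                 edit_from_step=10):
    """Schematic region-targeted blending during denoising (not the paper's code).

    base_latent:   (1, C, H, W) latent of the unedited image.
    layer_latents: list of (1, C, H, W) latents, one per edit layer.
    layer_masks:   list of (1, 1, H, W) binary masks at latent resolution.
    denoise_step:  hypothetical callable (latent, t) -> latent at the next step.
    """
    latent = base_latent
    layers = list(layer_latents)
    for i, t in enumerate(timesteps):
        latent = denoise_step(latent, t)                # advance the base image
        layers = [denoise_step(z, t) for z in layers]   # advance each edit layer
        if i >= edit_from_step:
            # From a chosen intermediate step onward, composite each layer's latent
            # into its masked region, leaving the rest of the image untouched.
            for edit, mask in zip(layers, layer_masks):
                latent = mask * edit + (1.0 - mask) * latent
    return latent

# Toy usage with a stand-in denoiser that just damps the latent.
toy_step = lambda z, t: 0.98 * z
out = blend_layers(torch.randn(1, 4, 64, 64), [torch.randn(1, 4, 64, 64)],
                   [torch.zeros(1, 1, 64, 64)], toy_step, timesteps=range(20))
```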
Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle with tasks that require the synthesis of novel content that is not present in other frames. In this paper we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We highlight the advantages of using a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.
https://arxiv.org/abs/2405.00251
3D scene generation has quickly become a challenging new research direction, fueled by consistent improvements of 2D generative diffusion models. Most prior work in this area generates scenes by iteratively stitching newly generated frames with existing geometry. These works often depend on pre-trained monocular depth estimators to lift the generated images into 3D, fusing them with the existing scene representation. These approaches are then often evaluated via a text metric, measuring the similarity between the generated images and a given text prompt. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene. We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene. Second, we introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry, and thus measures the quality of the structure of the scene.
https://arxiv.org/abs/2404.19758
Recent advancements in image inpainting, particularly through diffusion modeling, have yielded promising outcomes. However, when tested in scenarios involving the completion of images based on the foreground objects, current methods that aim to inpaint an image in an end-to-end manner encounter challenges such as "over-imagination", inconsistency between foreground and background, and limited diversity. In response, we introduce Anywhere, a pioneering multi-agent framework designed to address these issues. Anywhere utilizes a sophisticated pipeline framework comprising various agents such as Visual Language Model (VLM), Large Language Model (LLM), and image generation models. This framework consists of three principal components: the prompt generation module, the image generation module, and the outcome analyzer. The prompt generation module conducts a semantic analysis of the input foreground image, leveraging VLM to predict relevant language descriptions and LLM to recommend optimal language prompts. In the image generation module, we employ a text-guided canny-to-image generation model to create a template image based on the edge map of the foreground image and language prompts, and an image refiner to produce the outcome by blending the input foreground and the template image. The outcome analyzer employs VLM to evaluate image content rationality, aesthetic score, and foreground-background relevance, triggering prompt and image regeneration as needed. Extensive experiments demonstrate that our Anywhere framework excels in foreground-conditioned image inpainting, mitigating "over-imagination", resolving foreground-background discrepancies, and enhancing diversity. It successfully elevates foreground-conditioned image inpainting to produce more reliable and diverse results.
https://arxiv.org/abs/2404.18598
Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions without requiring user-provided input masks remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than its inverse process of adding them (Paint), attributed to the utilization of segmentation mask datasets alongside inpainting models that inpaint within these masks. Capitalizing on this realization, by implementing an automated and extensive pipeline, we curate a filtered large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to inverse the inpainting process, effectively adding objects into images. Unlike other editing datasets, ours features natural target images instead of synthetic ones; moreover, it maintains consistency between source and target by construction. Additionally, we utilize a large Vision-Language Model to provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse, natural-language instructions. We show that the trained model surpasses existing ones both qualitatively and quantitatively, and release the large-scale dataset alongside the trained models for the community.
https://arxiv.org/abs/2404.18212
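As a rough sketch of the pair-construction step described above (erase an object under its segmentation mask with an off-the-shelf inpainting model to obtain a target/source pair), the snippet below uses the public Stable Diffusion inpainting pipeline from diffusers. The checkpoint, prompt, paths, and resolution are illustrative assumptions, and the paper's filtering, VLM captioning, and LLM instruction-generation stages are omitted.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def make_removal_pair(image_path, mask_path):
    """Return an (original, object-removed) image pair for one training example.

    `mask_path` is a binary mask of the object to erase (white = remove);
    the paths and the background prompt are illustrative placeholders.
    """
    image = Image.open(image_path).convert("RGB").resize((512, 512))
    mask = Image.open(mask_path).convert("L").resize((512, 512))
    removed = pipe(prompt="empty scene, clean background",
                   image=image, mask_image=mask,
                   num_inference_steps=30).images[0]
    return image, removed

# pairs = [make_removal_pair(f"img_{i}.png", f"mask_{i}.png") for i in range(1000)]
```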
Existing image inpainting methods have achieved remarkable accomplishments in generating visually appealing results, often accompanied by a trend toward creating more intricate structural textures. However, while these models excel at creating more realistic image content, they often leave noticeable traces of tampering, posing a significant threat to security. In this work, we take the anti-forensic capabilities into consideration, firstly proposing an end-to-end training framework for anti-forensic image inpainting named SafePaint. Specifically, we innovatively formulated image inpainting as two major tasks: semantically plausible content completion and region-wise optimization. The former is similar to current inpainting methods that aim to restore the missing regions of corrupted images. The latter, through domain adaptation, endeavors to reconcile the discrepancies between the inpainted region and the unaltered area to achieve anti-forensic goals. Through comprehensive theoretical analysis, we validate the effectiveness of domain adaptation for anti-forensic performance. Furthermore, we meticulously crafted a region-wise separated attention (RWSA) module, which not only aligns with our objective of anti-forensics but also enhances the performance of the model. Extensive qualitative and quantitative evaluations show our approach achieves comparable results to existing image inpainting methods while offering anti-forensic capabilities not available in other methods.
https://arxiv.org/abs/2404.18136
We introduce ObjectAdd, a training-free diffusion modification method to add user-expected objects into a user-specified area. The motivation for ObjectAdd stems from two observations: first, describing everything in one prompt can be difficult, and second, users often need to add objects into the generated image. To accommodate real-world use, our ObjectAdd maintains accurate image consistency after adding objects through technical innovations in: (1) embedding-level concatenation to ensure text embeddings coalesce correctly; (2) object-driven layout control with latent and attention injection to ensure objects occupy the user-specified area; (3) prompted image inpainting in an attention-refocusing and object-expansion fashion to ensure the rest of the image stays the same. Given a text-prompted image, our ObjectAdd allows users to specify a box and an object, and achieves: (1) adding the object inside the box area; (2) preserving the exact content outside the box area; (3) flawless fusion between the two areas.
https://arxiv.org/abs/2404.17230
Video Wire Inpainting (VWI) is a prominent application in video inpainting, aimed at flawlessly removing wires in films or TV series, offering significant time and labor savings compared to manual frame-by-frame removal. However, wire removal poses greater challenges due to the wires being longer and slimmer than objects typically targeted in general video inpainting tasks, and often intersecting with people and background objects irregularly, which adds complexity to the inpainting process. Recognizing the limitations posed by existing video wire datasets, which are characterized by their small size, poor quality, and limited variety of scenes, we introduce a new VWI dataset with a novel mask generation strategy, namely Wire Removal Video Dataset 2 (WRV2) and Pseudo Wire-Shaped (PWS) Masks. WRV2 dataset comprises over 4,000 videos with an average length of 80 frames, designed to facilitate the development and efficacy of inpainting models. Building upon this, our research proposes the Redundancy-Aware Transformer (Raformer) method that addresses the unique challenges of wire removal in video inpainting. Unlike conventional approaches that indiscriminately process all frame patches, Raformer employs a novel strategy to selectively bypass redundant parts, such as static background segments devoid of valuable information for inpainting. At the core of Raformer is the Redundancy-Aware Attention (RAA) module, which isolates and accentuates essential content through a coarse-grained, window-based attention mechanism. This is complemented by a Soft Feature Alignment (SFA) module, which refines these features and achieves end-to-end feature alignment. Extensive experiments on both the traditional video inpainting datasets and our proposed WRV2 dataset demonstrate that Raformer outperforms other state-of-the-art methods.
https://arxiv.org/abs/2404.15802
The scarcity of green spaces in urban environments is a critical challenge, with multiple adverse effects on the health and well-being of citizens. Small-scale interventions, e.g. pocket parks, are a viable solution, but come with multiple constraints involving the design and implementation over a specific area. In this study, we harness the capabilities of generative AI for multi-scale intervention planning, focusing on nature-based solutions (NBS). By leveraging image-to-image and image inpainting algorithms, we propose a methodology to address the green space deficit in urban areas. Focusing on two alleys in Thessaloniki, where greenery is lacking, we demonstrate the efficacy of our approach in visualizing NBS interventions. Our findings underscore the transformative potential of emerging technologies in shaping the future of urban intervention planning processes.
https://arxiv.org/abs/2404.15492
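Since the study builds on standard image-to-image and inpainting models rather than a new architecture, a minimal sketch of the image-to-image side, rendering a greener variant of a street photo from a nature-based-solution prompt, could look like the following; the checkpoint, prompt, and strength value are assumptions rather than the authors' settings.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

alley = Image.open("alley_photo.jpg").convert("RGB").resize((512, 512))  # placeholder path
prompt = ("narrow urban alley converted into a pocket park, planters, climbing ivy, "
          "small trees, permeable paving, photorealistic")

# `strength` trades off fidelity to the original alley against the amount of added greenery.
visualization = pipe(prompt=prompt, image=alley, strength=0.55,
                     guidance_scale=7.5).images[0]
visualization.save("alley_nbs_visualization.png")
```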
Learning-based image stitching techniques typically involve three distinct stages: registration, fusion, and rectangling. These stages are often performed sequentially, each trained independently, leading to potential cascading error propagation and complex parameter tuning challenges. In rethinking the mathematical modeling of the fusion and rectangling stages, we discovered that these processes can be effectively combined into a single, variety-intensity inpainting problem. Therefore, we propose the Simple and Robust Stitcher (SRStitcher), an efficient training-free image stitching method that merges the fusion and rectangling stages into a unified model. By employing the weighted mask and large-scale generative model, SRStitcher can solve the fusion and rectangling problems in a single inference, without additional training or fine-tuning of other models. Our method not only simplifies the stitching pipeline but also enhances fault tolerance towards misregistration errors. Extensive experiments demonstrate that SRStitcher outperforms state-of-the-art (SOTA) methods in both quantitative assessments and qualitative evaluations. The code is released at this https URL
https://arxiv.org/abs/2404.14951
Hyperspectral imaging (HSI) is a key technology for earth observation, surveillance, medical imaging and diagnostics, astronomy and space exploration. The conventional technology for HSI in remote sensing applications is based on the push-broom scanning approach in which the camera records the spectral image of a stripe of the scene at a time, while the image is generated by the aggregation of measurements through time. In real-world airborne and spaceborne HSI instruments, some empty stripes would appear at certain locations, because platforms do not always maintain a constant programmed attitude, or have access to accurate digital elevation maps (DEM), and the travelling track is not necessarily aligned with the hyperspectral cameras at all times. This makes the enhancement of the acquired HS images from incomplete or corrupted observations an essential task. We introduce a novel HSI inpainting algorithm here, called Hyperspectral Equivariant Imaging (Hyper-EI). Hyper-EI is a self-supervised learning-based method which does not require training on extensive datasets or access to a pre-trained model. Experimental results show that the proposed method achieves state-of-the-art inpainting performance compared to the existing methods.
https://arxiv.org/abs/2404.13159
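Hyper-EI's exact losses are not given in the abstract; as one plausible reading of a self-supervised, equivariant-imaging-style training step for stripe inpainting (a measurement-consistency term on observed pixels plus an equivariance term under random horizontal shifts), the sketch below uses a toy network and toy data. It illustrates the general equivariant-imaging recipe, not the authors' algorithm.

```python
import torch
import torch.nn as nn

class TinyInpainter(nn.Module):
    """Toy reconstruction network standing in for the actual model."""
    def __init__(self, bands=8):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(bands, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, bands, 3, padding=1))

    def forward(self, x):
        return self.net(x)

def ei_training_step(model, y, mask, optimizer, alpha=1.0):
    """One self-supervised step; y is the corrupted cube (B, bands, H, W) and
    mask is 1 on observed pixels, 0 on the empty stripes."""
    x1 = model(y)                                   # reconstruct from the measurement
    shift = int(torch.randint(1, y.shape[-1], (1,)))
    x2 = torch.roll(x1, shifts=shift, dims=-1)      # random group action: horizontal shift
    x3 = model(mask * x2)                           # re-measure and reconstruct
    loss_mc = ((mask * x1 - y) ** 2).mean()         # measurement consistency
    loss_eq = ((x3 - x2) ** 2).mean()               # equivariance of the reconstruction
    loss = loss_mc + alpha * loss_eq
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Toy usage: an 8-band cube with two missing stripes.
model = TinyInpainter()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mask = torch.ones(1, 8, 32, 32)
mask[..., 10] = 0
mask[..., 20] = 0
y = mask * torch.rand(1, 8, 32, 32)
print(ei_training_step(model, y, mask, opt))
```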
Deep learning is dramatically transforming the field of medical imaging and radiology, enabling the identification of pathologies in medical images, including computed tomography (CT) and X-ray scans. However, the performance of deep learning models, particularly in segmentation tasks, is often limited by the need for extensive annotated datasets. To address this challenge, the capabilities of weakly supervised semantic segmentation are explored through the lens of Explainable AI and the generation of counterfactual explanations. The scope of this research is the development of a novel counterfactual inpainting approach (COIN) that flips the predicted classification label from abnormal to normal by using a generative model. For instance, if the classifier deems an input medical image X as abnormal, indicating the presence of a pathology, the generative model aims to inpaint the abnormal region, thus reversing the classifier's original prediction label. The approach enables us to produce precise segmentations for pathologies without depending on pre-existing segmentation masks. Crucially, image-level labels are utilized, which are substantially easier to acquire than creating detailed segmentation masks. The effectiveness of the method is demonstrated by segmenting synthetic targets and actual kidney tumors from CT images acquired from Tartu University Hospital in Estonia. The findings indicate that COIN greatly surpasses established attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an alternative counterfactual explanation method introduced by Singla et al. This evidence suggests that COIN is a promising approach for semantic segmentation of tumors in CT images, and presents a step forward in making deep learning applications more accessible and effective in healthcare, where annotated data is scarce.
https://arxiv.org/abs/2404.12832
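The generator and classifier are the paper's own; the snippet below is a schematic of only the readout step the abstract implies: once a counterfactual "normal" version of the image exists, the pathology mask is wherever the image had to change. The `classifier` and `counterfactual_inpaint` callables are hypothetical stand-ins for the trained models.

```python
import numpy as np

def counterfactual_segmentation(image, classifier, counterfactual_inpaint,
                                diff_threshold=0.1):
    """Derive a pathology mask from a counterfactual edit (schematic only).

    classifier(image) -> probability of the 'abnormal' class in [0, 1].
    counterfactual_inpaint(image) -> an edited image the classifier should call normal.
    Both callables are hypothetical stand-ins for the trained models.
    """
    if classifier(image) < 0.5:
        return np.zeros(image.shape[:2], dtype=bool)   # already normal: empty mask
    normal_version = counterfactual_inpaint(image)
    # The pathology is wherever the counterfactual had to change the image.
    change = np.abs(image.astype(np.float32) - normal_version.astype(np.float32))
    if change.ndim == 3:
        change = change.mean(axis=-1)
    return change > diff_threshold * change.max()

# Toy usage with stand-ins: a bright square "lesion" that the inpainter erases.
img = np.zeros((64, 64))
img[20:30, 20:30] = 1.0
fake_classifier = lambda x: float(x.max() > 0.5)
fake_inpaint = lambda x: np.zeros_like(x)
mask = counterfactual_segmentation(img, fake_classifier, fake_inpaint)
print(mask.sum(), "pixels flagged")   # 100
```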
We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.
https://arxiv.org/abs/2404.12382
In this work, we study the task of sketch-guided image inpainting. Unlike the well-explored natural language-guided image inpainting, which excels in capturing semantic details, the relatively less-studied sketch-guided inpainting offers greater user control in specifying the object's shape and pose to be inpainted. As one of the early solutions to this task, we introduce a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP corrupts the masked regions of the image and the backward pass reconstructs these masked regions conditioned on hand-drawn sketches using our proposed sketch-guided bi-directional transformer. The proposed novel transformer module accepts two inputs -- the image containing the masked region to be inpainted and the query sketch to model the reverse diffusion process. This strategy effectively addresses the domain gap between sketches and natural images, thereby enhancing the quality of inpainting results. In the absence of a large-scale dataset specific to this task, we synthesize a dataset from MS-COCO to train and extensively evaluate our proposed framework against various competent approaches in the literature. The qualitative and quantitative results and user studies establish that the proposed method inpaints realistic objects that fit the context in terms of the visual appearance of the provided sketch. To aid further research, we have made our code publicly available at this https URL.
https://arxiv.org/abs/2404.11949
3D Gaussians have recently emerged as an efficient representation for novel view synthesis. This work studies their editability with a particular focus on the inpainting task, which aims to supplement an incomplete set of 3D Gaussians with additional points for visually harmonious rendering. Compared to 2D inpainting, the crux of inpainting 3D Gaussians is to figure out the rendering-relevant properties of the introduced points, whose optimization largely benefits from their initial 3D positions. To this end, we propose to guide the point initialization with an image-conditioned depth completion model, which learns to directly restore the depth map based on the observed image. Such a design allows our model to fill in depth values at an aligned scale with the original depth, and also to harness strong generalizability from a large-scale diffusion prior. Thanks to the more accurate depth completion, our approach, dubbed InFusion, surpasses existing alternatives with sufficiently better fidelity and efficiency under various complex scenarios. We further demonstrate the effectiveness of InFusion with several practical applications, such as inpainting with user-specific texture or with novel object insertion.
https://arxiv.org/abs/2404.11613
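The depth-completion network itself is not sketched here; the snippet below only illustrates the downstream step the abstract relies on, unprojecting a completed depth map through the camera intrinsics to obtain initial 3D positions (and colors) for new points inside the inpainting mask. The pinhole-camera math is standard; the variable names and toy data are ours.

```python
import numpy as np

def unproject_masked_depth(depth, rgb, mask, K, cam_to_world):
    """Lift masked pixels of a completed depth map to world-space points.

    depth: (H, W) completed depth in meters; rgb: (H, W, 3); mask: (H, W) bool,
    True where new points should be initialized.
    K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) camera pose.
    Returns (points (N, 3), colors (N, 3)).
    """
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]          # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]          # Y = (v - cy) * Z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, rgb[v, u]

# Toy usage with a synthetic plane at 2 m and a centered inpainting mask.
H, W = 48, 64
K = np.array([[60.0, 0, W / 2], [0, 60.0, H / 2], [0, 0, 1]])
depth = np.full((H, W), 2.0)
rgb = np.random.rand(H, W, 3)
mask = np.zeros((H, W), bool)
mask[16:32, 20:44] = True
pts, cols = unproject_masked_depth(depth, rgb, mask, K, np.eye(4))
print(pts.shape, cols.shape)   # (384, 3) (384, 3)
```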
Object removal refers to the process of erasing designated objects from an image while preserving the overall appearance, and it is one area where image inpainting is widely used in real-world applications. The performance of an object remover is quantitatively evaluated by measuring the quality of object removal results, similar to how the performance of an image inpainter is gauged. Current works reporting quantitative performance evaluations utilize original images as references. In this letter, to validate that the current evaluation methods cannot properly evaluate the performance of an object remover, we create a dataset with object removal ground truth and compare the evaluations made by the current methods using original images to those utilizing object removal ground truth images. The disparities between the two evaluation sets validate that the current methods are not suitable for measuring the performance of an object remover. Additionally, we propose new evaluation methods tailored to gauge the performance of an object remover. The proposed methods evaluate the performance through class-wise object removal results and utilize images without the target class objects as a comparison set. We confirm that the proposed methods can make judgments consistent with human evaluators in the COCO dataset, and that they can produce measurements aligning with those using object removal ground truth in the self-acquired dataset.
https://arxiv.org/abs/2404.11104
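One concrete way to instantiate the proposed comparison (measure a distributional distance between class-wise removal outputs and real images that never contained the target class) is sketched below with an off-the-shelf FID implementation. FID is used here as a plausible stand-in for such a distance, not necessarily the metric the letter adopts, and the random tensors only show the call pattern.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def classwise_removal_score(removal_results, class_free_references):
    """Distance between object-removal outputs for one class and real images
    that never contained that class (lower is better).

    Both arguments are uint8 tensors of shape (N, 3, H, W); FID is one plausible
    distributional distance, not necessarily the paper's metric.
    """
    fid = FrechetInceptionDistance(feature=64)
    fid.update(class_free_references, real=True)
    fid.update(removal_results, real=False)
    return float(fid.compute())

# Toy usage with random images (purely to show the call pattern).
fake_refs = torch.randint(0, 256, (16, 3, 128, 128), dtype=torch.uint8)
fake_outputs = torch.randint(0, 256, (16, 3, 128, 128), dtype=torch.uint8)
print(classwise_removal_score(fake_outputs, fake_refs))
```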
The task of video inpainting detection is to expose the pixel-level inpainted regions within a video sequence. Existing methods usually focus on leveraging spatial and temporal inconsistencies. However, these methods typically employ fixed operations to combine spatial and temporal clues, limiting their applicability in different scenarios. In this paper, we introduce a novel Multilateral Temporal-view Pyramid Transformer ({\em MumPy}) that collaborates spatial-temporal clues flexibly. Our method utilizes a newly designed multilateral temporal-view encoder to extract various collaborations of spatial-temporal clues and introduces a deformable window-based temporal-view interaction module to enhance the diversity of these collaborations. Subsequently, we develop a multi-pyramid decoder to aggregate the various types of features and generate detection maps. By adjusting the contribution strength of spatial and temporal clues, our method can effectively identify inpainted regions. We validate our method on existing datasets and also introduce a new challenging and large-scale Video Inpainting dataset based on the YouTube-VOS dataset, which employs several more recent inpainting methods. The results demonstrate the superiority of our method in both in-domain and cross-domain evaluation scenarios.
https://arxiv.org/abs/2404.11054
Neural reconstruction approaches are rapidly emerging as the preferred representation for 3D scenes, but their limited editability is still posing a challenge. In this work, we propose an approach for 3D scene inpainting -- the task of coherently replacing parts of the reconstructed scene with desired content. Scene inpainting is an inherently ill-posed task as there exist many solutions that plausibly replace the missing content. A good inpainting method should therefore not only enable high-quality synthesis but also a high degree of control. Based on this observation, we focus on enabling explicit control over the inpainted content and leverage a reference image as an efficient means to achieve this goal. Specifically, we introduce RefFusion, a novel 3D inpainting method based on a multi-scale personalization of an image inpainting diffusion model to the given reference view. The personalization effectively adapts the prior distribution to the target scene, resulting in a lower variance of score distillation objective and hence significantly sharper details. Our framework achieves state-of-the-art results for object removal while maintaining high controllability. We further demonstrate the generality of our formulation on other downstream tasks such as object insertion, scene outpainting, and sparse view reconstruction.
https://arxiv.org/abs/2404.10765
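The multi-scale personalization is the paper's contribution and is not reproduced here; the snippet below only sketches the score-distillation objective that the abstract says this personalization stabilizes: a single generic SDS-style gradient in which a (personalized) diffusion model scores the rendered view. The `unet` and `scheduler` arguments are assumed to be standard diffusers objects, and this is a textbook SDS step rather than RefFusion's exact loss.

```python
import torch

def sds_step(rendered_latents, text_embeddings, unet, scheduler,
             t_range=(200, 800), guidance_scale=7.5):
    """One generic score-distillation gradient on rendered latents (B, 4, h, w).

    `unet` and `scheduler` are assumed to be a diffusers UNet2DConditionModel and a
    DDPM-style scheduler; `text_embeddings` stacks [unconditional, conditional] rows.
    Returns a gradient tensor to backpropagate into the 3D scene parameters.
    """
    b = rendered_latents.shape[0]
    t = torch.randint(*t_range, (b,), device=rendered_latents.device)
    noise = torch.randn_like(rendered_latents)
    noisy = scheduler.add_noise(rendered_latents, noise, t)
    with torch.no_grad():
        # Classifier-free guidance: one pass covering the unconditional and conditional branches.
        eps = unet(torch.cat([noisy] * 2), torch.cat([t] * 2),
                   encoder_hidden_states=text_embeddings).sample
        eps_uncond, eps_cond = eps.chunk(2)
        eps_hat = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    # SDS gradient: nudge the rendered view toward what the diffusion prior expects.
    return eps_hat - noise
```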