Paper Reading AI Learner

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

2024-04-28 15:07:53
Navve Wasserman, Noam Rotstein, Roy Ganz, Ron Kimmel

Abstract

Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions, without requiring user-provided input masks, remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than the inverse process of adding them (Paint), owing to the availability of segmentation mask datasets together with inpainting models that fill in within these masks. Capitalizing on this realization, we implement an automated and extensive pipeline to curate a filtered, large-scale dataset of images paired with their object-removed versions. Using these pairs, we train a diffusion model to invert the inpainting process, effectively adding objects into images. Unlike other editing datasets, ours features natural target images instead of synthetic ones; moreover, it maintains consistency between source and target by construction. Additionally, we utilize a large Vision-Language Model to provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse natural-language instructions. We show that the trained model surpasses existing ones both qualitatively and quantitatively, and we release the large-scale dataset alongside the trained models for the community.
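The data pipeline the abstract describes can be sketched in a few lines: for each image with a segmentation mask, an off-the-shelf inpainting model erases the masked object, producing an (object-removed source, original target) training pair, and the object description is turned into an "add ..." instruction. This is a minimal illustrative sketch, not the authors' code; `inpaint` and `describe` are hypothetical stand-ins for the inpainting model and the VLM/LLM captioning step.

```python
# Sketch of the Paint-by-Inpaint dataset construction (hypothetical stand-ins,
# not the paper's actual implementation).
from dataclasses import dataclass
from typing import List


@dataclass
class EditPair:
    source: object       # object-removed image (inpainted) — model input
    target: object       # original natural image — model output
    instruction: str     # e.g. "add a red ball"


def build_pairs(samples, inpaint, describe) -> List[EditPair]:
    """Turn (image, mask, label) triples into instruction-based editing pairs.

    inpaint(image, mask): removes the object inside the segmentation mask.
    describe(label): VLM/LLM stand-in that phrases the object description.
    """
    pairs = []
    for image, mask, label in samples:
        removed = inpaint(image, mask)            # simpler direction: remove
        instruction = f"add {describe(label)}"    # inverse direction: add
        pairs.append(EditPair(source=removed, target=image, instruction=instruction))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end on placeholder data.
    samples = [("img0", "mask0", "a red ball")]
    pairs = build_pairs(samples,
                        inpaint=lambda img, m: f"{img}-inpainted",
                        describe=lambda lbl: lbl)
    print(pairs[0].instruction)  # "add a red ball"
```

A diffusion model trained on these pairs then learns the source-to-target mapping, i.e. the inverse of inpainting, with the target guaranteed to be a natural image and consistent with the source by construction.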

URL

https://arxiv.org/abs/2404.18212

PDF

https://arxiv.org/pdf/2404.18212.pdf
