Abstract
Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions, without requiring user-provided input masks, remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than the inverse process of adding them (Paint): removal can rely on existing segmentation-mask datasets together with inpainting models that fill in those masks. Capitalizing on this insight, we implement an automated, large-scale pipeline to curate a filtered dataset of image pairs, each consisting of an image and its object-removed counterpart. Using these pairs, we train a diffusion model to invert the inpainting process, effectively adding objects to images. Unlike other editing datasets, ours features natural target images rather than synthetic ones; moreover, it maintains consistency between source and target by construction. Additionally, we use a large Vision-Language Model to produce detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse natural-language instructions. We show that the trained model surpasses existing ones both qualitatively and quantitatively, and we release the large-scale dataset alongside the trained models for the community.
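The curation pipeline the abstract describes (segment an object, inpaint it away, caption it with a VLM, rewrite the caption as an instruction, and pair the object-removed image with the original) can be sketched as below. This is a minimal illustration only: every function here is a hypothetical placeholder standing in for a real segmentation dataset, inpainting model, VLM, and LLM, not the authors' actual implementation.

```python
# Hedged sketch of the dataset-curation pipeline from the abstract.
# Each (source, target, instruction) pair trains a model to *add* objects:
# the source is the object-removed image, the target is the natural original.

def segment_objects(image):
    # Placeholder: a real pipeline would draw masks from a
    # segmentation-mask dataset rather than from the image dict itself.
    return image.get("masks", [])

def inpaint(image, mask):
    # Placeholder: a real pipeline would run an inpainting diffusion
    # model to remove the masked object from the image.
    return {"pixels": image["pixels"], "removed": mask["label"]}

def describe_object(image, mask):
    # Placeholder for a Vision-Language Model description of the
    # removed object (here we just echo the mask label).
    return f"a {mask['label']}"

def to_instruction(description):
    # Placeholder for an LLM converting the description into a
    # natural-language editing instruction.
    return f"Add {description} to the image."

def build_pairs(images):
    """Construct (source, target, instruction) training triplets."""
    pairs = []
    for image in images:
        for mask in segment_objects(image):
            source = inpaint(image, mask)   # object removed (easy direction)
            target = image                  # natural, unedited image
            instruction = to_instruction(describe_object(image, mask))
            pairs.append({"source": source,
                          "target": target,
                          "instruction": instruction})
    return pairs

# Toy usage with dummy data: one image containing two segmented objects
# yields two training pairs, each with its own "add" instruction.
images = [{"pixels": "...", "masks": [{"label": "dog"}, {"label": "ball"}]}]
pairs = build_pairs(images)
```

Because the target is always the untouched original photo, consistency between source and target holds by construction, which is the property the abstract highlights over synthetically generated editing datasets.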
URL
https://arxiv.org/abs/2404.18212