Abstract
Prior studies have made significant progress in image inpainting guided by either text or a subject image. However, research on editing with their combined guidance is still in its early stages. To tackle this challenge, we present LAR-Gen, a novel approach to image inpainting that enables seamless inpainting of masked scene images, incorporating both textual prompts and specified subjects. Our approach adopts a coarse-to-fine manner to ensure subject identity preservation and local semantic coherence. The process involves (i) Locate: concatenating the noise with the masked scene image to achieve precise regional editing, (ii) Assign: employing a decoupled cross-attention mechanism to accommodate multi-modal guidance, and (iii) Refine: using a novel RefineNet to supplement subject details. Additionally, to address the issue of scarce training data, we introduce a novel data construction pipeline. This pipeline extracts a large number of paired local text prompts and corresponding visual instances from a vast image dataset, leveraging publicly available large models. Extensive experiments and varied application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency. The project page can be found at \url{this https URL}.
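The Locate and Assign steps described above can be sketched in minimal form. The following NumPy snippet is an illustrative approximation only, not the paper's implementation: the function names, the channel-wise concatenation layout, and the `subject_scale` weighting are assumptions; the actual method operates on learned projections inside a diffusion U-Net.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over pre-projected features.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decoupled_cross_attention(z, text_kv, subject_kv, subject_scale=1.0):
    # "Assign" sketch: attend to text and subject features in two
    # separate cross-attention passes, then sum the results.
    # (subject_scale is a hypothetical blending weight.)
    out_text = attention(z, *text_kv)
    out_subj = attention(z, *subject_kv)
    return out_text + subject_scale * out_subj

def locate_input(noise, masked_scene, mask):
    # "Locate" sketch: concatenate the noise with the masked scene
    # (and mask) along the channel axis so denoising is conditioned
    # on the unmasked region.
    return np.concatenate([noise, masked_scene, mask], axis=0)
```

Setting `subject_scale=0` recovers plain text-guided cross-attention, which is one way to see how the decoupled design accommodates both guidance signals independently.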
URL
https://arxiv.org/abs/2403.19534