Abstract
Recent advancements in Text-to-Image (T2I) generative models have yielded impressive results in generating high-fidelity images from text prompts. However, there is growing interest in exploring the potential of these models for more diverse, reference-based image manipulation tasks that require spatial understanding and visual context. Previous approaches have achieved this by incorporating additional control modules or fine-tuning the generative models specifically for each task until convergence. In this paper, we propose a different perspective. We conjecture that current large-scale T2I generative models already possess the capability to perform these tasks, but that this capability is not fully activated within the standard generation process. To unlock it, we introduce a unified Prompt-Guided In-Context inpainting (PGIC) framework, which leverages large-scale T2I models to re-formulate and solve reference-guided image manipulations. In the PGIC framework, the reference and the masked target are stitched together as a new input to the generative model, so that filling the masked regions produces the final results. Furthermore, we demonstrate that the self-attention modules in T2I models are well-suited for establishing spatial correlations and efficiently addressing challenging reference-guided manipulations. These large T2I models can be effectively driven by task-specific prompts with minimal training cost, or even with frozen backbones. We evaluate the effectiveness of the proposed PGIC framework across various tasks, including reference-guided image inpainting, faithful inpainting, outpainting, local super-resolution, and novel view synthesis. Our results show that PGIC achieves significantly better performance while requiring less computation compared to other fine-tuning based approaches.
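The core re-formulation described above — stitching the reference next to the masked target so an inpainting model fills the masked region in context — can be sketched with plain array operations. This is a minimal, hypothetical illustration (the function name, layout, and mask convention are assumptions, not the paper's actual implementation), showing only how the combined input and mask might be assembled before being passed to a T2I inpainting model:

```python
import numpy as np

def stitch_in_context(reference, target, target_mask):
    """Sketch of the PGIC-style input construction (illustrative only):
    place the reference and the masked target side by side, and build a
    combined mask that exposes the reference while marking the target's
    masked region to be filled by the inpainting model."""
    assert reference.shape == target.shape, "reference and target must match"
    # Stitch the two images horizontally: [ reference | target ].
    stitched = np.concatenate([reference, target], axis=1)
    # The reference half is fully visible (mask = 0); the target half
    # keeps its original mask (1 = region for the model to fill).
    ref_mask = np.zeros_like(target_mask)
    stitched_mask = np.concatenate([ref_mask, target_mask], axis=1)
    return stitched, stitched_mask
```

Under this construction, the model's self-attention can relate pixels in the masked target region to the visible reference half of the same canvas, which is the intuition the abstract attributes to in-context inpainting.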
Abstract (translated)
近年来,文本到图像(T2I)生成模型的发展取得了重大进展,能够根据文本提示生成高保真图像,效果令人印象深刻。然而,人们越来越有兴趣探索这些模型在更多样化的、需要空间理解和视觉上下文的参考引导图像操纵任务中的潜力。先前的方法通过添加额外的控制模块,或针对每个任务专门微调生成模型直至收敛来实现这一点。本文提出了一个不同的视角:我们推测,当前大规模的T2I生成模型已经具备执行这些任务的能力,只是在标准生成过程中未被完全激活。为了解锁这些能力,我们提出了一个统一的提示引导上下文修复(PGIC)框架,利用大规模T2I模型重新表述并解决参考引导的图像操纵任务。在PGIC框架中,参考图像和掩膜目标被拼接在一起,作为生成模型的新输入,通过填充掩膜区域来生成最终结果。此外,我们证明,T2I模型中的自注意力模块非常适合建立空间对应关系,并高效处理具有挑战性的参考引导操纵任务。这些大型T2I模型只需极少的训练成本、甚至在冻结主干网络的情况下,即可由任务特定的提示有效驱动。我们在多种任务上评估了所提出的PGIC框架的有效性,包括参考引导图像修复、忠实修复、图像外扩(outpainting)、局部超分辨率和新视角合成。结果表明,与其他基于微调的方法相比,PGIC在计算量更少的情况下取得了显著更好的性能。
URL
https://arxiv.org/abs/2305.11577