Paper Reading AI Learner

A Unified Prompt-Guided In-Context Inpainting Framework for Reference-based Image Manipulations

2023-05-19 10:29:42
Chenjie Cao, Qiaole Dong, Yikai Wang, Yunuo Cai, Yanwei Fu

Abstract

Recent advancements in Text-to-Image (T2I) generative models have yielded impressive results in generating high-fidelity images from consistent text prompts. However, there is growing interest in exploring the potential of these models for more diverse reference-based image manipulation tasks that require spatial understanding and visual context. Previous approaches have achieved this by incorporating additional control modules or by fine-tuning the generative models specifically for each task until convergence. In this paper, we take a different perspective. We conjecture that current large-scale T2I generative models already possess the capability to perform these tasks, but that this capability is not fully activated within the standard generation process. To unlock it, we introduce a unified Prompt-Guided In-Context inpainting (PGIC) framework, which leverages large-scale T2I models to re-formulate and solve reference-guided image manipulations. In the PGIC framework, the reference and the masked target are stitched together as a new input to the generative model, so that filling the masked regions produces the final results. Furthermore, we demonstrate that the self-attention modules in T2I models are well suited for establishing spatial correlations and can efficiently address challenging reference-guided manipulations. These large T2I models can be effectively driven by task-specific prompts with minimal training cost, or even with frozen backbones. We evaluate the effectiveness of the proposed PGIC framework across various tasks, including reference-guided image inpainting, faithful inpainting, outpainting, local super-resolution, and novel view synthesis. Our results show that PGIC achieves significantly better performance while requiring less computation than other fine-tuning based approaches.
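The core re-formulation described above — stitching the reference and the masked target into one canvas so that standard inpainting produces the manipulation result — can be illustrated with a minimal NumPy sketch. This is an illustrative assumption about the layout (side-by-side stitching with the mask restricted to the target half); the paper's exact canvas arrangement may differ.

```python
import numpy as np

def stitch_in_context(reference: np.ndarray, target: np.ndarray,
                      target_mask: np.ndarray):
    """Stitch a reference image and a masked target side by side.

    The stitched canvas becomes the single input to a T2I inpainting
    model; only the target half is masked, so filling the hole amounts
    to generating the target conditioned on the visible reference.
    Illustrative sketch only -- not the paper's exact implementation.
    """
    assert reference.shape == target.shape, "reference/target must match"
    # Place reference (left) and target (right) on one canvas.
    canvas = np.concatenate([reference, target], axis=1)
    # Build a canvas-sized mask: reference half visible, target half masked.
    h, w = target_mask.shape
    full_mask = np.zeros((h, 2 * w), dtype=target_mask.dtype)
    full_mask[:, w:] = target_mask
    return canvas, full_mask
```

The canvas and mask would then be passed, together with a task-specific text prompt, to an inpainting-capable T2I model, whose self-attention layers can attend from the masked target region to the unmasked reference region on the same canvas.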


URL

https://arxiv.org/abs/2305.11577

PDF

https://arxiv.org/pdf/2305.11577.pdf

