Paper Reading AI Learner

Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance

2024-03-28 16:07:55
Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, Jingfeng Zhang

Abstract

Prior studies have made significant progress in image inpainting guided by either text or subject image. However, the research on editing with their combined guidance is still in the early stages. To tackle this challenge, we present LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of masked scene images, incorporating both the textual prompts and specified subjects. Our approach adopts a coarse-to-fine manner to ensure subject identity preservation and local semantic coherence. The process involves (i) Locate: concatenating the noise with the masked scene image to achieve precise regional editing, (ii) Assign: employing a decoupled cross-attention mechanism to accommodate multi-modal guidance, and (iii) Refine: using a novel RefineNet to supplement subject details. Additionally, to address the issue of scarce training data, we introduce a novel data construction pipeline. This pipeline extracts substantial pairs of data consisting of local text prompts and corresponding visual instances from a vast image dataset, leveraging publicly available large models. Extensive experiments and varied application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency. Project page can be found at this https URL.
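The Locate and Assign steps above can be sketched in code. This is a minimal, hedged illustration only: the channel counts, the additive fusion of the two attention branches, and the branch weight are assumptions for demonstration, not the paper's exact architecture. "Locate" is modeled as concatenating the noisy latent, the masked-scene latent, and the mask along the channel axis; "Assign" as two parallel cross-attention branches, one conditioned on text tokens and one on subject-image tokens, each with its own key/value projections.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Sketch of the 'Assign' step: separate cross-attention branches for
    text and subject-image conditions, fused additively. Dimensions and
    the fusion rule are illustrative assumptions."""
    def __init__(self, dim: int, ctx_dim: int, heads: int = 8, subject_scale: float = 1.0):
        super().__init__()
        # Each modality gets its own key/value projections (decoupled).
        self.text_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.subject_scale = subject_scale  # assumed weight on the subject branch

    def forward(self, x, text_ctx, image_ctx):
        t, _ = self.text_attn(x, text_ctx, text_ctx)
        i, _ = self.image_attn(x, image_ctx, image_ctx)
        return x + t + self.subject_scale * i

# 'Locate': stack noisy latent, masked-scene latent, and mask channel-wise
# so the denoiser sees exactly which region to edit (shapes illustrative).
noise = torch.randn(1, 4, 64, 64)          # noisy latent
scene_latent = torch.randn(1, 4, 64, 64)   # masked scene image, encoded
mask = torch.zeros(1, 1, 64, 64)           # binary inpainting mask
unet_input = torch.cat([noise, scene_latent, mask], dim=1)  # 9 channels

attn = DecoupledCrossAttention(dim=320, ctx_dim=768)
x = torch.randn(1, 256, 320)                       # flattened spatial tokens
text_ctx = torch.randn(1, 77, 768)                 # e.g. CLIP text tokens
image_ctx = torch.randn(1, 4, 768)                 # subject-image tokens
out = attn(x, text_ctx, image_ctx)
print(unet_input.shape, out.shape)
```

The point of decoupling is that text and subject guidance never compete for the same key/value space, so either condition can be strengthened or dropped independently at inference time.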

Abstract (translated)

Prior studies have made significant progress in image inpainting guided by text or by a subject image. However, research on editing with their combined guidance is still at an early stage. To address this challenge, we propose LAR-Gen, a novel image inpainting method that seamlessly inpaints masked scene images while incorporating both textual prompts and specified subjects. Our method adopts a coarse-to-fine manner to ensure subject identity preservation and local semantic coherence. The process comprises: (i) Locate: concatenating the noise with the masked scene image to achieve precise regional editing; (ii) Assign: employing a decoupled cross-attention mechanism to accommodate multi-modal guidance; and (iii) Refine: using a novel RefineNet to supplement subject details. In addition, to address the scarcity of training data, we introduce a novel data construction pipeline that extracts a large number of paired local text prompts and corresponding visual instances from a vast image dataset, leveraging publicly available large models. Extensive experiments and varied application scenarios demonstrate the superiority of LAR-Gen in both identity preservation and text semantic consistency. The project page can be found at this https URL.

URL

https://arxiv.org/abs/2403.19534

PDF

https://arxiv.org/pdf/2403.19534.pdf
