ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion

2024-04-26 08:02:07
Ziyue Zhang, Mingbao Lin, Rongrong Ji

Abstract

We introduce ObjectAdd, a training-free diffusion modification method that adds user-expected objects into a user-specified area. ObjectAdd is motivated by two observations: first, describing everything in one prompt can be difficult, and second, users often need to add objects to an already generated image. To suit real-world use, ObjectAdd maintains accurate image consistency after adding objects through three technical innovations: (1) embedding-level concatenation to ensure that the text embeddings coalesce correctly; (2) object-driven layout control with latent and attention injection to ensure that objects land in the user-specified area; (3) prompted image inpainting in an attention refocusing and object expansion fashion to ensure that the rest of the image stays the same. Given a text-prompted image, ObjectAdd lets users specify a box and an object, and achieves: (1) adding the object inside the box area; (2) exactly preserved content outside the box area; (3) flawless fusion between the two areas.
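
The box constraint described in the abstract (new content inside the box, unchanged content outside it) can be illustrated with a simple mask-based blend. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the helper names `box_mask` and `blend_with_box`, the `(top, left, bottom, right)` box convention, and the use of plain nested lists in place of diffusion latents are all hypothetical choices made for illustration.

```python
def box_mask(height, width, box):
    """Return a height x width grid of 0/1 flags; 1 marks cells inside the box.

    Assumed convention: box = (top, left, bottom, right), bottom/right exclusive.
    """
    top, left, bottom, right = box
    return [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]


def blend_with_box(original, edited, box):
    """Take values from `edited` inside the box and from `original` elsewhere."""
    height, width = len(original), len(original[0])
    mask = box_mask(height, width, box)
    return [
        [edited[r][c] if mask[r][c] else original[r][c] for c in range(width)]
        for r in range(height)
    ]


# Toy 4x4 "latents": zeros stand for the original image, ones for the edit.
original = [[0.0] * 4 for _ in range(4)]
edited = [[1.0] * 4 for _ in range(4)]
result = blend_with_box(original, edited, box=(1, 1, 3, 3))
```

In this toy setup, cells outside the box are copied from `original` unchanged, which mirrors the "exact content outside the box" goal. The seamless fusion at the box boundary that the abstract attributes to attention refocusing and object expansion is not modeled here.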

URL

https://arxiv.org/abs/2404.17230

PDF

https://arxiv.org/pdf/2404.17230.pdf

