Paper Reading AI Learner

Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion

2024-04-23 06:10:43
Hongyu Chen, Yiqi Gao, Min Zhou, Peng Wang, Xubin Li, Tiezheng Ge, Bo Zheng

Abstract

Recently, integrating visual controls into text-to-image (T2I) models, such as the ControlNet method, has received significant attention for its finer control capabilities. While various training-free methods strive to enhance prompt following in T2I models, the problem remains largely unstudied under visual control, especially when the visual controls are misaligned with the text prompts. In this paper, we address the challenge of "Prompt Following With Visual Control" and propose a training-free approach named Mask-guided Prompt Following (MGPF). Object masks are introduced to distinguish the aligned and misaligned parts of the visual controls and prompts. Meanwhile, a network, dubbed Masked ControlNet, is designed to use these object masks for object generation in the misaligned visual control regions. Furthermore, to improve attribute matching, a simple yet effective loss is designed to align the attention maps of attributes with the object regions constrained by ControlNet and the object masks. The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.
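
The abstract describes two mechanisms: suppressing ControlNet's control signal inside the misaligned region, and a loss that pulls an attribute token's cross-attention toward its object region. Below is a minimal PyTorch sketch of both ideas; the tensor shapes, the function names, and the "zero the control residual inside the mask" behavior are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def mask_control_features(control_feats, object_mask):
    # control_feats: (B, C, H, W) residual features from one ControlNet block.
    # object_mask:   (B, 1, Hm, Wm) binary mask; 1 marks the region where the
    #                visual control is misaligned with the prompt.
    # Zeroing the control residual inside the mask lets the text prompt,
    # rather than the visual control, drive generation in that region.
    mask = F.interpolate(object_mask.float(), size=control_feats.shape[-2:], mode="nearest")
    return control_feats * (1.0 - mask)

def attribute_alignment_loss(attn_map, object_mask):
    # attn_map:    (B, H, W) cross-attention map of one attribute token.
    # object_mask: (B, H, W) binary mask of the object the attribute describes.
    # Penalize the fraction of attention mass that falls outside the object.
    inside = (attn_map * object_mask).flatten(1).sum(-1)
    total = attn_map.flatten(1).sum(-1) + 1e-8
    return (1.0 - inside / total).mean()

# Toy usage with random tensors.
feats = torch.randn(1, 320, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0
masked_feats = mask_control_features(feats, mask)

attn = torch.rand(1, 64, 64)
loss = attribute_alignment_loss(attn, mask[:, 0])

Since MGPF is training-free, a loss like this would presumably steer the latents during sampling rather than fine-tune any model weights.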

URL

https://arxiv.org/abs/2404.14768

PDF

https://arxiv.org/pdf/2404.14768.pdf

