
OneActor: Consistent Character Generation via Cluster-Conditioned Guidance

2024-04-16 03:45:45
Jiahao Wang, Caixia Yan, Haonan Lin, Weizhan Zhang

Abstract

Text-to-image diffusion models benefit artists with high-quality image generation. Yet their stochastic nature prevents artists from creating consistent images of the same character. Existing methods try to tackle this challenge and generate consistent content in various ways. However, they either depend on external data or require expensive tuning of the diffusion model. To address this issue, we argue that a lightweight but intricate guidance is sufficient. To this end, we take the lead in formalizing the objective of consistent generation, derive a clustering-based score function, and propose a novel paradigm, OneActor. We design a cluster-conditioned model that incorporates posterior samples to guide the denoising trajectories towards the target cluster. To overcome the overfitting challenge shared by one-shot tuning pipelines, we devise auxiliary components to simultaneously augment the tuning and regulate the inference. This technique is later verified to significantly enhance the content diversity of generated images. Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory character consistency, superior prompt conformity, and high image quality. Our method is also at least 4 times faster than tuning-based baselines. Furthermore, to the best of our knowledge, we are the first to prove that the semantic space has the same interpolation property as the latent space does. This property can serve as another promising tool for fine generation control.

Abstract (translated)

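Text-to-image diffusion models benefit artists through high-quality image generation. However, their stochastic nature prevents artists from creating consistent images of the same character. Existing methods attempt to address this challenge and generate consistent content in various ways, but they either rely on external data or require expensive tuning of the diffusion model. For this problem, we argue that lightweight but intricate guidance is sufficient. To achieve this, we propose a novel paradigm named OneActor. We design a cluster-conditioned model that incorporates posterior samples and guides the denoising trajectories towards the target cluster. To overcome the overfitting challenge shared by one-shot tuning pipelines, we design auxiliary components that simultaneously augment the tuning and regulate the inference. This technique is later shown to significantly enhance the content diversity of generated images. Comprehensive experiments show that our method outperforms various baselines with satisfactory character consistency, superior prompt conformity, and high image quality. Moreover, our method is at least 4 times faster than tuning-based baselines. In addition, to the best of our knowledge, we are the first to prove that the semantic space has the same interpolation property as the latent space. This property can serve as another promising tool for fine generation control.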

URL

https://arxiv.org/abs/2404.10267

PDF

https://arxiv.org/pdf/2404.10267.pdf

