ID-Animator: Zero-Shot Identity-Preserving Human Video Generation

2024-04-23 17:59:43
Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Man Zhou, Jie Zhang

Abstract

Generating high-fidelity human videos with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to balance training efficiency and identity preservation: they either require tedious case-by-case fine-tuning or tend to lose identity details during video generation. In this study, we present ID-Animator, a zero-shot human video generation approach that performs personalized video generation from a single reference facial image without further training. ID-Animator inherits existing diffusion-based video generation backbones and adds a face adapter that encodes ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline, which incorporates a decoupled human attribute and action captioning technique from a constructed facial image pool. Based on this pipeline, a random face reference training method is further devised to precisely capture the ID-relevant embeddings from reference images, thus improving the fidelity and generalization capacity of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator over previous models in generating personalized human videos. Moreover, our method is highly compatible with popular pre-trained T2V models such as AnimateDiff and with various community backbone models, showing high extendability in real-world video generation applications where identity preservation is highly desired. Our code and checkpoints will be released at this https URL.
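
As a rough illustration of the random face reference idea mentioned in the abstract, the sketch below pairs a training clip with a reference face drawn at random from a pool of face crops. The abstract does not describe the sampling procedure, so uniform sampling from a per-clip pool is an assumption, and the names `TrainingSample`, `sample_reference`, `build_sample`, the caption, and the file paths are all hypothetical. It is not the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class TrainingSample:
    clip_id: str
    caption: str
    reference_face: str  # path to the face crop used as the identity condition


def sample_reference(face_pool, rng):
    """Pick one face crop uniformly at random (assumed sampling rule)."""
    if not face_pool:
        raise ValueError("face pool is empty")
    return rng.choice(face_pool)


def build_sample(clip_id, caption, face_pools, rng):
    """Pair a clip and its caption with a randomly chosen reference face."""
    return TrainingSample(clip_id, caption, sample_reference(face_pools[clip_id], rng))


# Hypothetical face pool: several face crops associated with one clip.
face_pools = {
    "clip_001": [
        "clip_001/face_00.png",
        "clip_001/face_07.png",
        "clip_001/face_15.png",
    ]
}

rng = random.Random(0)  # seeded so repeated runs pick the same crop
sample = build_sample("clip_001", "a person smiling and waving", face_pools, rng)
print(sample.reference_face)
```

Because the reference is redrawn each time a clip is visited, the model would see different crops of the same identity paired with the same video, which is one plausible way to discourage it from copying pose or background from a single fixed reference.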

URL

https://arxiv.org/abs/2404.15275

PDF

https://arxiv.org/pdf/2404.15275.pdf

