Abstract
Generating high-fidelity human videos with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to balance training efficiency and identity preservation: they either require tedious case-by-case fine-tuning or tend to lose identity details during video generation. In this study, we present ID-Animator, a zero-shot human-video generation approach that performs personalized video generation from a single reference facial image without further training. ID-Animator equips an existing diffusion-based video generation backbone with a face adapter that encodes ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline, which incorporates a decoupled human attribute and action captioning technique from a constructed facial image pool. Based on this pipeline, we further devise a random face reference training method that precisely captures the ID-relevant embeddings from reference images, improving the fidelity and generalization capacity of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator over previous models in generating personalized human videos. Moreover, our method is highly compatible with popular pre-trained T2V models such as AnimateDiff and with various community backbone models, showing high extensibility in real-world video generation applications where identity preservation is highly desired. Our code and checkpoints will be released at this https URL.
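The abstract does not spell out how the random face reference is drawn. The following is a minimal sketch of one plausible reading, assuming each training clip has its own pool of face crops and that one crop is picked uniformly at random as the identity reference for each training sample. All names here (`TrainingSample`, `sample_reference`, `build_sample`, the file paths) are hypothetical and do not come from the ID-Animator code.

```python
import random
from dataclasses import dataclass


@dataclass
class TrainingSample:
    clip_id: str          # hypothetical identifier of a video clip
    caption: str          # text prompt paired with the clip
    reference_face: str   # path of the face crop used as the identity reference


def sample_reference(face_pool, rng):
    """Pick one face crop uniformly at random from a clip's face pool.

    Assumption: uniform sampling; the abstract does not state the distribution.
    """
    if not face_pool:
        raise ValueError("face pool is empty")
    return rng.choice(face_pool)


def build_sample(clip_id, caption, face_pool, rng):
    """Pair a clip and its caption with a randomly drawn reference face."""
    return TrainingSample(clip_id, caption, sample_reference(face_pool, rng))


# Illustrative usage with made-up paths: the same clip is paired with
# a freshly drawn reference face each time it is visited.
rng = random.Random(0)
pool = ["clip42/face_000.png", "clip42/face_007.png", "clip42/face_015.png"]
samples = [build_sample("clip42", "a person smiling", pool, rng) for _ in range(5)]
```

Under this reading, the reference image varies across visits to the same clip, so the adapter cannot rely on pose or background cues tied to one particular frame and is pushed toward identity-relevant features.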
URL
https://arxiv.org/abs/2404.15275