Abstract
Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows for converting static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps such as rigging, key animation, and in-betweening. Recent advances in text-to-video generation hold great potential for resolving this problem. Nevertheless, directly applying text-to-video generation models often fails to retain the visual identity of clipart images or to generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which draws on the knowledge of natural motion encoded in a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be optimized end-to-end while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.
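To make the Bézier-curve motion regularization concrete, the following is a minimal sketch of how per-frame keypoint positions could be sampled from one curve per keypoint. It is an illustration only, not the authors' implementation: the cubic degree, the function names `cubic_bezier` and `keypoint_trajectories`, and the example control points are all assumptions, and the abstract does not state the curve degree. In the actual system the control points would be optimized through the VSDS loss and a differentiable deformation step, neither of which is shown here.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cubic_bezier(ctrl: List[Point], t: float) -> Point:
    """Evaluate a cubic Bezier curve with 4 control points at t in [0, 1].

    Assumption: cubic curves are used here purely for illustration.
    """
    # Bernstein basis weights for degree 3
    weights = [(1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3]
    x = sum(w * p[0] for w, p in zip(weights, ctrl))
    y = sum(w * p[1] for w, p in zip(weights, ctrl))
    return (x, y)


def keypoint_trajectories(
    curves: List[List[Point]], num_frames: int
) -> List[List[Point]]:
    """Return frames[f][k]: position of keypoint k at frame f.

    Each keypoint follows its own curve, sampled at evenly spaced t values,
    so its motion is smooth by construction.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    frames = []
    for f in range(num_frames):
        t = f / (num_frames - 1)
        frames.append([cubic_bezier(c, t) for c in curves])
    return frames


# Hypothetical control points for two keypoints (image coordinates).
curves = [
    [(10.0, 20.0), (12.0, 15.0), (18.0, 15.0), (20.0, 20.0)],  # moving keypoint
    [(30.0, 40.0), (30.0, 40.0), (30.0, 40.0), (30.0, 40.0)],  # static keypoint
]
frames = keypoint_trajectories(curves, num_frames=8)
```

Because every frame's keypoint positions are a smooth function of a handful of control points, the optimizer only has to adjust those control points rather than free per-frame positions, which is one plausible reading of why the curves act as a regularizer.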
URL
https://arxiv.org/abs/2404.12347