AniClipart: Clipart Animation with Text-to-Video Priors

2024-04-18 17:24:28
Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao

Abstract

Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.
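
The abstract's first step, defining Bézier curves over keypoints so that each keypoint follows a smooth trajectory, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes cubic curves with four 2D control points per keypoint, and the name `cubic_bezier` and the sample coordinates are hypothetical. In the method described above, the control points would be the quantities optimized against the VSDS loss; here they are fixed numbers chosen only to show how a curve yields per-frame keypoint positions.

```python
import numpy as np


def cubic_bezier(control_points, num_frames):
    """Sample keypoint positions along cubic Bezier curves.

    control_points: array-like of shape (K, 4, 2), i.e. four 2D control
        points for each of K keypoints (an assumed layout, not the paper's).
    num_frames: number of evenly spaced samples of t in [0, 1].
    Returns an array of shape (num_frames, K, 2).
    """
    p = np.asarray(control_points, dtype=float)
    # t has shape (F, 1, 1) so it broadcasts against (K, 2) control points.
    t = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    p0, p1, p2, p3 = p[:, 0], p[:, 1], p[:, 2], p[:, 3]
    # Standard cubic Bernstein form of the curve.
    return (
        (1 - t) ** 3 * p0
        + 3 * (1 - t) ** 2 * t * p1
        + 3 * (1 - t) * t ** 2 * p2
        + t ** 3 * p3
    )


# Hypothetical control points for two keypoints (illustrative values only).
control = np.array([
    [[0.0, 0.0], [1.0, 2.0], [3.0, 2.0], [4.0, 0.0]],
    [[1.0, 1.0], [1.0, 3.0], [2.0, 3.0], [2.0, 1.0]],
])
trajectory = cubic_bezier(control, num_frames=5)
print(trajectory.shape)
```

Each trajectory starts at the first control point and ends at the last, and the two inner control points shape the path between them. Parameterizing motion this way restricts each keypoint to a smooth path, which is the regularizing role the abstract assigns to the curves.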

URL

https://arxiv.org/abs/2404.12347

PDF

https://arxiv.org/pdf/2404.12347.pdf

