Abstract
This article introduces Lester, a novel method to automatically synthesize retro-style 2D animations from videos. The method approaches the challenge mainly as an object segmentation and tracking problem. Video frames are processed with the Segment Anything Model (SAM), and the resulting masks are tracked through subsequent frames with DeAOT, a hierarchical propagation method for semi-supervised video object segmentation. The geometry of the masks' contours is simplified with the Douglas-Peucker algorithm. Finally, facial traits, pixelation and a basic shadow effect can optionally be added. The results show that the method exhibits excellent temporal consistency and can correctly process videos with different poses and appearances, dynamic shots, partial shots and diverse backgrounds. The proposed method provides a simpler and more deterministic approach than video-to-video translation pipelines based on diffusion models, which suffer from temporal consistency problems and do not cope well with pixelated and schematic outputs. The method is also much more practical than techniques based on 3D human pose estimation, which require custom handcrafted 3D models and are very limited with respect to the type of scenes they can process.
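To illustrate the contour simplification step, the following is a minimal sketch of the Douglas-Peucker algorithm in plain Python. It is not taken from the Lester implementation: the function names, the `epsilon` tolerance and the treatment of the contour as an open polyline are assumptions made for illustration, and the distance is measured to the chord segment, a common variant of the perpendicular distance to the infinite line.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from point p to the segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its ends.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points: Sequence[Point], epsilon: float) -> List[Point]:
    """Simplify an open polyline, dropping points closer than epsilon
    to the chord joining the endpoints of their sub-polyline.

    Hypothetical helper for illustration; `epsilon` is an assumed
    tolerance in the same units as the coordinates (e.g. pixels).
    """
    if len(points) < 3:
        return list(points)

    # Find the interior point farthest from the chord first-last.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], points[0], points[-1])
        if d > max_dist:
            max_dist, index = d, i

    if max_dist <= epsilon:
        # Every interior point is within tolerance: keep only the endpoints.
        return [points[0], points[-1]]

    # Otherwise keep the farthest point and recurse on both halves.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # avoid duplicating the split point
```

A larger `epsilon` removes more vertices and gives a more schematic outline. A closed mask contour would need an extra step before calling this sketch, for example splitting it at two distant vertices.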
URL
https://arxiv.org/abs/2402.09883