Abstract
Audio-driven co-speech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging in gesture-to-video systems. To improve generation quality, previous works have adopted complex inputs and training strategies and required large datasets for pre-training, which is inconvenient for practical applications. We propose a simple one-stage training method and a diffusion-based temporal inference method that synthesize realistic and continuous gesture videos without additional training of temporal modules. The entire model reuses existing pre-trained weights, and only a few thousand frames of data per character are needed to complete fine-tuning. Built on this video generator, we introduce a new audio-to-video pipeline for synthesizing co-speech videos, using the 2D human skeleton as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.
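
As a rough illustration of the pipeline described above, the sketch below shows how a two-stage audio-to-video system with a 2D-skeleton intermediate and overlapping-window temporal inference might be organized. All function names, tensor shapes, and the overlap scheme are hypothetical placeholders chosen for this sketch, not the paper's implementation.

    # Structural sketch only: audio -> 2D skeleton -> video.
    # All names, shapes, and the windowing scheme are hypothetical.
    import numpy as np

    def speech_to_gesture(audio, fps=25, n_joints=18):
        """Stage 1 (placeholder): map a 16 kHz waveform to a sequence of
        2D skeleton keypoints with shape (T, n_joints, 2)."""
        n_frames = int(len(audio) / 16000 * fps)
        return np.zeros((n_frames, n_joints, 2), dtype=np.float32)

    def gesture_to_video(skeleton, window=16, overlap=4):
        """Stage 2 (placeholder): a diffusion-style generator would denoise
        each window of skeleton frames into RGB frames; overlapping windows
        are stitched at inference time to keep the output temporally
        continuous without training extra temporal modules."""
        frames, t, step = [], 0, window - overlap
        while t < len(skeleton):
            chunk = skeleton[t:t + window]
            # A real system would run conditional denoising here.
            rgb = np.zeros((len(chunk), 256, 256, 3), dtype=np.uint8)
            # Keep only the non-overlapping tail after the first window;
            # a real system would blend the overlapped frames instead.
            frames.append(rgb if t == 0 else rgb[overlap:])
            t += step
        return np.concatenate(frames, axis=0)

    if __name__ == "__main__":
        audio = np.zeros(16000 * 4, dtype=np.float32)   # 4 s of silence
        video = gesture_to_video(speech_to_gesture(audio))
        print(video.shape)                              # (100, 256, 256, 3)

The overlap-and-stitch loop is only one plausible reading of the abstract's "temporal inference method"; the actual mechanism is described in the paper itself.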
URL
https://arxiv.org/abs/2504.08344