Abstract
This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking-portrait synthesis within a unified framework, given only an identity embedding or a reference image. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, which limits motion diversity. Meanwhile, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often yields imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we relax strict reference conditioning during pre-training by introducing softer identity constraints. Second, we achieve 3D awareness implicitly within the 2D video domain by exploiting the inherent multi-view nature of video data. STARCaster adopts a compositional approach that progresses from identity-aware motion modeling, to audio-visual synchronization via lip-reading-based supervision, and finally to novel-view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches across multiple benchmarks.
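The abstract only sketches the self-forcing scheme, so the snippet below is a minimal, assumption-laden PyTorch illustration of the idea: the model rolls forward on its own outputs to build a temporal context longer than the inference-time window before taking a supervised gradient step. The `TinyNextFramePredictor`, the `generate_next` interface, and the window sizes `INFER_CTX`/`TRAIN_CTX` are all hypothetical placeholders, not the paper's actual architecture or settings.

```python
import torch
import torch.nn as nn

# Illustrative window sizes (assumptions, not from the paper).
INFER_CTX, TRAIN_CTX = 8, 24

class TinyNextFramePredictor(nn.Module):
    """Stand-in for the autoregressive video model (purely illustrative)."""
    def __init__(self, c=3):
        super().__init__()
        self.proj = nn.Conv2d(c, c, kernel_size=3, padding=1)

    def generate_next(self, ctx):
        # Predict one frame (B, 1, C, H, W) from the mean of the context frames.
        return self.proj(ctx.mean(dim=1)).unsqueeze(1)

def self_forcing_step(model, clip, optimizer):
    """One training step; clip: (B, T, C, H, W) real frames with T > TRAIN_CTX."""
    # Roll the model on its own outputs beyond the inference-time window, so
    # training sees longer self-generated contexts than inference produces.
    with torch.no_grad():
        ctx = clip[:, :INFER_CTX]  # seed with real frames
        while ctx.shape[1] < TRAIN_CTX:
            ctx = torch.cat([ctx, model.generate_next(ctx)], dim=1)
    # Supervise the next prediction against ground truth from the long context.
    pred = model.generate_next(ctx)
    loss = nn.functional.mse_loss(pred, clip[:, TRAIN_CTX:TRAIN_CTX + 1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyNextFramePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
clip = torch.randn(2, TRAIN_CTX + 1, 3, 16, 16)  # dummy data for the sketch
print(self_forcing_step(model, clip, opt))
```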
URL
https://arxiv.org/abs/2512.13247