Abstract
Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes the motion awareness of video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify the features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between the two networks, these features can be adapted naturally in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.
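To make the described pipeline concrete, the PyTorch sketch below illustrates the general idea of capturing intermediate features from a tracking-finetuned backbone and blending them into a structurally identical generation backbone via forward hooks. This is only an illustrative sketch: the block layout, the tapped layer indices, and the blending weight `alpha` are assumptions for demonstration and are not details taken from the paper.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block of a video diffusion backbone.
# Dimensions and structure are illustrative assumptions, not the paper's architecture.
class Block(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class Backbone(nn.Module):
    """Shared architecture for both the motion-perception and generation networks."""
    def __init__(self, depth=8, dim=64):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def capture_features(model, x, layer_ids):
    """Run the motion-perception (video-to-dense-tracking) network and record
    intermediate features at the chosen blocks."""
    feats, handles = {}, []
    for i in layer_ids:
        handles.append(model.blocks[i].register_forward_hook(
            lambda m, inp, out, i=i: feats.__setitem__(i, out.detach())))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return feats

def inject_features(model, x, feats, alpha=0.5):
    """Run the structurally identical generation network, blending the captured
    motion features into the matching blocks (no additional adapters)."""
    handles = []
    for i, f in feats.items():
        handles.append(model.blocks[i].register_forward_hook(
            lambda m, inp, out, f=f: (1 - alpha) * out + alpha * f))
    out = model(x)
    for h in handles:
        h.remove()
    return out

if __name__ == "__main__":
    tokens = torch.randn(1, 16, 64)      # (batch, tokens, dim) toy latent tokens
    tracking_net = Backbone()            # assumed fine-tuned for video-to-dense-tracking
    generation_net = Backbone()          # same architecture, used for video generation
    motion_feats = capture_features(tracking_net, tokens, layer_ids=[3, 4, 5])
    _ = inject_features(generation_net, tokens, motion_feats)
```

Because the two networks share one architecture, the captured activations align block-for-block with the generation model, which is what allows a simple hook-based blend to stand in for the zero-shot feature injection described above.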
URL
https://arxiv.org/abs/2601.12761