Abstract
Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, all of which maintain the content structure of the source video. However, more complex tasks, including the addition of novel objects and non-rigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layer-informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model, guided by layer vitality. For object addition, we further identify prominent layers from which to extract mask regions corresponding to the newly added target prompt. We find that the masks extracted from these prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing.
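The key/value injection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no layer indices, tensor shapes, or selection criterion, so the `vital_layers` set, the `edit_pass` helper, and the random features below are all hypothetical, and RoPE and the mask extraction step are omitted. The sketch only shows the general idea of replacing the target branch's keys and values with cached source features at a chosen subset of attention layers.

```python
import math
import random


def attention(q, k, v):
    """Scaled dot-product attention; q, k, v are lists of token vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def edit_pass(target_qkv, source_kv, vital_layers):
    """Per-layer attention for the target branch.

    At layers listed in `vital_layers` (a hypothetical, hand-picked set here),
    the target's keys and values are replaced by the cached source features.
    """
    outputs = []
    for i, (q, k, v) in enumerate(target_qkv):
        if i in vital_layers:
            k, v = source_kv[i]  # inject source K/V, keep the target query
        outputs.append(attention(q, k, v))
    return outputs


def rand_tokens(rng, n_tokens, dim):
    """Random stand-in features; a real model would supply these."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]


rng = random.Random(0)
n_layers, n_tokens, dim = 4, 6, 8
target_qkv = [
    tuple(rand_tokens(rng, n_tokens, dim) for _ in range(3))
    for _ in range(n_layers)
]
source_kv = [
    tuple(rand_tokens(rng, n_tokens, dim) for _ in range(2))
    for _ in range(n_layers)
]
vital_layers = {1, 3}  # assumption: chosen arbitrarily for illustration

outputs = edit_pass(target_qkv, source_kv, vital_layers)
```

In this sketch the target queries are always kept, so the edited prompt still decides what each token attends to, while the injected layers draw their content from the source video's features. Which layers to inject into is exactly what the paper's vitality analysis is meant to determine.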
URL
https://arxiv.org/abs/2506.07205