FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

2023-03-16 17:51:13
Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, Qifeng Chen

Abstract

Diffusion-based generative models have achieved remarkable success in text-based image generation. However, since the generation process involves enormous randomness, it remains challenging to apply such models to real-world visual content editing, especially in videos. In this paper, we propose FateZero, a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor user-provided masks. To edit videos consistently, we propose several techniques built on pre-trained models. First, in contrast to the straightforward DDIM inversion technique, our approach captures intermediate attention maps during inversion, which effectively retain both structural and motion information. These maps are fused directly into the editing process rather than regenerated during denoising. Second, to further minimize semantic leakage from the source video, we fuse self-attention maps with a blending mask obtained from the cross-attention features of the source prompt. Furthermore, we reform the self-attention mechanism in the denoising UNet by introducing spatial-temporal attention to ensure frame consistency. Despite its simplicity, our method is the first to demonstrate zero-shot text-driven video style and local attribute editing from a trained text-to-image model; it also achieves better zero-shot shape-aware editing based on a text-to-video model. Extensive experiments demonstrate superior temporal consistency and editing capability compared with previous works.
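
The recipe described above (cache attention maps while running DDIM inversion on the source video, then splice them back in during editing, gated by a mask derived from the source prompt's cross-attention) can be sketched in a few lines of PyTorch. The names below (AttentionStore, blend_self_attention) and the tensor layout (heads, queries, keys) are illustrative assumptions, not the paper's actual code; the official implementation lives at https://github.com/ChenyangQiQi/FateZero.

```python
# Minimal sketch of "store during inversion, fuse during editing".
# AttentionStore and blend_self_attention are hypothetical names.
import torch

class AttentionStore:
    """Caches attention maps from every DDIM inversion step."""
    def __init__(self):
        self.maps = {}  # (step, layer) -> attention probabilities

    def save(self, step, layer, attn):
        self.maps[(step, layer)] = attn.detach().cpu()

    def load(self, step, layer):
        return self.maps[(step, layer)]

def blend_self_attention(store, step, layer, edit_self_attn,
                         src_cross_attn_word, threshold=0.3):
    """Fuse source and edited self-attention with a cross-attention mask.

    edit_self_attn:      (heads, queries, keys) from the editing pass.
    src_cross_attn_word: (queries,) cross-attention of the edited word
                         in the *source* prompt, cached at inversion time.
    Only positions strongly attending to the edited word take the newly
    generated self-attention; everything else keeps the inverted (source)
    attention, preserving the original structure and motion.
    """
    src_self_attn = store.load(step, layer).to(edit_self_attn.device)
    mask = (src_cross_attn_word > threshold).float()  # binary mask per query
    mask = mask.unsqueeze(0).unsqueeze(-1)            # broadcast over heads/keys
    return mask * edit_self_attn + (1.0 - mask) * src_self_attn
```

The key design choice is that the blending mask comes from the source branch's cross-attention, so regions unrelated to the edit keep the inverted attention and thus the source video's layout and motion.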
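Similarly, the reform of per-frame self-attention into spatial-temporal attention can be illustrated by letting every frame's queries attend over its own keys and values concatenated with those of a reference frame. This is a minimal sketch assuming flattened (frames * tokens, dim) features and the first frame as the temporal anchor; the paper's exact choice of reference frames may differ.

```python
# Sketch of spatial-temporal self-attention: each frame attends to its
# own tokens plus those of the first frame, used as a temporal anchor.
import torch

def spatial_temporal_attention(q, k, v, num_frames):
    """q, k, v: (frames * tokens, dim) flattened per-frame features."""
    dim = q.shape[-1]
    q = q.view(num_frames, -1, dim)            # (F, N, D)
    k = k.view(num_frames, -1, dim)
    v = v.view(num_frames, -1, dim)
    k_ref = k[:1].expand_as(k)                 # broadcast first frame's keys
    v_ref = v[:1].expand_as(v)
    k = torch.cat([k, k_ref], dim=1)           # (F, 2N, D)
    v = torch.cat([v, v_ref], dim=1)
    attn = torch.softmax(q @ k.transpose(-1, -2) / dim ** 0.5, dim=-1)
    out = attn @ v                             # (F, N, D)
    return out.reshape(-1, dim)
```

Because every frame shares the anchor frame's keys and values, appearance stays consistent across the clip without any extra training, which is what makes the approach zero-shot.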

URL

https://arxiv.org/abs/2303.09535

PDF

https://arxiv.org/pdf/2303.09535.pdf

