Paper Reading AI Learner

LumosFlow: Motion-Guided Long Video Generation

2025-06-03 06:25:00
Jiahao Chen, Hangjie Yuan, Yichen Qian, Jingyun Liang, Jiazheng Xing, Pengwei Liu, Weihua Chen, Fan Wang, Bing Su

Abstract

Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite recent advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches typically synthesize long videos either by sequentially generating and concatenating short clips, or by generating key frames and then interpolating the intermediate frames in a hierarchical manner. However, both approaches still face significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework that introduces motion guidance explicitly. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex, large-motion optical flows, while MotionControlNet subsequently refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15x interpolation, ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance. Our project page: this https URL
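The abstract describes warping frames with synthesized optical flow before a refinement stage. As a minimal, dependency-light sketch of that warp step (not the authors' implementation, which operates on latents and uses learned flows), the following NumPy function backward-warps a frame with a dense flow field; the function name, nearest-neighbor sampling, and border clipping are illustrative assumptions.

```python
import numpy as np

def backward_warp(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp a (H, W) frame using a dense flow field of shape (H, W, 2),
    where flow[y, x] = (dx, dy) points from each target pixel back to
    its source location in `frame`.

    Nearest-neighbor sampling and border clipping keep the sketch
    simple; real pipelines typically use bilinear sampling.
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Source coordinates for each target pixel, clipped to the image.
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]

# A constant flow of (+1, 0) samples each pixel from its right neighbor,
# shifting the content one pixel to the left.
img = np.arange(16, dtype=float).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
warped = backward_warp(img, flow)
```

In a hierarchical pipeline like the one described, such a warp would be applied per intermediate timestep using the flows produced by LOF-DM, with the refinement network (MotionControlNet in the paper) correcting occlusion and resampling artifacts afterward.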


URL

https://arxiv.org/abs/2506.02497

PDF

https://arxiv.org/pdf/2506.02497.pdf

