
MultiCOIN: Multi-Modal COntrollable Video INbetweening

2025-10-09 17:59:27
Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao

Abstract

Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to outputs misaligned with the user's creative intent. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls, which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches that encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
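To make the control pipeline concrete, below is a minimal PyTorch sketch of the two ideas the abstract highlights: rasterizing heterogeneous motion controls (e.g. drag trajectories) onto a sparse, point-based control video aligned with the video/noise input, and encoding motion and content controls in separate branches whose features guide the denoiser. All function names, shapes, module sizes, and the additive fusion scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of MultiCOIN-style multi-modal conditioning.
# Shapes, layer sizes, and additive fusion are assumptions for
# illustration; they are not taken from the paper.
import torch
import torch.nn as nn

def rasterize_trajectories(trajectories, num_frames, height, width):
    """Render drag trajectories into a sparse point-based control video.

    trajectories: list of (num_frames, 2) integer tensors of (x, y) positions.
    Returns a (1, num_frames, height, width) tensor that can be fed to the
    model alongside the noisy video latent as conditioning input.
    """
    control = torch.zeros(1, num_frames, height, width)
    for traj in trajectories:
        for t in range(num_frames):
            x, y = traj[t].tolist()
            if 0 <= x < width and 0 <= y < height:
                control[0, t, y, x] = 1.0  # one sparse marker per point
    return control

class DualBranchControlEncoder(nn.Module):
    """Separate branches for motion controls (the sparse point video) and
    content controls (e.g. a text embedding); their features are fused
    into the denoiser's latent stream."""

    def __init__(self, latent_dim=64, text_dim=512):
        super().__init__()
        self.motion_branch = nn.Conv3d(1, latent_dim, kernel_size=3, padding=1)
        self.content_branch = nn.Linear(text_dim, latent_dim)

    def forward(self, noisy_latent, motion_video, text_embed):
        # noisy_latent: (B, latent_dim, T, H, W); motion_video: (B, 1, T, H, W)
        m = self.motion_branch(motion_video)
        c = self.content_branch(text_embed)[:, :, None, None, None]
        return noisy_latent + m + c  # guidance injected before denoising

# Toy usage with random inputs.
B, T, H, W = 1, 8, 32, 32
traj = torch.linspace(0, 31, T).long().unsqueeze(1).repeat(1, 2)  # diagonal drag
motion = rasterize_trajectories([traj], T, H, W).unsqueeze(0)     # (1, 1, T, H, W)
encoder = DualBranchControlEncoder()
latent = torch.randn(B, 64, T, H, W)
text = torch.randn(B, 512)
guided = encoder(latent, motion, text)
print(guided.shape)  # torch.Size([1, 64, 8, 32, 32])
```

Under the abstract's stage-wise training strategy, one would presumably introduce these control branches progressively during training rather than all at once, though the exact schedule is not specified here.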

Abstract (translated)

Video inbetweening technology creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. However, existing methods are limited in generating large, complex, or intricate motions; they struggle to accommodate diverse user needs and generally lack fine-grained control over the details of intermediate frames, leading to results that are inconsistent with the creator's intent. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a good balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, given its proven ability to generate high-quality long videos. To ensure compatibility between DiT and the multi-modal controls, we map all motion controls into a common, user-friendly, sparse point-based representation, which serves as the video/noise input. Furthermore, to respect the different levels of granularity and influence among the control types, we separate content controls and motion controls into two branches that encode the required features and guide the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls make visual storytelling more dynamic, customizable, and contextually accurate.

URL

https://arxiv.org/abs/2510.08561

PDF

https://arxiv.org/pdf/2510.08561.pdf

