Abstract
Avatar video generation models have achieved remarkable progress in recent years. However, prior work is limited in its ability to efficiently generate long-duration, high-resolution videos, suffering from temporal drift, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling along both the spatial and temporal dimensions. The framework first generates low-resolution blueprint keyframes that capture global semantics and motion, then refines them into high-resolution, temporally coherent sub-clips using a first-last-frame strategy, preserving smooth temporal transitions in long-form videos. To strengthen cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. Through multi-turn dialogue, these experts reason about modality priorities, infer the underlying user intent, and convert the inputs into a detailed storyline. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned, long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-and-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
URL
https://arxiv.org/abs/2512.13313