Abstract
Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
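The abstract's cascaded, global-to-local design can be summarized as a two-stage pipeline: an MLLM director first produces a sparse blueprint (keyframes) from the multimodal instructions, and sub-clips bounded by consecutive keyframes are then synthesized in parallel and concatenated. The following is a minimal sketch of that control flow only; every name here (`mllm_director`, `generate_subclip`, `kling_avatar_pipeline`) is a hypothetical placeholder and not the paper's actual API, and the stubs return dummy data purely to illustrate the structure.

```python
# Hypothetical sketch of the two-stage, global-to-local pipeline described in the
# abstract. All function names are illustrative placeholders, not the paper's API.
from concurrent.futures import ThreadPoolExecutor
from typing import List


def mllm_director(audio: bytes, text_prompt: str, ref_image: bytes) -> List[bytes]:
    """Stage 1 (assumed): an MLLM 'director' turns multimodal instructions into a
    low-frame-rate blueprint video; here we simply return dummy keyframes."""
    return [b"keyframe_%d" % i for i in range(4)]


def generate_subclip(first_frame: bytes, last_frame: bytes, audio_segment: bytes) -> bytes:
    """Stage 2 (assumed): synthesize one sub-clip conditioned on its first and last
    blueprint keyframes plus the matching audio segment (dummy output here)."""
    return b"subclip(" + first_frame + b"->" + last_frame + b")"


def kling_avatar_pipeline(audio: bytes, text_prompt: str, ref_image: bytes) -> List[bytes]:
    keyframes = mllm_director(audio, text_prompt, ref_image)
    # Placeholder audio split: one segment per sub-clip (real segmentation omitted).
    segments = [audio] * (len(keyframes) - 1)
    # First-last frame strategy: each sub-clip is bounded by two consecutive
    # keyframes, so all sub-clips can be generated in parallel and then concatenated.
    with ThreadPoolExecutor() as pool:
        clips = list(pool.map(generate_subclip, keyframes[:-1], keyframes[1:], segments))
    return clips


if __name__ == "__main__":
    print(kling_avatar_pipeline(b"audio", "smile while speaking", b"portrait"))
```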
URL
https://arxiv.org/abs/2509.09595