Abstract
In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite encouraging recent advances, existing methods face two critical limitations. First, most approaches can only generate ambient sounds and cannot produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which add architectural complexity and erode the simplicity of the original transformer design. To address these issues, JoVA applies joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without additional alignment modules. Furthermore, to achieve high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which strengthens supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. These results establish JoVA as an elegant framework for high-quality multimodal generation.
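The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the single-head attention, the concatenation order, the `mouth_area_loss` weighting, and the box-shaped mouth regions are all assumptions made for illustration; the abstract only states that attention is joint across modalities and that the loss upweights the keypoint-detected mouth area.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(video_tokens, audio_tokens, Wq, Wk, Wv):
    """Single-head joint self-attention: video and audio tokens are
    concatenated into one sequence, so every token attends to both
    modalities in a single attention call (no separate fusion module)."""
    x = np.concatenate([video_tokens, audio_tokens], axis=0)  # (Tv+Ta, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)   # joint attention map
    out = attn @ v
    n_video = video_tokens.shape[0]
    return out[:n_video], out[n_video:]                       # split back per modality

def mouth_area_loss(pred_frames, target_frames, mouth_boxes, weight=5.0):
    """Hypothetical mouth-area loss: plain MSE over the whole frame plus an
    extra weighted MSE over a mouth crop located via facial keypoints
    (here given directly as (y0, y1, x0, x1) boxes for simplicity)."""
    base = np.mean((pred_frames - target_frames) ** 2)
    mouth = np.mean([
        np.mean((p[y0:y1, x0:x1] - t[y0:y1, x0:x1]) ** 2)
        for p, t, (y0, y1, x0, x1) in zip(pred_frames, target_frames, mouth_boxes)
    ])
    return base + weight * mouth

rng = np.random.default_rng(0)
d = 8
video = rng.normal(size=(4, d))   # 4 video tokens
audio = rng.normal(size=(3, d))   # 3 audio tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
v_out, a_out = joint_self_attention(video, audio, Wq, Wk, Wv)
print(v_out.shape, a_out.shape)   # each modality's tokens now carry cross-modal context

pred = rng.normal(size=(2, 16, 16))
target = rng.normal(size=(2, 16, 16))
boxes = [(10, 14, 4, 12), (9, 13, 5, 11)]  # per-frame mouth regions
print(mouth_area_loss(pred, target, boxes))
```

Because the attention map spans the concatenated sequence, every audio token can read from every video token (and vice versa) in the same layer, which is why no separate alignment module is needed; the mouth-area term simply raises the gradient signal on the region that determines lip-sync quality.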
URL
https://arxiv.org/abs/2512.13677