JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

2025-12-15 18:58:18
Xiaohu Huang, Hao Zhou, Qiangpeng Yang, Shilei Wen, Kai Han

Abstract

In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite encouraging recent advances, existing methods face two critical limitations. First, most approaches can only generate ambient sounds and cannot produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which add architectural complexity and erode the simplicity of the original transformer. To address these issues, JoVA applies joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without additional alignment modules. Furthermore, to achieve high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which strengthens supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on standard benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. These results establish JoVA as an elegant framework for high-quality multimodal generation.
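The abstract's central architectural idea is joint self-attention over video and audio tokens inside an ordinary transformer layer. Below is a minimal PyTorch-style sketch of what such a block could look like, assuming the two token streams are simply concatenated along the sequence axis before a single attention operation; the class name, dimensions, and pre-norm layout are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of joint self-attention over concatenated video and audio tokens.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # Concatenate along the sequence axis so every video token can
        # attend to every audio token (and vice versa) in one attention op,
        # with no modality-specific fusion or alignment module.
        x = torch.cat([video_tokens, audio_tokens], dim=1)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        x = x + out
        # Split back into per-modality streams for the rest of the layer.
        n_v = video_tokens.shape[1]
        return x[:, :n_v], x[:, n_v:]
```

The appeal of this design, as the abstract argues, is that cross-modal interaction falls out of standard self-attention for free: no extra fusion branch or alignment head needs to be bolted onto the transformer.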
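The mouth-area loss is described only at a high level. One plausible reading, sketched below purely as an assumption, is a reconstruction loss up-weighted inside a mouth bounding box derived from detected facial keypoints; the function name, box interface, and weighting scheme are hypothetical, not taken from the paper.

```python
# Hedged sketch of a mouth-area loss: up-weight reconstruction error
# inside a mouth bounding box obtained from a facial keypoint detector.
# The mask construction and weighting are assumptions for illustration.
import torch

def mouth_area_loss(pred, target, mouth_boxes, weight: float = 2.0):
    """pred, target: (B, T, C, H, W) video tensors.
    mouth_boxes: (B, T, 4) integer (x1, y1, x2, y2) boxes from a
    keypoint detector (hypothetical interface)."""
    per_pixel = (pred - target).pow(2)          # plain MSE everywhere
    mask = torch.zeros_like(per_pixel)
    B, T = mouth_boxes.shape[:2]
    for b in range(B):
        for t in range(T):
            x1, y1, x2, y2 = mouth_boxes[b, t].tolist()
            mask[b, t, :, y1:y2, x1:x2] = 1.0   # mark the mouth region
    # Extra supervision on the mouth region, standard loss elsewhere.
    return (per_pixel * (1.0 + weight * mask)).mean()
```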

URL

https://arxiv.org/abs/2512.13677

PDF

https://arxiv.org/pdf/2512.13677.pdf

