Paper Reading AI Learner

Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model

2024-04-25 03:14:49
Gehui Chen, Guan'an Wang, Xiaowen Huang, Jitao Sang

Abstract

Existing works have made strides in video generation, but the lack of sound effects (SFX) and background music (BGM) hinders a complete and immersive viewer experience. We introduce a novel semantically consistent video-to-audio generation framework, namely SVA, which automatically generates audio semantically consistent with the given video content. The framework harnesses the power of a multimodal large language model (MLLM) to understand video semantics from a key frame and generate creative audio schemes, which are then used as prompts for text-to-audio models, resulting in video-to-audio generation with natural language as an interface. We show the satisfactory performance of SVA through a case study and discuss its limitations along with future research directions. The project page is available at this https URL.
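The data flow described in the abstract (key frame, then MLLM, then text prompts, then text-to-audio) can be illustrated with a minimal sketch. This is not the authors' implementation: `AudioScheme`, `pick_key_frame`, `video_to_audio`, and the two callables passed in are hypothetical names introduced here, and choosing the middle frame as the key frame is an assumption, since the abstract does not say how the key frame is selected.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AudioScheme:
    """Natural-language audio plan for one video (hypothetical structure)."""
    sfx_prompt: str  # text prompt describing the sound effects
    bgm_prompt: str  # text prompt describing the background music


def pick_key_frame(frames: Sequence[str]) -> str:
    """Return the middle frame (an assumption; the abstract gives no selection rule)."""
    if not frames:
        raise ValueError("video has no frames")
    return frames[len(frames) // 2]


def video_to_audio(
    frames: Sequence[str],
    describe_frame: Callable[[str], AudioScheme],  # stands in for the MLLM
    text_to_audio: Callable[[str], bytes],         # stands in for a text-to-audio model
) -> List[bytes]:
    """Key frame -> audio scheme (text) -> one audio clip per prompt."""
    key_frame = pick_key_frame(frames)
    scheme = describe_frame(key_frame)
    # Natural language is the only interface between the two models.
    return [text_to_audio(scheme.sfx_prompt), text_to_audio(scheme.bgm_prompt)]


# Stub models so the sketch runs without any external service.
def fake_mllm(frame: str) -> AudioScheme:
    return AudioScheme(sfx_prompt=f"sfx for {frame}", bgm_prompt=f"bgm for {frame}")


def fake_tta(prompt: str) -> bytes:
    return prompt.encode()


clips = video_to_audio(["f0", "f1", "f2"], fake_mllm, fake_tta)
```

Passing the two models in as plain callables keeps the sketch independent of any particular MLLM or text-to-audio library; real models would replace `fake_mllm` and `fake_tta`.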


URL

https://arxiv.org/abs/2404.16305

PDF

https://arxiv.org/pdf/2404.16305.pdf

