Abstract
Existing works have made strides in video generation, but the lack of sound effects (SFX) and background music (BGM) hinders a complete and immersive viewer experience. We introduce a novel semantically consistent video-to-audio generation framework, namely SVA, which automatically generates audio semantically consistent with the given video content. The framework harnesses the power of a multimodal large language model (MLLM) to understand video semantics from a key frame and generate creative audio schemes, which are then utilized as prompts for text-to-audio models, resulting in video-to-audio generation with natural language as an interface. We show the satisfactory performance of SVA through a case study and discuss its limitations along with future research directions. The project page is available at this https URL.
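The data flow described above (key frame, MLLM, audio scheme, text-to-audio prompts) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SVA implementation: the names `AudioScheme`, `pick_key_frame`, `parse_scheme`, and `video_to_audio` are hypothetical, the middle-frame heuristic and the JSON reply format are assumptions not found in the abstract, and the MLLM and text-to-audio models are passed in as plain callables rather than tied to any real API.

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class AudioScheme:
    """Text prompts for the two audio tracks named in the abstract."""
    sfx_prompt: str  # sound-effect description
    bgm_prompt: str  # background-music description


def pick_key_frame(frames: Sequence):
    # Assumption: take the middle frame as the key frame. The abstract
    # does not say how the key frame is selected.
    if not frames:
        raise ValueError("video has no frames")
    return frames[len(frames) // 2]


def parse_scheme(mllm_reply: str) -> AudioScheme:
    # Assumption: the MLLM is asked to reply with a JSON object that has
    # "sfx" and "bgm" string fields. The real scheme format may differ.
    data = json.loads(mllm_reply)
    return AudioScheme(sfx_prompt=data["sfx"], bgm_prompt=data["bgm"])


def video_to_audio(
    frames: Sequence,
    describe_frame: Callable[[object], str],  # hypothetical MLLM wrapper
    text_to_sfx: Callable[[str], object],     # hypothetical text-to-audio model
    text_to_bgm: Callable[[str], object],     # hypothetical text-to-music model
) -> Tuple[object, object]:
    """Return (sfx, bgm) generated from natural-language prompts."""
    key_frame = pick_key_frame(frames)
    scheme = parse_scheme(describe_frame(key_frame))
    return text_to_sfx(scheme.sfx_prompt), text_to_bgm(scheme.bgm_prompt)
```

Because every model is an injected callable, the text prompts remain the only interface between the stages, which mirrors the "natural language as an interface" design the abstract describes; any MLLM or text-to-audio backend could be swapped in without changing the pipeline.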
URL
https://arxiv.org/abs/2404.16305