Paper Reading AI Learner

Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation

2025-01-09 18:13:57
Darius Petermann, Mahdi M. Kalayeh

Abstract

Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence that is inherent to them. In this work, we hypothesize that insisting on the absolute need for ground-truth audio-visual correspondence is not only unnecessary, but also leads to severe restrictions on the scale, quality, and diversity of the data, ultimately impairing its use in modern generative models. We therefore propose a scalable image sonification framework in which instances from a variety of high-quality yet disjoint uni-modal origins are artificially paired through a retrieval process empowered by the reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against the state of the art. Finally, through a series of ablation studies, we exhibit several intriguing auditory capabilities, such as semantic mixing and interpolation, loudness calibration, and acoustic space modeling through reverberation, that our model has implicitly developed to guide the image generation process.
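The key mechanism the abstract describes is the retrieval step that pairs images with audio clips drawn from a disjoint library. Below is a minimal sketch of what such a step could look like, assuming a vision-language model has already produced a caption describing an image's audible content, and assuming a CLAP-style joint text-audio embedding model maps that caption and the audio library into a shared space. All names and the demo data here are illustrative placeholders, not the authors' actual implementation.

```python
import numpy as np

def retrieve_audio(caption_emb: np.ndarray,
                   audio_embs: np.ndarray,
                   top_k: int = 1) -> list[int]:
    """Return indices of the audio clips whose embeddings are most
    similar (cosine similarity) to the embedding of a VLM-generated
    caption of the image's audible content."""
    # Normalize so the dot product equals cosine similarity.
    caption_emb = caption_emb / (np.linalg.norm(caption_emb) + 1e-8)
    audio_embs = audio_embs / (
        np.linalg.norm(audio_embs, axis=1, keepdims=True) + 1e-8)
    sims = audio_embs @ caption_emb          # one score per clip
    return np.argsort(-sims)[:top_k].tolist()

# Toy demo: random vectors stand in for real text/audio embeddings.
rng = np.random.default_rng(0)
caption = rng.normal(size=512)               # e.g., "a dog barking in a park"
library = rng.normal(size=(1000, 512))       # 1000 candidate audio clips
print(retrieve_audio(caption, library, top_k=3))
```

Cosine similarity in a shared embedding space is a standard retrieval criterion; the paper's actual pairing process may differ in how the VLM reasoning and the matching are combined.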


URL

https://arxiv.org/abs/2501.05413

PDF

https://arxiv.org/pdf/2501.05413.pdf

