Abstract
Text-to-audio (T2A) generation has achieved promising results with recent advances in generative models. However, because temporally aligned audio-text pairs are limited in both quality and quantity, existing T2A methods struggle to handle complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, but their synthesis quality remains limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows, based on the input text and timing prompts, and to recaption each window with a refined natural language description. We then introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio achieves long-form generation quality comparable to that of the training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available online.
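To make the planning step concrete, here is a minimal sketch of what "non-overlapping time windows" could look like for the owl and crickets prompt. It is not the FreeAudio implementation: the abstract says an LLM performs the planning and recaptioning, whereas this sketch swaps in a simple deterministic boundary split and joins the active event descriptions with "and". The function name `plan_windows` and the tuple layouts are hypothetical, introduced only for illustration.

```python
from typing import List, Tuple

# Hypothetical layouts (not from the paper):
#   Event  = (description, start_seconds, end_seconds)
#   Window = (start_seconds, end_seconds, descriptions active in that window)
Event = Tuple[str, float, float]
Window = Tuple[float, float, List[str]]


def plan_windows(events: List[Event]) -> List[Window]:
    """Split possibly overlapping timed events into non-overlapping windows."""
    # Every event start or end is a candidate window boundary.
    bounds = sorted({t for _, start, end in events for t in (start, end)})
    windows: List[Window] = []
    for lo, hi in zip(bounds, bounds[1:]):
        # An event is active in [lo, hi] if it covers the whole interval.
        active = [desc for desc, start, end in events if start <= lo and hi <= end]
        if active:  # skip silent gaps between events
            windows.append((lo, hi, active))
    return windows


events = [("owl hooting", 2.4, 5.2), ("crickets chirping", 0.0, 24.0)]
for lo, hi, active in plan_windows(events):
    # Stand-in for the LLM recaptioning step: simply join the active events.
    print(f"{lo:g}s-{hi:g}s: {' and '.join(active)}")
```

Under these assumptions the example prompt splits into three windows: crickets alone from 0s to 2.4s, owl and crickets together from 2.4s to 5.2s, and crickets alone from 5.2s to 24s. Each window would then receive its own refined description before generation.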
URL
https://arxiv.org/abs/2507.08557