Abstract
Contrastive language-audio pretraining~(CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP models struggle to capture temporal information within audio and text features, which substantially limits tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use large language models~(LLMs) and mixup strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. A new temporal-focused contrastive loss is then designed to fine-tune the CLAP model on these synthetic data. We conduct comprehensive experiments and analyses on multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationships of sound events and outperforms state-of-the-art models by a significant margin.
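The abstract does not give the exact form of the temporal-focused loss. The following is a minimal numpy sketch of one plausible instantiation: a standard CLAP-style InfoNCE objective extended with the embedding of a temporally swapped caption (e.g. "A then B" rewritten as "B then A") as an extra hard negative per clip. All names (`temporal_contrastive_loss`, `neg_text_emb`) and the specific formulation are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def temporal_contrastive_loss(audio_emb, text_emb, neg_text_emb, tau=0.07):
    """Hypothetical sketch of a temporal-focused contrastive loss.

    audio_emb, text_emb: (N, D) L2-normalized embeddings of paired
        audio clips and their correct captions.
    neg_text_emb: (N, D) embeddings of temporally swapped captions,
        used as an additional hard negative for each clip (assumption).
    tau: softmax temperature.
    """
    n = audio_emb.shape[0]
    # Similarity of every audio clip against every in-batch caption.
    sim = audio_emb @ text_emb.T / tau                               # (N, N)
    # Similarity of each clip against its own temporally swapped caption.
    neg = np.sum(audio_emb * neg_text_emb, axis=1, keepdims=True) / tau  # (N, 1)
    # Append the hard negative as one extra column of logits.
    logits = np.concatenate([sim, neg], axis=1)                      # (N, N+1)
    # Cross-entropy with the matching caption (column i) as the target.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(n), np.arange(n)])
```

Relative to plain CLAP, the only change in this sketch is the extra logit column: because a swapped caption shares all sound events with the true one and differs only in their order, pushing it away forces the model to encode temporal structure rather than event presence alone.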
URL
https://arxiv.org/abs/2404.17806