Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval

Abstract

This research aims to refine knowledge transfer on audio-image temporal agreement for audio-text cross retrieval. To address the limited availability of paired non-speech audio-text data, learning methods that transfer knowledge acquired from large amounts of paired audio-image data to a shared audio-text representation have been investigated, which suggests that how audio-image co-occurrence is learned matters. Conventional audio-image learning approaches assign a single image, randomly selected from the corresponding video stream, to the entire audio clip and assume that the two co-occur. However, this may not accurately capture the temporal agreement between the target audio and the image, because a single image is only a snapshot of a scene, whereas the target audio changes from moment to moment. To address this problem, we propose two methods for audio-image matching that capture temporal information: (i) Nearest Match, in which an image is selected from multiple time frames based on its similarity with the audio, and (ii) Multiframe Match, in which audio-image pairs from multiple time frames are used. Experimental results show that method (i) improves audio-text retrieval performance by selecting the nearest image that aligns with the audio information and transferring the learned knowledge. In contrast, method (ii) improves audio-image retrieval performance but shows no significant improvement in audio-text retrieval performance. These results indicate that refining audio-image temporal agreement may contribute to better knowledge transfer to audio-text retrieval.
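The two matching strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the encoders, the similarity measure, or how audio is segmented, so the code below assumes precomputed embeddings, cosine similarity, and (for Multiframe Match) one audio segment per video frame at the same time index. The function names `nearest_match` and `multiframe_match` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def nearest_match(audio_emb, frame_embs):
    """Pick the frame most similar to the audio clip.

    audio_emb:  (d,)   embedding of the whole audio clip (assumed precomputed)
    frame_embs: (T, d) embeddings of T candidate frames from the same video
    Returns the index of the best frame and all T cosine similarities.
    """
    a = l2_normalize(audio_emb)
    f = l2_normalize(frame_embs)
    sims = f @ a  # cosine similarity of each frame with the audio, shape (T,)
    return int(np.argmax(sims)), sims


def multiframe_match(audio_seg_embs, frame_embs):
    """Pair audio segments with frames at the same time index.

    Assumption: segment t and frame t cover the same moment of the video.
    Returns the cosine similarity of each time-aligned pair, shape (T,).
    """
    a = l2_normalize(audio_seg_embs)
    f = l2_normalize(frame_embs)
    if a.shape != f.shape:
        raise ValueError("expected one audio segment per frame")
    return np.sum(a * f, axis=-1)


# Toy data: three candidate frames for one audio clip.
audio = np.array([1.0, 0.0])
frames = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
best, sims = nearest_match(audio, frames)  # best == 1 (the second frame)

# Toy data: two time-aligned audio segments and frames.
pair_sims = multiframe_match(np.eye(2), np.eye(2))
```

In a training setup, the frame chosen by `nearest_match` would replace the randomly selected frame as the positive image for the clip, while the per-pair similarities from `multiframe_match` would feed a contrastive objective over all time steps; both uses are assumptions about how such a sketch would be wired in, not details given in the abstract.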

URL

https://arxiv.org/abs/2403.10756

PDF

https://arxiv.org/pdf/2403.10756.pdf
