Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval

2024-03-16 01:38:36
Shunsuke Tsubaki, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Keisuke Imoto

Abstract

The aim of this research is to refine knowledge transfer on audio-image temporal agreement for audio-text cross retrieval. To address the limited availability of paired non-speech audio-text data, learning methods for transferring the knowledge acquired from a large amount of paired audio-image data to shared audio-text representation have been investigated, suggesting the importance of how audio-image co-occurrence is learned. Conventional approaches in audio-image learning assign a single image randomly selected from the corresponding video stream to the entire audio clip, assuming their co-occurrence. However, this method may not accurately capture the temporal agreement between the target audio and image because a single image can only represent a snapshot of a scene, though the target audio changes from moment to moment. To address this problem, we propose two methods for audio and image matching that effectively capture the temporal information: (i) Nearest Match wherein an image is selected from multiple time frames based on similarity with audio, and (ii) Multiframe Match wherein audio and image pairs of multiple time frames are used. Experimental results show that method (i) improves the audio-text retrieval performance by selecting the nearest image that aligns with the audio information and transferring the learned knowledge. Conversely, method (ii) improves the performance of audio-image retrieval while not showing significant improvements in audio-text retrieval performance. These results indicate that refining audio-image temporal agreement may contribute to better knowledge transfer to audio-text retrieval.
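
The two matching strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say which similarity measure is used, how embeddings are computed, or how the audio is divided into time frames, so cosine similarity, the precomputed embedding vectors, and the one-segment-per-frame pairing below are all assumptions, and the function names `cosine`, `nearest_match`, and `multiframe_match` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_match(audio_emb, frame_embs):
    """Nearest Match (sketch): among the image embeddings of several time
    frames, pick the one most similar to the audio embedding.

    Returns the index of the chosen frame and all similarity scores.
    """
    sims = [cosine(audio_emb, frame) for frame in frame_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


def multiframe_match(audio_seg_embs, frame_embs):
    """Multiframe Match (sketch): keep an audio-image pair for every time
    frame instead of a single pair per clip.

    Assumes audio segment t and image frame t are already time-aligned.
    """
    if len(audio_seg_embs) != len(frame_embs):
        raise ValueError("need one audio segment per image frame")
    return list(zip(audio_seg_embs, frame_embs))


if __name__ == "__main__":
    # Toy 2-D embeddings for illustration only.
    audio = [1.0, 0.0]
    frames = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
    index, scores = nearest_match(audio, frames)
    print("selected frame:", index)

    segments = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
    print("training pairs:", len(multiframe_match(segments, frames)))
```

In this sketch, the conventional baseline described in the abstract would correspond to picking a frame index at random rather than by similarity. The selected pair (Nearest Match) or the full set of pairs (Multiframe Match) would then feed whatever audio-image training objective is in use, which the abstract does not specify.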

URL

https://arxiv.org/abs/2403.10756

PDF

https://arxiv.org/pdf/2403.10756.pdf

