Multi-granularity Correspondence Learning from Long-term Noisy Videos

2024-01-30 03:03:26
Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng

Abstract

Existing video-language studies mainly focus on learning from short video clips, leaving long-term temporal dependencies rarely explored because of the prohibitive computational cost of modeling long videos. To address this issue, one feasible solution is to learn the correspondence between video clips and captions, which, however, inevitably encounters the multi-granularity noisy correspondence (MNC) problem. Specifically, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), both of which hinder temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in the video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits potential faulty negative samples in the clip-caption contrast by rectifying the alignment target with the OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. Code is available at this https URL.
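
The abstract names three ingredients: a soft-maximum operator over frame-word similarities, an alignable prompt bucket that absorbs irrelevant clips and captions, and an optimal transport assignment that realigns asynchronous pairs. The sketch below is only an illustration of those ideas in plain Python, not the authors' implementation. It assumes that the soft maximum is a log-sum-exp, that the bucket is one extra row and column filled with a constant p, and that the transport plan is computed by Sinkhorn iterations with uniform marginals. All function names and parameters (soft_maximum, clip_caption_similarity, add_bucket, sinkhorn, alpha, epsilon, p) and the toy numbers are hypothetical.

```python
import math


def soft_maximum(values, alpha=10.0):
    """Smooth log-sum-exp approximation of max(values); larger alpha is sharper."""
    peak = max(values)  # subtract the peak for numerical stability
    return peak + math.log(sum(math.exp(alpha * (v - peak)) for v in values)) / alpha


def clip_caption_similarity(frame_word_sim, alpha=10.0):
    """Average, over words, of the soft maximum over frames (assumed pooling)."""
    n_words = len(frame_word_sim[0])
    per_word = [soft_maximum([row[w] for row in frame_word_sim], alpha)
                for w in range(n_words)]
    return sum(per_word) / n_words


def add_bucket(similarity, p):
    """Append one extra row and column filled with the constant p (assumed form)."""
    padded = [list(row) + [p] for row in similarity]
    padded.append([p] * (len(similarity[0]) + 1))
    return padded


def sinkhorn(similarity, epsilon=0.1, n_iters=200):
    """Entropic optimal transport with uniform marginals; returns the plan."""
    n, m = len(similarity), len(similarity[0])
    # Gibbs kernel: a higher similarity means a lower transport cost.
    kernel = [[math.exp(s / epsilon) for s in row] for row in similarity]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy clip-caption similarities (made up): clip 0 matches caption 1, clip 1
# matches caption 0 (asynchronous pairs), and clip 2 matches nothing well.
sim = [[0.1, 0.9, 0.0],
       [0.8, 0.2, 0.1],
       [0.0, 0.1, 0.05]]
plan = sinkhorn(add_bucket(sim, p=0.3))
for clip, row in enumerate(plan[:3]):
    best = max(range(len(row)), key=row.__getitem__)
    target = "bucket" if best == 3 else f"caption {best}"
    print(f"clip {clip} -> {target}")
```

In this toy setup the bucket value p acts as a relevance threshold: a clip whose best similarity falls below p should send most of its mass to the bucket column, which is one way to read "filtering out" irrelevant clips. The choice of p and epsilon here is arbitrary and would need tuning on real features.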

URL

https://arxiv.org/abs/2401.16702

PDF

https://arxiv.org/pdf/2401.16702.pdf

