
VideoCon: Robust Video-Language Alignment via Contrast Captions

2023-11-15 19:51:57
Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, Aditya Grover

Abstract

Despite being (pre)trained on massive amounts of data, state-of-the-art video-language alignment models are not robust to semantically plausible contrastive changes in video captions. Our work addresses this by identifying a broad spectrum of contrast misalignments that alignment models should be robust against, such as replaced entities and actions and flipped event order. To this end, we introduce VideoCon, a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations of the differences between original and contrast captions. A generative video-language model is then finetuned with VideoCon to assess video-language entailment and to generate explanations. Our VideoCon-based alignment model significantly outperforms current models, exhibiting a 12-point increase in AUC on the video-language alignment task with human-generated contrast captions. Finally, our model sets new state-of-the-art zero-shot performance on temporally extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard). Moreover, our model shows superior performance on novel videos and human-crafted captions and explanations. Our code and data are available at this https URL.
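
To make the setup concrete, below is a minimal sketch (not the authors' released code) of how a VideoCon-style record and the alignment-AUC evaluation could look in Python. VideoConExample, score_alignment, and alignment_auc are hypothetical names introduced here for illustration; score_alignment stands in for whatever model maps a (video, caption) pair to an entailment probability.

# A minimal sketch, assuming a VideoCon-style record of one original
# caption, one LLM-generated contrast caption, and an explanation.
from dataclasses import dataclass
from sklearn.metrics import roc_auc_score

@dataclass
class VideoConExample:
    video_path: str        # path or URL of the source video
    caption: str           # original (entailed) caption
    contrast_caption: str  # LLM-generated misaligned caption
    misalignment: str      # e.g. "entity", "action", "event order flip"
    explanation: str       # natural-language difference between captions

def score_alignment(video_path: str, caption: str) -> float:
    """Hypothetical model hook: returns P(caption is entailed by video)."""
    raise NotImplementedError

def alignment_auc(examples: list[VideoConExample]) -> float:
    # Each example yields one positive (original) and one negative
    # (contrast) video-caption pair; ROC-AUC measures how well the
    # model ranks positives above negatives.
    labels, scores = [], []
    for ex in examples:
        labels += [1, 0]
        scores += [
            score_alignment(ex.video_path, ex.caption),
            score_alignment(ex.video_path, ex.contrast_caption),
        ]
    return roc_auc_score(labels, scores)

AUC is a ranking metric here: for each video, the model should score the original caption above its contrast counterpart, which is exactly the robustness property the dataset probes.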

URL

https://arxiv.org/abs/2311.10111

PDF

https://arxiv.org/pdf/2311.10111.pdf

