STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

2023-02-20 03:13:45
Weihong Zhong, Mao Zheng, Duyu Tang, Xuan Luo, Heng Gong, Xiaocheng Feng, Bing Qin

Abstract

Although large-scale video-language pre-training models, which usually build a global alignment between the video and the text, have achieved remarkable progress on various downstream tasks, the idea of adopting fine-grained information during the pre-training stage remains underexplored. In this work, we propose STOA-VLP, a pre-training framework that jointly models object and action information across spatial and temporal dimensions. More specifically, the model regards object trajectories across frames and multiple action features from the video as fine-grained features. In addition, we design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model. The first is dynamic object-text alignment, which builds a better connection between object trajectories and the relevant noun tokens. The second is spatial-temporal action set prediction, which guides the model to generate consistent action features by predicting the actions found in the text. Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of our proposed STOA-VLP (e.g., a 3.7 ROUGE-L improvement on the MSR-VTT video captioning benchmark and a 2.9% accuracy improvement on the MSVD video question answering benchmark over previous approaches).
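
The abstract only sketches the two auxiliary objectives, so a minimal illustration of the first may help. Below is a hedged PyTorch sketch of how a dynamic object-text alignment loss could pair object-trajectory features with noun-token embeddings. The function name, the soft-assignment step, and the InfoNCE-style batch loss are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dynamic_object_text_alignment(obj_traj_feats, noun_feats, temperature=0.07):
    """Hypothetical sketch of a dynamic object-text alignment loss.

    obj_traj_feats: (B, K, D) features of K object trajectories per video.
    noun_feats:     (B, N, D) embeddings of N noun tokens from the paired text.
    Each noun is softly assigned to its most similar trajectory, and a
    contrastive loss over the batch pulls matched pairs together
    (an assumption; the paper's actual loss may differ).
    """
    obj = F.normalize(obj_traj_feats, dim=-1)
    nouns = F.normalize(noun_feats, dim=-1)

    # Similarity of every noun to every trajectory: (B, N, K)
    sim = torch.einsum("bnd,bkd->bnk", nouns, obj)

    # Soft assignment of nouns to trajectories, then gather the matched
    # trajectory feature for each noun: (B, N, D)
    assign = sim.softmax(dim=-1)
    matched = torch.einsum("bnk,bkd->bnd", assign, obj)

    # Pool nouns and their matched trajectories, then apply an
    # InfoNCE-style loss across the batch.
    noun_pool = nouns.mean(dim=1)     # (B, D)
    traj_pool = matched.mean(dim=1)   # (B, D)
    logits = noun_pool @ traj_pool.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random features: batch of 4, 8 trajectories, 5 nouns, dim 256.
loss = dynamic_object_text_alignment(torch.randn(4, 8, 256), torch.randn(4, 5, 256))
```

The soft assignment lets each noun attend to whichever trajectory it matches best across frames, which is one plausible reading of "dynamic" alignment; the paper's actual matching scheme may differ.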

URL

https://arxiv.org/abs/2302.09736

PDF

https://arxiv.org/pdf/2302.09736.pdf

