Paper Reading AI Learner

Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning

2019-06-11 03:35:25
Junchao Zhang, Yuxin Peng

Abstract

Video captioning aims to automatically generate natural language descriptions of video content, and has drawn much attention in recent years. Generating accurate and fine-grained captions requires not only understanding the global content of a video, but also capturing detailed object information. Meanwhile, video representations have a great impact on the quality of the generated captions. Thus, it is important for video captioning to capture salient objects with their detailed temporal dynamics, and to represent them using discriminative spatio-temporal representations. In this paper, we propose a new video captioning approach based on object-aware aggregation with a bidirectional temporal graph (OA-BTG), which captures detailed temporal dynamics for the salient objects in a video, and learns discriminative spatio-temporal representations by performing object-aware local feature aggregation on detected object regions. The main novelties and advantages are: (1) Bidirectional temporal graph: A bidirectional temporal graph is constructed both along and reversely along the temporal order, providing complementary ways to capture the temporal trajectory of each salient object. (2) Object-aware aggregation: Learnable VLAD (Vector of Locally Aggregated Descriptors) models are constructed on the object temporal trajectories and the global frame sequence, and perform object-aware aggregation to learn discriminative representations. A hierarchical attention mechanism is also developed to distinguish the different contributions of multiple objects. Experiments on two widely-used datasets demonstrate that our OA-BTG achieves state-of-the-art performance in terms of the BLEU@4, METEOR, and CIDEr metrics.
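The abstract's "learnable VLAD" component refers to soft-assignment VLAD aggregation (in the style of NetVLAD): local descriptors from object regions or frames are softly assigned to a set of learnable cluster centers, and the assignment-weighted residuals are pooled into a fixed-length representation. The following is a minimal NumPy sketch of that aggregation step only, not the paper's implementation; all function names, parameter shapes, and the normalization details are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vlad_aggregate(descriptors, centers, assign_w, assign_b):
    """Soft-assignment VLAD aggregation (NetVLAD-style sketch).

    descriptors: (N, D) local features, e.g. from detected object regions
    centers:     (K, D) cluster centers (learnable in the real model)
    assign_w:    (D, K) soft-assignment weights (learnable)
    assign_b:    (K,)   soft-assignment bias (learnable)
    Returns a (K*D,) L2-normalized aggregated descriptor.
    """
    # Soft assignment of each descriptor to each of the K clusters
    a = softmax(descriptors @ assign_w + assign_b, axis=1)       # (N, K)
    # Residuals of each descriptor w.r.t. each center
    residuals = descriptors[:, None, :] - centers[None, :, :]    # (N, K, D)
    # Assignment-weighted sum of residuals per cluster
    vlad = (a[:, :, None] * residuals).sum(axis=0)               # (K, D)
    # Intra-normalization per cluster, then global L2 normalization
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
N, D, K = 8, 16, 4  # toy sizes: 8 region descriptors, 16-dim, 4 clusters
feats = rng.normal(size=(N, D))
out = vlad_aggregate(feats, rng.normal(size=(K, D)),
                     rng.normal(size=(D, K)), rng.normal(size=K))
print(out.shape)  # (K*D,) = (64,)
```

In the paper's setting, this aggregation would be applied once per object trajectory (along both directions of the bidirectional temporal graph) and once to the global frame sequence, with the centers and assignment parameters trained end-to-end rather than fixed as here.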

URL

https://arxiv.org/abs/1906.04375

PDF

https://arxiv.org/pdf/1906.04375.pdf
