Abstract
Video captioning aims to automatically generate natural language descriptions of video content, and has drawn considerable attention in recent years. Generating accurate and fine-grained captions requires not only understanding the global content of a video, but also capturing detailed object information. Meanwhile, video representations have a great impact on the quality of the generated captions. Thus, it is important for video captioning to capture salient objects along with their detailed temporal dynamics, and to represent them with discriminative spatio-temporal representations. In this paper, we propose a new video captioning approach based on object-aware aggregation with a bidirectional temporal graph (OA-BTG), which captures detailed temporal dynamics for the salient objects in a video, and learns discriminative spatio-temporal representations by performing object-aware local feature aggregation on detected object regions. The main novelties and advantages are: (1) Bidirectional temporal graph: a bidirectional temporal graph is constructed both along and reversely along the temporal order, which provides complementary ways to capture the temporal trajectory of each salient object. (2) Object-aware aggregation: learnable VLAD (Vector of Locally Aggregated Descriptors) models are constructed on the object temporal trajectories and the global frame sequence, and perform object-aware aggregation to learn discriminative representations. A hierarchical attention mechanism is also developed to distinguish the different contributions of multiple objects. Experiments on two widely used datasets demonstrate that our OA-BTG achieves state-of-the-art performance in terms of the BLEU@4, METEOR, and CIDEr metrics.
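To make the "learnable VLAD" idea concrete, the sketch below shows a soft-assignment VLAD aggregation in NumPy, in the spirit of NetVLAD-style layers. This is only a minimal illustration of the aggregation principle, not the authors' implementation: the function name `soft_vlad`, the scaling factor `alpha`, and the toy feature/center shapes are all assumptions for the example; in the paper the centers would be learned parameters and the descriptors would come from detected object regions or frames.

```python
import numpy as np

def soft_vlad(descriptors, centers, alpha=10.0):
    """Soft-assignment VLAD aggregation (NetVLAD-style sketch).

    descriptors: (N, D) local features (e.g., object-region features)
    centers:     (K, D) cluster centers (learnable in a real model)
    Returns a flattened, L2-normalized (K*D,) representation.
    """
    # Soft-assignment weights: softmax over negative scaled squared distances.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)            # (N, K), rows sum to 1

    # Weighted sum of residuals between descriptors and each center.
    resid = descriptors[:, None, :] - centers[None, :, :]  # (N, K, D)
    vlad = (a[:, :, None] * resid).sum(axis=0)             # (K, D)

    # Intra-normalization per cluster, then global L2 normalization.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))    # 8 local descriptors, 16-dim each
centers = rng.normal(size=(4, 16))  # K = 4 centers
v = soft_vlad(feats, centers)
print(v.shape)  # (64,)
```

Because the assignment is a differentiable softmax rather than a hard nearest-center choice, the centers can be trained end-to-end with the captioning objective, which is what makes such a VLAD model "learnable".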
URL
https://arxiv.org/abs/1906.04375