Abstract
Automatic video captioning is challenging due to the complex interactions in dynamic real-world scenes. A comprehensive system would ultimately localize and track the objects, actions, and interactions present in a video, and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map directly from raw video data to a high-level textual description, bypassing localization and recognition and thus discarding information that is potentially valuable for grounding content and for generalization. In this work we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory (LSTM). The resulting system is demonstrated to produce state-of-the-art results on the standard YouTube captioning benchmark, while also offering the advantage of localizing the visual concepts (subjects, verbs, objects) in space and time, with no grounding supervision.
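To make the described architecture concrete, below is a minimal sketch of a spatio-temporal attention module feeding an LSTM caption decoder, in the spirit of the abstract. The module names, dimensions, and the particular single-layer additive attention form are illustrative assumptions, not the authors' exact architecture; the per-frame CNN feature maps are assumed to be precomputed. The attention weights (alpha over regions, beta over frames) are what would localize visual concepts in space and time without grounding supervision.

    # Hypothetical sketch, not the paper's implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatioTemporalAttention(nn.Module):
        """Attends over spatial regions within each frame, then over
        frames, conditioned on the decoder LSTM's hidden state."""

        def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
            super().__init__()
            self.spatial_score = nn.Linear(feat_dim + hidden_dim, attn_dim)
            self.spatial_out = nn.Linear(attn_dim, 1)
            self.temporal_score = nn.Linear(feat_dim + hidden_dim, attn_dim)
            self.temporal_out = nn.Linear(attn_dim, 1)

        def forward(self, feats: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
            # feats: (B, T, R, D) -- T frames, R spatial regions, D channels
            # h:     (B, H)       -- current decoder hidden state
            B, T, R, D = feats.shape
            h_sp = h[:, None, None, :].expand(B, T, R, -1)
            # Spatial attention: weight regions within each frame.
            s = self.spatial_out(torch.tanh(
                self.spatial_score(torch.cat([feats, h_sp], dim=-1)))).squeeze(-1)
            alpha = F.softmax(s, dim=2)                             # (B, T, R)
            frame_feats = (alpha.unsqueeze(-1) * feats).sum(dim=2)  # (B, T, D)
            # Temporal attention: weight frames.
            h_tm = h[:, None, :].expand(B, T, -1)
            t = self.temporal_out(torch.tanh(
                self.temporal_score(torch.cat([frame_feats, h_tm], dim=-1)))).squeeze(-1)
            beta = F.softmax(t, dim=1)                              # (B, T)
            return (beta.unsqueeze(-1) * frame_feats).sum(dim=1)   # (B, D)

    class CaptionDecoder(nn.Module):
        """LSTM decoder consuming the attended video context at each step."""

        def __init__(self, vocab_size: int, feat_dim: int = 512,
                     hidden_dim: int = 512, embed_dim: int = 256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.attend = SpatioTemporalAttention(feat_dim, hidden_dim)
            self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
            B, L = tokens.shape
            h = feats.new_zeros(B, self.lstm.hidden_size)
            c = feats.new_zeros(B, self.lstm.hidden_size)
            logits = []
            for step in range(L):
                ctx = self.attend(feats, h)   # attended (grounded) video context
                x = torch.cat([self.embed(tokens[:, step]), ctx], dim=-1)
                h, c = self.lstm(x, (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)  # (B, L, vocab_size)

    if __name__ == "__main__":
        feats = torch.randn(2, 8, 49, 512)    # 2 clips, 8 frames, 7x7 regions
        tokens = torch.randint(0, 1000, (2, 5))
        print(CaptionDecoder(vocab_size=1000)(feats, tokens).shape)  # (2, 5, 1000)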
URL
https://arxiv.org/abs/1610.04997