Paper Reading AI Learner

Spatio-Temporal Attention Models for Grounded Video Captioning

2016-10-18 08:27:23
Mihai Zanfir, Elisabeta Marinoiu, Cristian Sminchisescu

Abstract

Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video data to a high-level textual description, bypassing localization and recognition and thus discarding information that is potentially valuable for content localization and generalization. In this work we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory. The resulting system is demonstrated to produce state-of-the-art results on the standard YouTube captioning benchmark while also offering the advantage of localizing the visual concepts (subjects, verbs, objects) over space and time, with no grounding supervision.
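The core mechanism the abstract describes is a soft spatio-temporal attention that, at each decoding step, weights frame-region features before feeding a summarized context vector to an LSTM decoder. Below is a minimal NumPy sketch of such an attention step, assuming additive (Bahdanau-style) scoring; the function name, weight matrices, and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def spatio_temporal_attention(features, hidden, W_f, W_h, v):
    """One soft-attention step over video features (illustrative sketch).

    features: (T, R, D) array -- T frames, R spatial regions, D-dim features
    hidden:   (D,) decoder hidden state from the previous step
    W_f, W_h: (D, A) projection matrices into an attention space of size A
    v:        (A,) scoring vector
    Returns the context vector (D,) and attention weights (T, R).
    """
    T, R, D = features.shape
    flat = features.reshape(T * R, D)              # attend jointly over space and time
    scores = np.tanh(flat @ W_f + hidden @ W_h) @ v  # (T*R,) additive scores
    alpha = softmax(scores)                        # normalized over all (frame, region) pairs
    context = alpha @ flat                         # (D,) weighted feature summary
    return context, alpha.reshape(T, R)

# Usage with random features; the argmax of alpha indicates which
# (frame, region) pair the decoder attends to -- the basis for grounding.
rng = np.random.default_rng(0)
T, R, D, A = 4, 6, 8, 5
context, alpha = spatio_temporal_attention(
    rng.normal(size=(T, R, D)), rng.normal(size=D),
    rng.normal(size=(D, A)), rng.normal(size=(D, A)), rng.normal(size=A))
```

Because the weights `alpha` are normalized over all frame-region pairs jointly, reading off the high-weight entries localizes a generated word in both space and time without any grounding supervision, which is the property the abstract highlights.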

URL

https://arxiv.org/abs/1610.04997

PDF

https://arxiv.org/pdf/1610.04997.pdf
