Video Captioning with Transferred Semantic Attributes

2016-11-23 07:59:59
Yingwei Pan, Ting Yao, Houqiang Li, Tao Mei

Abstract

Automatically generating natural language descriptions of videos is a fundamental challenge for the computer vision community. Most recent progress on this problem has been achieved by employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) to encode video content and Recurrent Neural Networks (RNNs) to decode a sentence. In this paper, we present Long Short-Term Memory with Transferred Semantic Attributes (LSTM-TSA), a novel deep architecture that incorporates semantic attributes learnt from images and videos into the CNN-plus-RNN framework, trained in an end-to-end manner. The design of LSTM-TSA is motivated by two observations: 1) semantic attributes contribute significantly to captioning, and 2) images and videos carry complementary semantics and can thus reinforce each other for captioning. To boost video captioning, we propose a novel transfer unit to model the mutually correlated attributes learnt from images and videos. Extensive experiments are conducted on three public datasets, i.e., MSVD, M-VAD and MPII-MD. Our proposed LSTM-TSA achieves the best published performance to date in sentence generation on MSVD: 52.8% BLEU@4 and 74.0% CIDEr-D. Superior results compared to state-of-the-art methods are also reported on M-VAD and MPII-MD.
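
The abstract only names the components, so the sketch below shows one plausible way they could fit together in PyTorch. It is a minimal sketch, not the paper's published equations: the gating form of the transfer unit, all layer sizes, and the names `TransferUnit` and `CaptionDecoder` are illustrative assumptions. Here the transfer unit is modeled as a per-attribute sigmoid gate that mixes image-derived and video-derived attribute vectors before they initialize the LSTM decoder.

```python
import torch
import torch.nn as nn

class TransferUnit(nn.Module):
    """Fuses image and video attribute vectors with a learned gate (assumed form)."""
    def __init__(self, num_attrs: int):
        super().__init__()
        self.gate = nn.Linear(2 * num_attrs, num_attrs)

    def forward(self, attrs_img, attrs_vid):
        # Per-attribute gate: how much to trust the image-derived attribute
        # versus the video-derived one.
        b = torch.sigmoid(self.gate(torch.cat([attrs_img, attrs_vid], dim=-1)))
        return b * attrs_img + (1.0 - b) * attrs_vid

class CaptionDecoder(nn.Module):
    """CNN video features + transferred attributes -> LSTM -> word logits."""
    def __init__(self, feat_dim, num_attrs, vocab_size, hidden=512, embed=300):
        super().__init__()
        self.transfer = TransferUnit(num_attrs)
        self.init_h = nn.Linear(feat_dim + num_attrs, hidden)
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, video_feat, attrs_img, attrs_vid, captions):
        # Fuse the two attribute sources, then use them together with the
        # video feature to initialize the LSTM hidden state.
        attrs = self.transfer(attrs_img, attrs_vid)
        h0 = torch.tanh(self.init_h(torch.cat([video_feat, attrs], dim=-1)))
        h0 = h0.unsqueeze(0)                 # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        x = self.embed(captions)             # (batch, T, embed)
        out, _ = self.lstm(x, (h0, c0))
        return self.out(out)                 # (batch, T, vocab) word logits

# Example with hypothetical shapes: 2048-d video features, 1000 attributes,
# a 10k-word vocabulary, and captions of length 12.
dec = CaptionDecoder(feat_dim=2048, num_attrs=1000, vocab_size=10000)
logits = dec(torch.randn(4, 2048), torch.rand(4, 1000), torch.rand(4, 1000),
             torch.randint(0, 10000, (4, 12)))
```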

URL

https://arxiv.org/abs/1611.07675

PDF

https://arxiv.org/pdf/1611.07675.pdf

