Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation

2016-08-17 13:30:06
Rakshith Shetty, Jorma Laaksonen

Abstract

We present our submission to the Microsoft Video to Language Challenge: generating short captions that describe videos in the challenge dataset. Our model is based on the encoder-decoder pipeline popular in image and video captioning systems. We propose to utilize two different kinds of video features: one captures the video content in terms of objects and attributes, and the other captures motion and action information. Using these diverse features, we train models specializing in two separate input sub-domains. We then train an evaluator model that picks the best caption from the pool of candidates generated by these domain-expert models. We argue that, due to the diversity of the dataset, this approach is better suited to the current video captioning task than using a single model. The efficacy of our method is demonstrated by the fact that it was rated best in the MSR Video to Language Challenge according to human evaluation; additionally, it was ranked second on the automatic evaluation metrics.
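The two-stage pipeline described in the abstract (domain-expert generators feeding a candidate pool, arbitrated by a trained evaluator) can be sketched in a few lines. The following is a minimal, hypothetical Python sketch, not the authors' implementation: every function name and the hash-based caption "embedding" are illustrative assumptions standing in for the paper's encoder-decoder LSTMs and learned evaluator.

    import numpy as np

    # Hypothetical stand-ins for the two domain-expert caption generators:
    # one conditioned on frame-level object/attribute features, the other
    # on segment-level motion features. In the paper these would be
    # encoder-decoder models producing candidates via beam search.
    def object_expert_captions(frame_features):
        return ["a man is playing a guitar", "a person is playing music"]

    def motion_expert_captions(segment_features):
        return ["a man is strumming a guitar", "someone performs on stage"]

    def evaluator_score(video_features, caption):
        # Hypothetical evaluator: embeds the caption into the same space
        # as the video features and returns their similarity. The
        # hash-seeded random projection below merely fakes a learned
        # caption embedding for the sake of a runnable example.
        rng = np.random.default_rng(abs(hash(caption)) % (2**32))
        caption_embedding = rng.standard_normal(video_features.shape[0])
        return float(np.dot(video_features, caption_embedding))

    def caption_video(frame_features, segment_features):
        # Pool candidates from both domain experts, then let the
        # evaluator pick the single best caption for this video.
        pool = (object_expert_captions(frame_features)
                + motion_expert_captions(segment_features))
        fused = np.concatenate([frame_features, segment_features])
        return max(pool, key=lambda c: evaluator_score(fused, c))

    if __name__ == "__main__":
        frame_feats = np.random.standard_normal(8)    # toy frame-level features
        segment_feats = np.random.standard_normal(8)  # toy segment-level features
        print(caption_video(frame_feats, segment_feats))

The design point this sketch illustrates is the separation of concerns: each expert can specialize in its own feature view of the video, while the evaluator, trained to rank captions against video content, arbitrates between their outputs instead of forcing one model to cover the whole input distribution.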

URL

https://arxiv.org/abs/1608.04959

PDF

https://arxiv.org/pdf/1608.04959.pdf

