Paper Reading AI Learner

Evaluation of Automatic Video Captioning Using Direct Assessment

2017-10-29 09:37:02
Yvette Graham, George Awad, Alan Smeaton

Abstract

We present Direct Assessment, a method for manually assessing the quality of automatically-generated captions for video. Evaluating the accuracy of video captions is particularly difficult because for any given video clip there is no definitive ground truth or correct answer against which to measure. Automatic metrics such as BLEU and METEOR, drawn from techniques used in evaluating machine translation, were used in the TRECVid video captioning task in 2016 to compare automatic video captions against a manual caption, but these are shown to have weaknesses. The work presented here brings human assessment into the evaluation by crowdsourcing judgments of how well a caption describes a video. We automatically degrade the quality of some sample captions which are assessed manually, and from this we are able to rate the quality of the human assessors, a factor we take into account in the evaluation. Using data from the TRECVid video-to-text task in 2016, we show that our direct assessment method is replicable and robust and should scale to settings where there are many caption-generation techniques to be evaluated.
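The two mechanisms the abstract describes, filtering assessors via deliberately degraded captions and then combining the surviving judgments, can be sketched as follows. This is a minimal illustration, not the authors' code: the data layout, the reliability rule (originals must outscore their degraded copies on average), and the per-assessor z-score standardization used to make differently-calibrated workers comparable are assumptions about how such a pipeline could look.

```python
# Hypothetical sketch of a Direct Assessment style pipeline (not the paper's code):
# (1) drop unreliable assessors using (original, degraded) control caption pairs;
# (2) z-score each remaining assessor's raw 0-100 scores against their own
#     mean/stdev, then average per captioning system.
from statistics import mean, stdev

# Toy data; all names and numbers are illustrative.
assessors = {
    "worker_a": {
        "system_scores": [("sys1", 80), ("sys2", 60), ("sys1", 90)],
        "control_pairs": [(85, 40), (70, 30)],   # (original, degraded)
    },
    "worker_b": {
        "system_scores": [("sys1", 40), ("sys2", 20), ("sys2", 30)],
        "control_pairs": [(60, 25), (55, 20)],
    },
    "worker_c": {  # unreliable: rates degraded captions as high as originals
        "system_scores": [("sys1", 10), ("sys2", 95)],
        "control_pairs": [(50, 55), (48, 49)],
    },
}

def is_reliable(control_pairs):
    """Keep an assessor only if originals beat degraded copies on average."""
    return mean(o - d for o, d in control_pairs) > 0

def system_scores(assessors):
    """Average per-assessor standardized scores for each captioning system."""
    per_system = {}
    for data in assessors.values():
        if not is_reliable(data["control_pairs"]):
            continue
        raw = [s for _, s in data["system_scores"]]
        mu, sigma = mean(raw), stdev(raw)
        for system, s in data["system_scores"]:
            per_system.setdefault(system, []).append((s - mu) / sigma)
    return {sys_id: mean(vals) for sys_id, vals in per_system.items()}

print(system_scores(assessors))
```

With this toy data, worker_c is filtered out by the control pairs, so sys2's high score from that worker never enters the average; standardization then compares systems on a common scale across the remaining assessors.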


URL

https://arxiv.org/abs/1710.10586

PDF

https://arxiv.org/pdf/1710.10586.pdf

