Abstract
We present Direct Assessment, a method for manually assessing the quality of automatically-generated video captions. Evaluating the accuracy of video captions is particularly difficult because for any given video clip there is no definitive ground truth or correct answer against which to measure. Automatic metrics such as BLEU and METEOR, drawn from techniques used in evaluating machine translation, compare automatically-generated video captions against a manual caption; these were used in the TRECVid video captioning task in 2016 but are shown to have weaknesses. The work presented here brings human assessment into the evaluation by crowdsourcing judgments of how well a caption describes a video. We automatically degrade the quality of some sample captions, which are then assessed manually, and from this we are able to rate the quality of the human assessors, a factor we take into account in the evaluation. Using data from the TRECVid video-to-text task in 2016, we show that our direct assessment method is replicable and robust and should scale to settings where there are many caption-generation techniques to be evaluated.
URL
https://arxiv.org/abs/1710.10586