Abstract
Video captioning aims to describe events in a video with natural language. In recent years, many works have focused on improving captioning models' performance. However, like other text generation tasks, video captioning risks introducing factual errors not supported by the input video. These factual errors can seriously degrade the quality of the generated text, sometimes making it completely unusable. Although factual consistency has received much research attention in text-to-text tasks (e.g., summarization), it is less studied in the context of vision-based text generation. In this work, we conduct a detailed human evaluation of factuality in video captioning and collect two annotated factuality datasets. We find that 57.0% of the model-generated sentences contain factual errors, indicating that factuality is a severe problem in this field. However, existing evaluation metrics are mainly based on n-gram matching and show little correlation with human factuality annotation. We therefore propose FactVC, a weakly supervised, model-based factuality metric that outperforms previous metrics on factuality evaluation of video captioning. The datasets and metrics will be released to promote future research on video captioning.
URL
https://arxiv.org/abs/2303.02961