Paper Reading AI Learner

Models See Hallucinations: Evaluating the Factuality in Video Captioning

2023-03-06 08:32:50
Hui Liu, Xiaojun Wan

Abstract

Video captioning aims to describe events in a video with natural language. In recent years, many works have focused on improving captioning models' performance. However, like other text generation tasks, video captioning risks introducing factual errors that are not supported by the input video. These factual errors can seriously degrade the quality of the generated text, sometimes making it completely unusable. Although factual consistency has received much research attention in text-to-text tasks (e.g., summarization), it is less studied in the context of vision-based text generation. In this work, we conduct a detailed human evaluation of factuality in video captioning and collect two annotated factuality datasets. We find that 57.0% of the model-generated sentences contain factual errors, indicating that this is a severe problem in the field. However, existing evaluation metrics are mainly based on n-gram matching and show little correlation with human factuality annotations. We further propose FactVC, a weakly supervised, model-based factuality metric that outperforms previous metrics on the factuality evaluation of video captioning. The datasets and metrics will be released to promote future research on video captioning.
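The abstract's claim that n-gram metrics correlate poorly with factuality is easy to demonstrate. The sketch below is illustrative, not code from the paper: the example captions and the ngram_precision helper are assumptions, showing how a caption with a hallucinated object can share almost every bigram with the reference while a faithful paraphrase shares almost none.

from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=2):
    # Clipped n-gram precision of a candidate against a single reference,
    # the core quantity behind BLEU-style captioning metrics.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference = "a man is slicing a tomato in the kitchen"

# A hallucinated caption can reuse nearly every bigram of the reference...
hallucinated = "a man is slicing a finger in the kitchen"
# ...while a factually correct paraphrase reuses almost none of them.
paraphrase = "someone cuts up a tomato on a counter"

print(ngram_precision(hallucinated, reference))  # 0.75: high overlap, wrong facts
print(ngram_precision(paraphrase, reference))    # ~0.14: low overlap, correct facts

Because surface-overlap metrics rank the hallucinated caption far above the faithful paraphrase, they are poor proxies for factuality, which is why the paper turns to a model-based metric (FactVC) validated against human annotations instead.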

Abstract (translated)

Video captioning aims to describe the events in a video in natural language. In recent years, much work has focused on improving the performance of captioning models. However, like other text generation tasks, it can introduce factual errors that are not supported by the input video. These factual errors can seriously degrade the quality of the generated text, sometimes rendering it completely unusable. Although factual consistency has received extensive research attention in text-to-text tasks (such as summarization), it is less studied in the context of vision-based text generation. In this work, we conduct a detailed human evaluation of factuality in video captioning and collect two factuality-annotated datasets. We find that 57.0% of model-generated sentences contain factual errors, indicating that this is a severe problem in the field. However, existing evaluation metrics are mainly based on n-gram matching and show little correlation with human factuality annotations. We further propose FactVC, a weakly supervised, model-based factuality metric that outperforms previous metrics on the factuality evaluation of video captioning. The datasets and metrics will be released to promote future research on video captioning.

URL

https://arxiv.org/abs/2303.02961

PDF

https://arxiv.org/pdf/2303.02961.pdf

