Paper Reading AI Learner

VideoMCC: a New Benchmark for Video Comprehension

2017-06-16 19:50:46
Du Tran, Maksim Bolonkin, Manohar Paluri, Lorenzo Torresani

Abstract

While there is overall agreement that future technology for organizing, browsing and searching videos hinges on the development of methods for high-level semantic understanding of video, so far no consensus has been reached on the best way to train and assess models for this task. Casting video understanding as a form of action or event categorization is problematic, as it is not fully clear what the semantic classes or abstractions in this domain should be. Language has been exploited to sidestep the problem of defining video categories, by formulating video understanding as the task of captioning or description. However, language is highly complex, redundant and sometimes ambiguous: many different captions may express the same semantic concept. To account for this ambiguity, quantitative evaluation of video description requires sophisticated metrics, whose performance scores are typically hard for humans to interpret. This paper provides four contributions to this problem. First, we formulate Video Multiple Choice Caption (VideoMCC) as a new, well-defined task with an easy-to-interpret performance measure. Second, we describe a general semi-automatic procedure to create benchmarks for this task. Third, we publicly release a large-scale video benchmark created with an implementation of this procedure, and we include a human study that assesses human performance on our dataset. Finally, we propose and test a varied collection of approaches on this benchmark for the purpose of gaining a better understanding of the new challenges posed by video comprehension.
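The easy-to-interpret performance measure the abstract alludes to is plain multiple-choice accuracy: each video is paired with several candidate captions, the model scores them, and it is judged right or wrong on its top pick. A minimal sketch of such an evaluation loop (the function names and toy data below are illustrative assumptions, not from the paper):

```python
# Hypothetical VideoMCC-style evaluation: each item pairs a video with K
# candidate captions, exactly one of which is correct. The model scores every
# candidate, its argmax is compared to the ground truth, and the final metric
# is plain accuracy -- directly interpretable, unlike BLEU/METEOR-style
# free-form caption metrics.

def evaluate_mcc(score_fn, items):
    """score_fn(video, captions) -> one float per caption (higher = better).
    items: list of (video, captions, correct_index) triples."""
    correct = 0
    for video, captions, answer in items:
        scores = score_fn(video, captions)
        predicted = max(range(len(captions)), key=lambda i: scores[i])
        if predicted == answer:
            correct += 1
    return correct / len(items)

# Toy run with a trivial scorer that prefers longer captions (illustrative only).
toy = [("vid0", ["a dog runs", "a very long wrong caption"], 0),
       ("vid1", ["short", "a person slices a tomato"], 1)]
length_scorer = lambda video, caps: [len(c) for c in caps]
print(evaluate_mcc(length_scorer, toy))  # 0.5: one of the two items correct
```

Because chance performance is simply 1/K for K candidates, accuracy on this task gives an immediate sense of how far a model is above random guessing.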

URL

https://arxiv.org/abs/1606.07373

PDF

https://arxiv.org/pdf/1606.07373.pdf

