Paper Reading AI Learner

AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection

2023-04-12 19:01:21
Wentao Zhu, Yufang Huang, Xiufeng Xie, Wenxian Liu, Jincan Deng, Debing Zhang, Zhangyang Wang, Ji Liu

Abstract

Short-form videos have exploded in popularity and now dominate social media trends. Prevailing short-video platforms, e.g., Kuaishou (Kwai), TikTok, Instagram Reels, and YouTube Shorts, have changed the way we consume and create content. For video content creation and understanding, shot boundary detection (SBD) is one of the most essential components in various scenarios. In this work, we release a new public Short video sHot bOundary deTection dataset, named SHOT, consisting of 853 complete short videos and 11,606 shot annotations, with 2,716 high-quality shot boundary annotations in 200 test videos. Leveraging this new data, we propose to optimize the model design for video SBD by conducting neural architecture search in a search space encapsulating various advanced 3D ConvNets and Transformers. Our proposed approach, named AutoShot, achieves higher F1 scores than previous state-of-the-art approaches, e.g., outperforming TransNetV2 by 4.2% when derived and evaluated on our newly constructed SHOT dataset. Moreover, to validate the generalizability of the AutoShot architecture, we directly evaluate it on three other public datasets, ClipShots, BBC, and RAI, where the F1 scores of AutoShot outperform previous state-of-the-art approaches by 1.1%, 0.9%, and 1.2%, respectively. The SHOT dataset and code can be found at this https URL.
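The results above are reported as F1 scores over detected shot boundaries. The abstract does not spell out the matching protocol, so the following is only a minimal sketch of how such a score could be computed. It assumes that boundaries are given as frame indices and that a prediction counts as correct when it falls within a small frame tolerance of an unmatched annotation. The function name `f1_for_boundaries`, the `tolerance` parameter, and the greedy matching are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Tuple


def f1_for_boundaries(
    predicted: List[int],
    annotated: List[int],
    tolerance: int = 2,  # assumed tolerance in frames; not taken from the paper
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted shot-boundary frames.

    A prediction is a true positive when it lies within `tolerance` frames
    of an annotated boundary that has not already been matched.
    """
    pred = sorted(predicted)
    gold = sorted(annotated)
    i = j = true_pos = 0
    # Greedy one-to-one matching over the two sorted lists.
    while i < len(pred) and j < len(gold):
        if abs(pred[i] - gold[j]) <= tolerance:
            true_pos += 1
            i += 1
            j += 1
        elif pred[i] < gold[j]:
            i += 1  # unmatched prediction: false positive
        else:
            j += 1  # unmatched annotation: false negative
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical frame indices, for illustration only.
p, r, f1 = f1_for_boundaries([10, 50, 120], [11, 52, 90, 121])
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# prints: precision=1.000 recall=0.750 F1=0.857
```

In this toy case all three predictions match an annotation, but the annotated boundary at frame 90 is missed, so recall is 0.75 while precision stays at 1.0.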

URL

https://arxiv.org/abs/2304.06116

PDF

https://arxiv.org/pdf/2304.06116.pdf

