Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval

2023-09-20 06:41:30
Chen Jiang, Kaiming Huang, Sifeng He, Xudong Yang, Wei Zhang, Xiaobo Zhang, Yuan Cheng, Lei Yang, Qing Wang, Furong Xu, Tan Pan, Wei Chu

Abstract

With the explosive growth of web videos in recent years, large-scale Content-Based Video Retrieval (CBVR) has become increasingly essential in video filtering, recommendation, and copyright protection. Segment-level CBVR (S-CBVR) locates the start and end times of similar segments at a finer granularity, which benefits user browsing efficiency and infringement detection, especially in long-video scenarios. The challenge of the S-CBVR task is to achieve high temporal alignment accuracy with efficient computation and low storage consumption. In this paper, we propose a Segment Similarity and Alignment Network (SSAN) to address this challenge; it is the first to be trained end-to-end in S-CBVR. SSAN is based on two modules newly proposed for video retrieval: (1) an efficient Self-supervised Keyframe Extraction (SKE) module that reduces redundant frame features, and (2) a robust Similarity Pattern Detection (SPD) module for temporal alignment. Compared with uniform frame extraction, SKE saves feature storage and search time while achieving comparable accuracy at limited extra computation time. For temporal alignment, SPD localizes similar segments with higher accuracy and efficiency than existing deep learning methods. Furthermore, we jointly train SSAN with SKE and SPD and achieve an end-to-end improvement. The two key modules, SKE and SPD, can also be inserted into other video retrieval pipelines, where they yield considerable performance improvements. Experimental results on public datasets show that SSAN obtains higher alignment accuracy while reducing storage and online query computation cost compared with existing methods.
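
The temporal alignment task described in the abstract can be made concrete with a toy example. The sketch below is not the SPD module from the paper, whose design the abstract does not describe. It is a simple stand-in that assumes each frame is represented by a plain feature vector, builds a frame-to-frame cosine similarity matrix, and reports the longest run of high-similarity entries along a diagonal as the matched segment. The function names and the 0.9 threshold are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(query_feats, ref_feats):
    """sim[i][j] compares query frame i with reference frame j."""
    return [[cosine(q, r) for r in ref_feats] for q in query_feats]


def longest_diagonal_run(sim, threshold=0.9):
    """Find the longest run of entries >= threshold along any diagonal.

    Returns (q_start, q_end, r_start, r_end) as inclusive frame indices,
    or None if no entry reaches the threshold. A diagonal run means the
    two videos play the same content at the same speed (an assumption of
    this toy method, not of the paper).
    """
    n_q = len(sim)
    n_r = len(sim[0]) if sim else 0
    best, best_len = None, 0
    for offset in range(-(n_q - 1), n_r):  # offset = j - i
        i = max(0, -offset)
        j = i + offset
        run_start = None
        while i < n_q and j < n_r:
            if sim[i][j] >= threshold:
                if run_start is None:
                    run_start = i
                run_len = i - run_start + 1
                if run_len > best_len:
                    best_len = run_len
                    best = (run_start, i, run_start + offset, j)
            else:
                run_start = None
            i += 1
            j += 1
    return best


# Toy data: five reference frames with one-hot features, and a query made of
# one unrelated frame followed by a copy of reference frames 2..4.
ref = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]
query = [[1, 1, 1, 1, 1], ref[2], ref[3], ref[4]]

print(longest_diagonal_run(similarity_matrix(query, ref)))  # (1, 3, 2, 4)
```

Frame indices convert to start and end times by multiplying by the sampling interval. Note that the matrix grows with the product of the two frame counts, which is why reducing the number of stored frames, as the abstract says SKE does, lowers both storage and search cost.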

URL

https://arxiv.org/abs/2309.11091

PDF

https://arxiv.org/pdf/2309.11091.pdf

