Paper Reading AI Learner

Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval

2023-09-15 05:31:53
Rui Deng, Qian Wu, Yuke Li, Haoran Fu

Abstract

Optimizing video inference efficiency has become increasingly important with the growing demand for video analysis in various fields. Some existing methods achieve high efficiency by explicitly discarding spatial or temporal information, which poses challenges in fast-changing and fine-grained scenarios. To address these issues, we propose an efficient video representation network with a Differentiable Resolution Compression and Alignment mechanism, which compresses non-essential information in the early stage of the network to reduce computational costs while maintaining consistent temporal correlations. Specifically, we leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features, refining and updating the features into a high-low resolution video sequence. To process the new sequence, we introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features with different resolutions, while reducing spatial computation costs quadratically by utilizing fewer spatial tokens in low-resolution non-saliency frames. The entire network can be optimized end-to-end through the integration of the differentiable compression module. Experimental results show that our method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification compared to state-of-the-art methods. Code: this https URL
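
The quadratic saving mentioned in the abstract can be illustrated with a small back-of-the-envelope sketch. This is not the authors' implementation: the clip length, the 14x14 and 7x7 token grids, the split into 2 salient and 6 non-salient frames, and the helper name spatial_attention_pairs are all assumptions chosen for illustration. The sketch only counts token-pair interactions for per-frame spatial self-attention, which grow with the square of the number of tokens in a frame.

```python
def spatial_attention_pairs(tokens_per_frame):
    """Count token-pair interactions for per-frame spatial self-attention.

    Each frame with n spatial tokens contributes n * n pairs.
    (Hypothetical helper, not taken from the paper's code.)
    """
    return sum(n * n for n in tokens_per_frame)


# Assumed clip: 8 frames, each with a 14x14 grid of tokens (196 tokens).
full = [14 * 14] * 8

# Assumed split: 2 salient frames stay at full resolution, 6 non-salient
# frames are compressed to a 7x7 grid (49 tokens, half the side length).
mixed = [14 * 14] * 2 + [7 * 7] * 6

full_cost = spatial_attention_pairs(full)
mixed_cost = spatial_attention_pairs(mixed)

# Halving the side length gives 4x fewer tokens and 16x fewer pairs per frame.
per_frame_saving = (14 * 14) ** 2 // (7 * 7) ** 2

print(full_cost, mixed_cost, per_frame_saving, round(mixed_cost / full_cost, 3))
```

Under these assumed numbers, each compressed frame needs 16 times fewer pairwise interactions than a full-resolution frame, so the clip-level cost depends mostly on how many frames are kept at high resolution.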

URL

https://arxiv.org/abs/2309.08167

PDF

https://arxiv.org/pdf/2309.08167.pdf

