Paper Reading AI Learner

Efficient Self-Supervised Video Hashing with Selective State Spaces

2024-12-19 04:33:22
Jinpeng Wang, Niu Lian, Jun Li, Yuting Wang, Yan Feng, Bin Chen, Yongbing Zhang, Shu-Tao Xia

Abstract

Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency. Code is available at this https URL.
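The linear-complexity selective-scan idea behind the paper's bidirectional Mamba layers can be illustrated with a toy data-dependent linear recurrence run over the frame sequence in both directions. The sketch below is illustrative only, not the paper's actual parameterization: the decay `a` and input gate `b` (which in Mamba are computed from the input, hence "selective") and the sum-based fusion of the two directions are assumptions.

```python
import numpy as np

def selective_scan(x, a, b):
    """Toy data-dependent linear recurrence h_t = a_t * h_{t-1} + b_t * x_t.

    x, a, b each have shape (T, D) or broadcastable (T, 1).
    Runs in O(T) time, unlike the O(T^2) cost of self-attention.
    """
    h = np.zeros(x.shape[1])
    out = []
    for x_t, a_t, b_t in zip(x, a, b):
        h = a_t * h + b_t * x_t  # state update, one step per frame
        out.append(h.copy())
    return np.stack(out)

def bidirectional_scan(x, a, b):
    """Bidirectional layer: scan frames forward and backward, then fuse
    (here simply by summation) so every time step sees both past and
    future context -- the role bidirectionality plays in the encoder/decoder."""
    fwd = selective_scan(x, a, b)
    bwd = selective_scan(x[::-1], a[::-1], b[::-1])[::-1]
    return fwd + bwd
```

For example, with constant decay `a = 0.5` and gate `b = 1` over a sequence of all-ones frames of length 3, the forward pass produces states 1.0, 1.5, 1.75, and the fused bidirectional output is symmetric in time. In the real model these coefficients are input-dependent, which is what makes the scan "selective".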

URL

https://arxiv.org/abs/2412.14518

PDF

https://arxiv.org/pdf/2412.14518.pdf
