Abstract
Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers dominate SSVH thanks to their strong temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which capture temporal relationships effectively and efficiently thanks to a data-dependent selective-scan mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss that serves as a global learning signal. The resulting self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, its superior transferability, and its scalable advantage in inference efficiency. Code is available at this https URL.
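The abstract describes bidirectional Mamba layers in both the encoder and decoder. The following is a minimal sketch of what such a layer could look like in PyTorch, assuming the `mamba_ssm` package's `Mamba` block; the sum-based fusion of the two scan directions, the residual connection, and the layer-norm placement are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm


class BiMambaLayer(nn.Module):
    """Scans the frame sequence in both directions and fuses the results."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.fwd = Mamba(d_model=d_model, d_state=d_state)  # forward scan
        self.bwd = Mamba(d_model=d_model, d_state=d_state)  # backward scan
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, d_model)
        h_fwd = self.fwd(x)
        # Flip the time axis, scan, then flip back so positions realign.
        h_bwd = self.bwd(x.flip(dims=[1])).flip(dims=[1])
        # Residual connection plus sum fusion of the two directions.
        return self.norm(x + h_fwd + h_bwd)
```

Because each Mamba block runs a selective scan with linear complexity in sequence length, stacking such layers keeps the encoder and decoder linear in the number of frames, in contrast to the quadratic cost of self-attention.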
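The learning strategy turns global semantics into hash centers and aligns each video's code to its center. Below is a hedged sketch of one plausible form of such a center alignment loss; the cosine-similarity formulation, the temperature `tau`, and the `assignments` input (e.g., obtained by clustering global features) are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F


def center_alignment_loss(codes: torch.Tensor,
                          centers: torch.Tensor,
                          assignments: torch.Tensor,
                          tau: float = 0.3) -> torch.Tensor:
    """codes:       (B, L) real-valued hash-like outputs (e.g., after tanh)
    centers:     (K, L) hash centers, e.g., binary codes in {-1, +1}
    assignments: (B,)   index of each video's semantic center
    """
    # Cosine similarity between every code and every center, scaled by tau.
    logits = F.normalize(codes, dim=1) @ F.normalize(centers, dim=1).T / tau
    # Cross-entropy pulls each code toward its assigned center and away
    # from the others, providing a global learning signal.
    return F.cross_entropy(logits, assignments)
```

Used alongside a self-reconstruction term and a local (pairwise) term, a loss of this shape would realize the self-local-global combination the abstract refers to.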
URL
https://arxiv.org/abs/2412.14518