Abstract
Compressing videos into binary codes can improve retrieval speed and reduce storage overhead. However, learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global dependencies between video frames, especially in the absence of labels. Existing self-supervised video hashing methods have been effective in designing expressive temporal encoders, but have not fully exploited the temporal dynamics and spatial appearance of videos because their learning tasks are insufficiently challenging and unreliable. To address these challenges, we begin by utilizing a contrastive learning task to capture global spatio-temporal information of videos for hashing. With the aid of our designed augmentation strategies, which focus on spatial and temporal variations to create positive pairs, the learning framework can generate hash codes that are invariant to motion, scale, and viewpoint. Furthermore, we incorporate two collaborative learning tasks, i.e., frame order verification and scene change regularization, to capture local spatio-temporal details within video frames, thereby enhancing the perception of temporal structure and the modeling of spatio-temporal relationships. Our proposed Contrastive Hashing with Global-Local Spatio-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets. Our code will be released.
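As a rough illustration of the contrastive objective described above, the sketch below computes an NT-Xent-style loss over relaxed (tanh-activated) hash codes of two augmented views of the same videos. This is a minimal assumption-laden sketch, not the paper's actual implementation: the function name, shapes, and temperature value are all hypothetical, and the paper's encoder, augmentations, and auxiliary tasks are omitted.

```python
import numpy as np

def ntxent_loss(z1, z2, tau=0.5):
    """Contrastive (NT-Xent) loss for two augmented views of a batch.

    z1, z2: (N, d) relaxed hash codes (e.g. tanh outputs) for views 1 and 2.
    Each sample's positive is its other view; all remaining 2N - 2
    samples in the batch act as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)                 # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit-normalize
    sim = z @ z.T / tau                                  # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                       # exclude self-similarity
    n = z1.shape[0]
    # positive index for row i: i + n for the first view, i - n for the second
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Minimizing this loss pulls the two views of each video together and pushes different videos apart; when the views are built with motion-, scale-, and viewpoint-varying augmentations (as the abstract describes), the learned codes become invariant to those variations. Binarization (e.g. `sign`) would then yield the final hash codes.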
URL
https://arxiv.org/abs/2310.18926