Paper Reading AI Learner

HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval

2026-01-22 17:57:42
Zequn Xie, Xin Liu, Boyun Zhang, Yuxiao Lin, Sihang Cai, Tao Jin

Abstract

The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.

Abstract (translated)

CLIP模型的成功推动了文本视频检索领域的显著进步。然而,当前的方法往往在“盲”特征交互方面存在问题,即由于文本查询的稀疏性,模型难以从背景噪声中区分出关键视觉信息。为了弥补这一差距,我们借鉴了人类的认知行为,并提出了人眼视图驱动(HVD)模型。我们的框架建立了一个由粗到细的对齐机制,包含两个关键组件:帧特征选择模块(FFSM)和补丁特征压缩模块(PFCM)。FFSM通过选择关键帧来消除时间冗余,模拟了人类宏观感知能力。随后,PFCM通过先进的注意力机制聚合补丁特征以形成显著视觉实体,从而模仿微观感知并实现精确的实体级别匹配。 在五个基准测试中的大量实验表明,HVD不仅能够捕捉到类似人的视觉关注点,还实现了最先进的性能。

URL

https://arxiv.org/abs/2601.16155

PDF

https://arxiv.org/pdf/2601.16155.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot