Paper Reading AI Learner

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection

2025-03-21 12:10:38
Gensheng Pei, Tao Chen, Yujia Wang, Xinhao Cai, Xiangbo Shu, Tianfei Zhou, Yazhou Yao

Abstract

The CLIP model has demonstrated significant advancements in aligning visual and language modalities through large-scale pre-training on image-text pairs, enabling strong zero-shot classification and retrieval capabilities across various domains. However, CLIP's training remains computationally intensive, with high demands on both data processing and memory. To address these challenges, recent masking strategies have emerged, focusing on the selective removal of image patches to improve training efficiency. Although effective, these methods often compromise key semantic information, resulting in suboptimal alignment between visual features and text descriptions. In this work, we present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency while preserving critical semantic content. Our method introduces a gradual masking process in which a small set of candidate patches is first pre-selected as potential mask regions. Then, we apply Sobel edge detection across the entire image to generate an edge mask that prioritizes the retention of primary object areas. Finally, similarity scores between the candidate mask patches and their neighboring patches are computed, with optimal transport normalization refining the selection process to ensure a balanced similarity matrix. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks, and achieves superior performance on robustness and language compositionality benchmarks.
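
Method Sketch

The abstract outlines a three-step masking pipeline: pre-selecting a small set of candidate mask patches, building a Sobel edge mask that protects primary object regions, and scoring candidates against their neighboring patches with an optimal-transport-balanced similarity matrix. The NumPy sketch below is only an illustration of those steps under stated assumptions, not the authors' implementation: the patch size, the candidate and mask ratios, cosine similarity, the 8-neighborhood, Sinkhorn-Knopp as the optimal transport normalization, and the function names (sobel_edge_strength, sinkhorn, pgs_mask) are all hypothetical choices.

import numpy as np

def sobel_edge_strength(gray, patch=16):
    # Per-patch mean Sobel gradient magnitude; high values mark edges of
    # primary objects, so low-scoring patches become mask candidates.
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
    ky = kx.T
    pad = np.pad(gray, 1, mode="edge")
    gx, gy = np.zeros_like(gray), np.zeros_like(gray)
    for i in range(3):
        for j in range(3):
            win = pad[i:i + gray.shape[0], j:j + gray.shape[1]]
            gx += kx[i, j] * win
            gy += ky[i, j] * win
    mag = np.hypot(gx, gy)
    h, w = gray.shape[0] // patch, gray.shape[1] // patch
    mag = mag[:h * patch, :w * patch].reshape(h, patch, w, patch)
    return mag.mean(axis=(1, 3)).ravel()

def sinkhorn(sim, iters=20, eps=1e-8):
    # Sinkhorn-Knopp: alternate row/column normalization so the similarity
    # matrix becomes approximately doubly stochastic ("balanced").
    m = np.exp(sim)
    for _ in range(iters):
        m /= m.sum(axis=1, keepdims=True) + eps
        m /= m.sum(axis=0, keepdims=True) + eps
    return m

def neighbor_score(bal, idx, h, w):
    # Sum of balanced similarity from patch idx to its 8 spatial neighbors.
    r, c = divmod(idx, w)
    total = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr or dc) and 0 <= rr < h and 0 <= cc < w:
                total += bal[idx, rr * w + cc]
    return total

def pgs_mask(gray, patch_feats, patch=16, mask_ratio=0.3, cand_ratio=0.5):
    h, w = gray.shape[0] // patch, gray.shape[1] // patch
    # Step 1: pre-select weak-edge patches as candidate mask regions.
    edge = sobel_edge_strength(gray, patch)
    cand = np.argsort(edge)[: int(h * w * cand_ratio)]
    # Step 2: cosine similarity between patch features, balanced by OT.
    f = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + 1e-8)
    bal = sinkhorn(f @ f.T)
    # Step 3: mask the candidates most similar to their neighbors; they are
    # the most redundant and cost the least semantic information to drop.
    scores = np.array([neighbor_score(bal, i, h, w) for i in cand])
    return cand[np.argsort(scores)[-int(h * w * mask_ratio):]]

# Toy usage: a 224x224 grayscale image with 14x14 = 196 ViT patch features.
rng = np.random.default_rng(0)
image = rng.random((224, 224)).astype(np.float32)
feats = rng.standard_normal((196, 768)).astype(np.float32)
print(pgs_mask(image, feats).shape)  # (58,) patch indices selected for masking

In an actual training loop, the returned indices would be dropped from the vision encoder's input sequence, which is where the efficiency gain comes from; keeping edge-rich patches is what preserves the primary object content.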

URL

https://arxiv.org/abs/2503.17080

PDF

https://arxiv.org/pdf/2503.17080.pdf

