Paper Reading AI Learner

Efficient Feature Distillation for Zero-shot Detection

2023-03-21 19:02:36
Zhuoming Liu, Xuefeng Hu, Ram Nevatia

Abstract

Large-scale vision-language models (e.g., CLIP) have been leveraged by various methods to detect unseen objects. However, most of these works require additional captions or images for training, which is not feasible in the zero-shot detection setting. In contrast, distillation-based methods need no extra data, but they have their own limitations. Specifically, existing work creates distillation regions that are biased toward the base categories, which limits the distillation of novel-category information and harms distillation efficiency. Furthermore, directly using the raw features from CLIP for distillation neglects the domain gap between CLIP's training data and the detection datasets, which makes it difficult to learn the mapping from an image region to the vision-language feature space - an essential component for detecting unseen objects. As a result, existing distillation-based methods require an excessively long training schedule. To solve these problems, we propose Efficient feature distillation for Zero-Shot Detection (EZSD). First, EZSD adapts CLIP's feature space to the target detection domain by re-normalizing CLIP to bridge the domain gap. Second, EZSD uses CLIP to generate distillation proposals containing potential novel instances, so that distillation is not overly biased toward the base categories. Finally, EZSD exploits semantic meaning for regression to further improve model performance. As a result, EZSD achieves state-of-the-art performance on the COCO zero-shot benchmark with a much shorter training schedule and outperforms previous work by 4% in the LVIS overall setting with 1/10 of the training time.
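The core mechanism the abstract describes - distilling CLIP features into a detector's region features - is commonly implemented as an L1 loss between L2-normalized student (detector) and teacher (CLIP) embeddings, as in ViLD-style distillation. The sketch below illustrates that objective in plain Python; the function names and the choice of L1 distance are illustrative assumptions, not EZSD's exact loss.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit L2 norm (guarding against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def distillation_loss(student_feats, teacher_feats):
    """Mean L1 distance between L2-normalized detector region features
    and CLIP embeddings for the same proposals -- a common objective in
    CLIP-distillation detectors; EZSD's precise formulation may differ."""
    assert len(student_feats) == len(teacher_feats)
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        s, t = l2_normalize(s), l2_normalize(t)
        total += sum(abs(a - b) for a, b in zip(s, t))
    return total / len(student_feats)
```

Because both sides are normalized before the distance is taken, the loss only penalizes directional mismatch in the embedding space, which is what matters for the cosine-similarity classification used at inference time.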


URL

https://arxiv.org/abs/2303.12145

PDF

https://arxiv.org/pdf/2303.12145.pdf

