Abstract
Large-scale vision-language models (e.g., CLIP) are leveraged by various methods to detect unseen objects. However, most of these works require additional captions or images for training, which is not feasible in the context of zero-shot detection. In contrast, distillation-based methods require no extra data, but they have their own limitations. Specifically, existing work creates distillation regions that are biased toward the base categories, which limits the distillation of novel-category information and harms distillation efficiency. Furthermore, directly using the raw features from CLIP for distillation neglects the domain gap between CLIP's training data and the detection datasets, which makes it difficult to learn the mapping from an image region to the vision-language feature space, an essential component for detecting unseen objects. As a result, existing distillation-based methods require an excessively long training schedule. To solve these problems, we propose Efficient feature distillation for Zero-Shot Detection (EZSD). First, EZSD adapts CLIP's feature space to the target detection domain by re-normalizing CLIP to bridge the domain gap. Second, EZSD uses CLIP to generate distillation proposals containing potential novel instances, so that distillation is not overly biased toward the base categories. Finally, EZSD exploits semantic meaning for regression to further improve model performance. As a result, EZSD achieves state-of-the-art performance on the COCO zero-shot benchmark with a much shorter training schedule, and outperforms previous work by 4% in the LVIS overall setting with 1/10 of the training time.
URL
https://arxiv.org/abs/2303.12145