Abstract
In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for objects with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr, in which candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes; additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN-type model end-to-end with knowledge distillation, which performs class-agnostic object proposal and classifies semantic categories and attributes with classifiers generated from a text encoder; finally, (iv) we conduct extensive experiments on the VAW, MS-COCO, LSA, and OVAD datasets, showing that recognition of semantic categories and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attribute prediction largely outperforms existing approaches that treat the two tasks independently, demonstrating strong generalization to novel attributes and categories.
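The core open-vocabulary classification step described above (scoring region embeddings against text-encoder embeddings of category or attribute names) can be sketched roughly as follows. This is a minimal NumPy stand-in, not the paper's implementation: the function name, the temperature value, and the use of random features in place of real CLIP visual/text embeddings are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def classify_regions(region_feats, text_feats, temperature=0.01):
    """Score region proposals against text-derived open-vocabulary classifiers.

    region_feats: (R, D) visual embeddings of the proposed boxes
    text_feats:   (C, D) text-encoder embeddings of category/attribute prompts
    temperature:  softmax temperature (illustrative value, not from the paper)
    Returns a (R, C) matrix of probabilities over the vocabulary.
    """
    v = l2_normalize(region_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature            # scaled cosine similarity
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)
```

Because the classifiers are just text embeddings, extending the vocabulary to novel categories or attributes only requires encoding new prompts; no detector retraining is needed, which is what makes the open-vocabulary setting possible.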
URL
https://arxiv.org/abs/2301.09506