Paper Reading AI Learner

OvarNet: Towards Open-vocabulary Object Attribute Recognition

2023-01-23 15:59:29
Keyan Chen, Xiaolong Jiang, Yao Hu, Xu Tang, Yan Gao, Jianqi Chen, Weidi Xie

Abstract

In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for those with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr. The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes; additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN-type model end-to-end with knowledge distillation that performs class-agnostic object proposals and classification on semantic categories and attributes with classifiers generated from a text encoder. Finally, (iv) we conduct extensive experiments on the VAW, MS-COCO, LSA, and OVAD datasets, and show that recognition of semantic categories and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attribute prediction largely outperforms existing approaches that treat the two tasks independently, demonstrating strong generalization ability to novel attributes and categories.
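The two-stage CLIP-Attr pipeline in contribution (i) — offline region proposals scored against text-encoder embeddings of category and attribute prompts — can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the random vectors are toy stand-ins for CLIP image/text encoder outputs, the prompt strings and the 512-dimensional embedding size are assumptions, and the real system uses an RPN to produce the region crops. Categories are scored with a softmax over the open vocabulary, while attributes (a multi-label problem) are scored independently per attribute with a sigmoid.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as CLIP does before matching."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def category_probs(region_embs, text_embs, temperature=0.01):
    """Softmax over cosine similarities: one category per region."""
    sims = l2_normalize(region_embs) @ l2_normalize(text_embs).T
    logits = sims / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def attribute_scores(region_embs, text_embs, scale=10.0):
    """Independent sigmoid per attribute: a region may have many attributes."""
    sims = l2_normalize(region_embs) @ l2_normalize(text_embs).T
    return 1.0 / (1.0 + np.exp(-scale * sims))

# Toy stand-ins for CLIP text-encoder outputs of prompt templates.
rng = np.random.default_rng(0)
categories = ["a photo of a dog", "a photo of a car"]
attributes = ["a photo of a furry object", "a photo of a metallic object"]
cat_embs = rng.standard_normal((len(categories), 512))
attr_embs = rng.standard_normal((len(attributes), 512))

# Two region crops (in the paper these come from an offline RPN,
# then pass through the CLIP image encoder; here they are random).
regions = rng.standard_normal((2, 512))

cat_probs = category_probs(regions, cat_embs)      # rows sum to 1
attr_probs = attribute_scores(regions, attr_embs)  # each entry in (0, 1)
print(cat_probs.shape, attr_probs.shape)
```

Because the classifiers are just text embeddings, swapping in new prompt strings extends the vocabulary at test time with no retraining — which is what makes the setup "open-vocabulary", and what the distilled Faster-RCNN-type student in contribution (iii) inherits from its CLIP teacher.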

URL

https://arxiv.org/abs/2301.09506

PDF

https://arxiv.org/pdf/2301.09506.pdf

