Abstract
All instance perception tasks aim to find certain objects specified by queries such as category names, language expressions, or target annotations, yet this broad field has been split into multiple independent subtasks. In this work, we present UNINEXT, a next-generation universal instance perception model. UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects simply by changing the input prompts. This unified formulation brings the following benefits: (1) enormous data from different tasks and label vocabularies can be exploited to jointly train general instance-level representations, which is especially beneficial for tasks lacking training data; (2) the unified model is parameter-efficient and saves redundant computation when handling multiple tasks simultaneously. UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks, including classical image-level tasks (object detection and instance segmentation), vision-and-language tasks (referring expression comprehension and segmentation), and six video-level object tracking tasks. Code is available at this https URL.
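The core idea of the abstract — one model entry point that dispatches on the prompt type (category names, language expressions, or target annotations) rather than one model per subtask — can be sketched as a minimal interface. This is an illustrative sketch only: the class and function names (`CategoryPrompt`, `LanguagePrompt`, `AnnotationPrompt`, `perceive`) are hypothetical and the bodies are stubs, not the paper's actual architecture.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical prompt types mirroring the three query kinds the abstract names.
@dataclass
class CategoryPrompt:
    """Category names, as in object detection / instance segmentation."""
    names: List[str]

@dataclass
class LanguagePrompt:
    """A free-form expression, as in referring expression comprehension."""
    expression: str

@dataclass
class AnnotationPrompt:
    """A target annotation, e.g. a first-frame box for object tracking."""
    box: Tuple[float, float, float, float]  # (x, y, w, h) on a reference frame

Prompt = Union[CategoryPrompt, LanguagePrompt, AnnotationPrompt]

def perceive(image, prompt: Prompt) -> List[dict]:
    """Unified entry point: discover candidate instances in the image, then
    retrieve those matching the prompt. Both steps are placeholder stubs here;
    in the unified paradigm only the prompt changes across tasks."""
    candidates = [{"box": (0.0, 0.0, 10.0, 10.0), "score": 0.9}]  # stub discovery
    # A real retrieval step would score each candidate against a prompt embedding;
    # this stub returns all candidates unchanged.
    return candidates
```

The point of the sketch is the shared signature: swapping `CategoryPrompt` for `LanguagePrompt` changes the task (detection vs. referring comprehension) without changing the model interface.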
URL
https://arxiv.org/abs/2303.06674