The goal of building a benchmark (a suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate progress in a specific area. Nonetheless, we point out that existing action recognition protocols can yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), which covers a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained by both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observations suggest that the current state of the art cannot reliably guarantee high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark that yields insights for building next-generation spatiotemporal learners. Our dataset, code, and models are released at: this https URL
https://arxiv.org/abs/2303.13505
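For concreteness, a minimal sketch of the few-shot transfer protocol mentioned above: sample K clips per class to build a support split. The dataset layout, field names, and K are assumptions, not details taken from the BEAR paper.

```python
import random
from collections import defaultdict

def make_k_shot_split(samples, k, seed=0):
    """Sample k clips per class from (clip_path, label) pairs.

    `samples` is assumed to be a list of (clip_path, label) tuples;
    the real BEAR splits may be constructed differently.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    support = []
    for label, paths in sorted(by_class.items()):
        chosen = rng.sample(paths, min(k, len(paths)))
        support.extend((p, label) for p in chosen)
    return support

# Example: a toy index with three classes and five clips each.
toy = [(f"clip_{c}_{i}.mp4", c) for c in ("fall", "wave", "run") for i in range(5)]
print(make_k_shot_split(toy, k=2))
```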
Few-shot object detection (FSOD) aims to expand an object detector to novel categories given only a few instances for training. The limited training samples restrict the performance of FSOD models. Recent text-to-image generation models have shown promising results in generating high-quality images, yet how applicable these synthetic images are to FSOD tasks remains under-explored. This work extensively studies how synthetic images generated by state-of-the-art text-to-image generators benefit FSOD tasks. We focus on two questions: (1) How should synthetic data be used for FSOD? (2) How can representative samples be found in a large-scale synthetic dataset? We design a copy-paste-based pipeline for using synthetic data. Specifically, salient object detection is applied to the original generated image, and the minimum enclosing box is used to crop the main object based on the saliency map. After that, the cropped object is randomly pasted onto an image from the base dataset. We also study the influence of the input text of the text-to-image generator and the number of synthetic images used. To construct a representative synthetic training dataset, we maximize the diversity of the selected images via sample-based and cluster-based methods. However, the severe problem of a high false-positive (FP) ratio for novel categories in FSOD cannot be solved by synthetic data alone. We propose integrating CLIP, a zero-shot recognition model, into the FSOD pipeline; it filters out 90% of FPs by thresholding the similarity score between the detected object and the text of the predicted category. Extensive experiments on PASCAL VOC and MS COCO validate the effectiveness of our method, with performance gains of up to 21.9% over the few-shot baseline.
https://arxiv.org/abs/2303.13221
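A rough sketch of the copy-paste step and the CLIP-based false-positive filter described above, operating on precomputed saliency maps and embeddings so no external models are needed; the binarization rule, the paste policy, and the 0.25 threshold are illustrative assumptions.

```python
import numpy as np

def min_enclosing_box(saliency, thresh=0.5):
    """Smallest axis-aligned box covering the saliency map above `thresh`
    (the binarization rule is an assumption)."""
    ys, xs = np.where(saliency > thresh)
    if len(xs) == 0:
        return None
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def copy_paste(synthetic_img, saliency, base_img, rng=np.random):
    """Crop the salient object from a generated image and paste it at a random
    location in a base-dataset image; the pasted box doubles as the new ground truth."""
    box = min_enclosing_box(saliency)
    if box is None:
        return base_img, None
    x0, y0, x1, y1 = box
    crop = synthetic_img[y0:y1, x0:x1]
    H, W = base_img.shape[:2]
    h, w = crop.shape[:2]
    if h > H or w > W:  # skip crops larger than the base image
        return base_img, None
    top = rng.randint(0, H - h + 1)
    left = rng.randint(0, W - w + 1)
    out = base_img.copy()
    out[top:top + h, left:left + w] = crop
    return out, (left, top, left + w, top + h)

def keep_detection(image_emb, text_emb, threshold=0.25):
    """CLIP-based FP filter: keep a detection only if the cosine similarity between its
    crop embedding and the predicted class's text embedding clears a threshold
    (the 0.25 value is illustrative, not the paper's)."""
    sim = float(np.dot(image_emb, text_emb) /
                (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return sim >= threshold
```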
Large language models have demonstrated a surprising ability to perform in-context learning, i.e., these models can be directly applied to numerous downstream tasks by conditioning on a prompt constructed from a few input-output examples. However, prior research has shown that in-context learning can suffer from high instability due to variations in the training examples, their order, and the prompt format. Therefore, constructing an appropriate prompt is essential for improving the performance of in-context learning. In this paper, we revisit this problem from the view of predictive bias. Specifically, we introduce a metric to evaluate the predictive bias of a fixed prompt against labels or a given attribute. We then empirically show that prompts with higher bias consistently lead to unsatisfactory predictive quality. Based on this observation, we propose a novel greedy search strategy to identify a near-optimal prompt for improving the performance of in-context learning. We perform comprehensive experiments with state-of-the-art mainstream models such as GPT-3 on various downstream tasks. Our results indicate that our method can enhance the model's in-context learning performance in an effective and interpretable manner.
https://arxiv.org/abs/2303.13217
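The abstract does not spell out the bias metric, so the sketch below uses one plausible instantiation: the divergence between the prompt's predicted label distribution and the uniform distribution, minimized by a greedy example-selection loop. The function names and the stand-in scoring function are assumptions.

```python
import numpy as np

def predictive_bias(label_probs, num_labels):
    """KL divergence between the prompt's average predicted label distribution and
    the uniform distribution -- one plausible bias metric, not necessarily the paper's."""
    p = np.asarray(label_probs, dtype=float)
    p = p / p.sum()
    u = np.full(num_labels, 1.0 / num_labels)
    return float(np.sum(p * np.log(p / u + 1e-12)))

def greedy_prompt_search(candidates, score_fn, max_len=4):
    """Greedily append the candidate example that minimizes the bias of the prompt so far."""
    prompt, remaining = [], list(candidates)
    while remaining and len(prompt) < max_len:
        best = min(remaining, key=lambda c: score_fn(prompt + [c]))
        prompt.append(best)
        remaining.remove(best)
    return prompt

# Toy usage: score_fn would normally query the LLM; here a stand-in returns a
# fake label distribution for illustration.
def fake_score(examples):
    rng = np.random.default_rng(len(examples))
    return predictive_bias(rng.dirichlet(np.ones(3)), num_labels=3)

print(greedy_prompt_search(["ex1", "ex2", "ex3", "ex4", "ex5"], fake_score, max_len=2))
```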
In Class-Incremental Learning (CIL), an image classification system is exposed to new classes in each learning session and must be updated incrementally. Methods approaching this problem have updated both the classification head and the feature extractor body at each session of CIL. In this work, we develop a baseline method, First Session Adaptation (FSA), that sheds light on the efficacy of existing CIL approaches and allows us to assess the relative performance contributions of head and body adaptation. FSA adapts a pre-trained neural network body only on the first learning session and fixes it thereafter; a head based on linear discriminant analysis (LDA) is then placed on top of the adapted body, allowing exact updates through CIL. FSA is replay-free, i.e., it does not memorize examples from previous sessions of continual learning. To empirically motivate FSA, we first consider a diverse selection of 22 image-classification datasets, evaluating different heads and body-adaptation techniques in high/low-shot offline settings. We find that the LDA head performs well and supports CIL out of the box. We also find that Feature-wise Linear Modulation (FiLM) adapters are highly effective in the few-shot setting, and full-body adaptation in the high-shot setting. Second, we empirically investigate various CIL settings, including high-shot CIL and few-shot CIL, as well as settings that have previously been used in the literature. We show that FSA significantly improves over the state of the art in 15 of the 16 settings considered. FSA with FiLM adapters is especially performant in the few-shot setting. These results indicate that current approaches to continuous body adaptation are not working as expected. Finally, we propose a measure that can be applied to a set of unlabelled inputs and is predictive of the benefits of body adaptation.
https://arxiv.org/abs/2303.13199
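To make the "LDA head with exact updates" concrete, here is a generic streaming-LDA classifier over frozen features that keeps running class means and a pooled scatter matrix; the shrinkage term and estimator details are assumptions and may differ from FSA's head.

```python
import numpy as np

class StreamingLDAHead:
    """Linear-discriminant-analysis head over frozen features.

    Keeps per-class running means and a shared covariance estimate, so new classes or
    new examples can be absorbed exactly without replaying old data. This is a generic
    streaming-LDA sketch; the paper's exact estimator (e.g., shrinkage) may differ.
    """

    def __init__(self, dim, shrinkage=1e-2):
        self.dim = dim
        self.shrinkage = shrinkage
        self.means = {}                      # class id -> running mean
        self.counts = {}                     # class id -> number of examples seen
        self.scatter = np.zeros((dim, dim))  # within-class scatter, accumulated exactly
        self.total = 0

    def partial_fit(self, features, labels):
        for x, y in zip(np.asarray(features), labels):
            n = self.counts.get(y, 0)
            mu = self.means.get(y, np.zeros(self.dim))
            delta = x - mu                              # deviation from the old mean
            mu = mu + delta / (n + 1)
            self.scatter += np.outer(delta, x - mu)     # Welford-style scatter update
            self.means[y], self.counts[y] = mu, n + 1
            self.total += 1

    def predict(self, features):
        cov = self.scatter / max(self.total - len(self.means), 1)
        cov += self.shrinkage * np.eye(self.dim)
        prec = np.linalg.inv(cov)
        classes = sorted(self.means)
        M = np.stack([self.means[c] for c in classes])      # (C, d) class means
        W = M @ prec                                         # linear weights
        b = -0.5 * np.einsum("cd,cd->c", M @ prec, M)        # biases
        scores = np.asarray(features) @ W.T + b
        return [classes[i] for i in scores.argmax(axis=1)]

# Toy usage with random frozen features.
head = StreamingLDAHead(dim=4)
head.partial_fit(np.random.randn(10, 4), [0] * 5 + [1] * 5)
print(head.predict(np.random.randn(3, 4)))
```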
Font design is of vital importance in the digital content design and modern printing industry. Developing algorithms capable of automatically synthesizing vector fonts can significantly facilitate the font design process. However, existing methods mainly concentrate on raster image generation, and only a few approaches can directly synthesize vector fonts. This paper proposes an end-to-end trainable method, VecFontSDF, to reconstruct and synthesize high-quality vector fonts using signed distance functions (SDFs). Specifically, based on the proposed SDF-based implicit shape representation, VecFontSDF learns to model each glyph as shape primitives enclosed by several parabolic curves, which can be precisely converted to quadratic Bézier curves that are widely used in vector font products. In this manner, most image generation methods can be easily extended to synthesize vector fonts. Qualitative and quantitative experiments conducted on a publicly-available dataset demonstrate that our method obtains high-quality results on several tasks, including vector font reconstruction, interpolation, and few-shot vector font synthesis, markedly outperforming the state of the art.
https://arxiv.org/abs/2303.12675
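The parabolic-primitive-to-Bézier conversion relies on a standard geometric fact: a parabolic arc with non-parallel endpoint tangents is exactly the quadratic Bézier whose control point is the intersection of those tangents. Below is a small sketch of that conversion, not the paper's pipeline code.

```python
import numpy as np

def tangent_intersection(p0, d0, p2, d2):
    """Intersect the two endpoint tangent lines p0 + s*d0 and p2 + u*d2.

    For a parabolic arc, this intersection is the control point of the equivalent
    quadratic Bezier segment (assuming the tangents are not parallel).
    """
    A = np.column_stack([d0, -d2])
    s, _ = np.linalg.solve(A, p2 - p0)
    return p0 + s * d0

def quadratic_bezier(p0, p1, p2, t):
    """Evaluate B(t) = (1-t)^2 p0 + 2 t (1-t) p1 + t^2 p2."""
    t = np.asarray(t)[:, None]
    return (1 - t) ** 2 * p0 + 2 * t * (1 - t) * p1 + t ** 2 * p2

# Toy example: a parabolic arc with known endpoint tangents.
p0, p2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
d0, d2 = np.array([1.0, 2.0]), np.array([1.0, -2.0])  # tangent directions at the endpoints
p1 = tangent_intersection(p0, d0, p2, d2)             # -> control point (0.5, 1.0)
print(p1, quadratic_bezier(p0, p1, p2, np.linspace(0, 1, 3)))
```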
Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to limited text-3D data pairs, adapting the success of 2D Vision-Language Models (VLMs) to the 3D space remains an open problem. Existing works that leverage VLMs for 3D understanding generally resort to constructing intermediate 2D representations of the 3D data, at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP^2), which directly learns transferable 3D point cloud representations in realistic scenarios with a novel proxy alignment mechanism. Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios and build well-aligned, instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn semantic- and instance-level aligned point cloud representations. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation transfers well to downstream tasks, including zero-shot and few-shot 3D recognition, where it boosts state-of-the-art methods by large margins. Furthermore, we analyze the capability of different representations in real scenarios and present an optional ensemble scheme.
https://arxiv.org/abs/2303.12417
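A minimal sketch of a cross-modal contrastive objective over aligned text-image-point proxies, using symmetric InfoNCE for each modality pair; the equal weighting of the three terms and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def triplet_proxy_loss(text_emb, image_emb, point_emb):
    """Cross-modal contrastive objective over aligned text/image/point-cloud proxies.

    A minimal sketch: the i-th row of each tensor is assumed to describe the same
    instance; the paper's weighting of the three pairwise terms may differ.
    """
    return (info_nce(point_emb, text_emb)
            + info_nce(point_emb, image_emb)
            + info_nce(image_emb, text_emb)) / 3.0

# Toy usage with random features standing in for the three encoders' outputs.
B, D = 8, 128
loss = triplet_proxy_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(float(loss))
```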
Prompt tuning is a parameter-efficient method that learns soft prompts and conditions frozen language models to perform specific downstream tasks. Though effective, prompt tuning in few-shot settings relies heavily on a good initialization of the soft prompts on the one hand, and can easily result in overfitting on the other. Existing works leverage pre-training or supervised meta-learning to initialize soft prompts, but they cannot generalize data-efficiently to unseen downstream tasks. To address these problems, this paper proposes a novel Self-sUpervised meta-Prompt learning framework with meta-gradient Regularization for few-shot generalization (SUPMER). We first design a set of self-supervised anchor meta-training tasks with different task formats and further enrich the task distribution with curriculum-based task augmentation. A novel meta-gradient regularization method is then integrated into meta-prompt learning: it meta-learns to transform the raw gradients during few-shot learning into a domain-generalizable direction, thus alleviating the problem of overfitting. Extensive experiments show that SUPMER achieves better performance on different few-shot downstream tasks and also exhibits a stronger domain generalization ability.
https://arxiv.org/abs/2303.12314
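One way to realize "meta-learning to transform raw gradients" is a learned transform applied to the soft-prompt gradient inside the inner loop and differentiated through in the outer loop; the element-wise gate below is an illustrative assumption, not SUPMER's actual regularizer.

```python
import torch
import torch.nn as nn

class GradientRegularizer(nn.Module):
    """Meta-learned transform applied to the raw soft-prompt gradient.

    A minimal sketch: an element-wise gate plus bias, optimized in the outer loop so
    that the transformed inner-loop update generalizes across task domains. The actual
    SUPMER regularizer is likely more sophisticated.
    """

    def __init__(self, prompt_shape):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(prompt_shape))
        self.bias = nn.Parameter(torch.zeros(prompt_shape))

    def forward(self, raw_grad):
        return torch.sigmoid(self.gate) * raw_grad + self.bias

def inner_step(prompt, loss_fn, regularizer, lr=0.1):
    """One few-shot inner-loop update with the regularized gradient (create_graph=True
    so the outer loop can differentiate through the transform)."""
    loss = loss_fn(prompt)
    (grad,) = torch.autograd.grad(loss, prompt, create_graph=True)
    return prompt - lr * regularizer(grad)

# Toy usage: a quadratic stand-in "task loss" over a 4x8 soft prompt.
prompt = torch.randn(4, 8, requires_grad=True)
reg = GradientRegularizer((4, 8))
new_prompt = inner_step(prompt, lambda p: (p ** 2).mean(), reg)
print(new_prompt.shape)
```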
In this paper, we consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenarios, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, even categories not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification. We make the following contributions. First, to complement image-text foundation models with temporal motion, we improve class-agnostic action proposal by explicitly aligning embeddings of optical flow, RGB, and text, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., we avoid lexical ambiguities. Specifically, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large-scale language models) or with visually-conditioned instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, which outperforms existing state-of-the-art approaches by a significant margin.
https://arxiv.org/abs/2303.11732
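A sketch of the open-vocabulary classification step: classifier weights are built from text embeddings of per-class action descriptions, and proposals are scored by cosine similarity. The text encoder is passed in as a callable (a frozen CLIP text encoder in the paper's setting); the simple averaging and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def build_open_vocab_classifier(class_descriptions, encode_text):
    """Average the text embeddings of several descriptions per action class.

    `class_descriptions` maps class name -> list of description strings (e.g. generated
    by a large language model); `encode_text` is any text encoder returning one
    embedding per string.
    """
    names, weights = [], []
    for name, descs in class_descriptions.items():
        emb = F.normalize(encode_text(descs), dim=-1).mean(dim=0)
        names.append(name)
        weights.append(F.normalize(emb, dim=-1))
    return names, torch.stack(weights)              # (C, D) classifier matrix

def classify_proposals(proposal_feats, class_weights, temperature=0.01):
    """Cosine-similarity classification of class-agnostic proposal features."""
    feats = F.normalize(proposal_feats, dim=-1)
    return (feats @ class_weights.t() / temperature).softmax(dim=-1)

# Toy usage with a random stand-in for the text encoder.
D = 64
fake_encode = lambda texts: torch.randn(len(texts), D)
names, W = build_open_vocab_classifier(
    {"high jump": ["a person sprints and leaps over a bar"],
     "diving": ["a person jumps off a board into water"]}, fake_encode)
probs = classify_proposals(torch.randn(5, D), W)
print(names, probs.shape)
```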
In this paper, we propose a new challenge of synthesizing novel views in a more practical environment, where the number of input multi-view images is limited and illumination variations are significant. Despite recent success, neural radiance fields (NeRF) require a massive number of input multi-view images taken under constrained illumination. To address the problem, we suggest ExtremeNeRF, which utilizes occlusion-aware multiview albedo consistency, supported by geometric alignment and depth consistency. We extract intrinsic image components that should be illumination-invariant across different views, enabling direct appearance comparison between the input and novel views under unconstrained illumination. We provide extensive experimental results for an evaluation of the task, using the newly built NeRF Extreme benchmark, the first in-the-wild novel view synthesis benchmark captured under multiple viewing directions and varying illumination. The project page is at this https URL
https://arxiv.org/abs/2303.11728
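A simplified sketch of an occlusion-aware multi-view albedo consistency loss, assuming per-pixel correspondences and visibility masks are already available from the estimated geometry; the L1 penalty is an illustrative choice.

```python
import torch

def albedo_consistency_loss(albedo_a, albedo_b, correspondence, visible_mask):
    """Occlusion-aware multi-view albedo consistency (a simplified sketch).

    albedo_a, albedo_b : (H, W, 3) predicted albedo maps for two views.
    correspondence     : (H, W, 2) integer pixel coordinates mapping view A into view B
                         (obtained from the estimated geometry; assumed given here).
    visible_mask       : (H, W) equal to 1 where the reprojected point is not occluded in B.
    """
    ys, xs = correspondence[..., 1], correspondence[..., 0]
    reprojected = albedo_b[ys, xs]                     # albedo of the matching pixels in B
    diff = (albedo_a - reprojected).abs().sum(dim=-1)  # L1 difference per pixel
    mask = visible_mask.float()
    return (diff * mask).sum() / mask.sum().clamp(min=1)

# Toy usage: identical albedo maps with an identity correspondence give zero loss.
H, W = 4, 4
alb = torch.rand(H, W, 3)
grid_y, grid_x = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
corr = torch.stack([grid_x, grid_y], dim=-1)
print(float(albedo_consistency_loss(alb, alb, corr, torch.ones(H, W))))
```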
Today's scene graph generation (SGG) models typically require abundant manual annotations to learn new predicate types. It is therefore difficult to apply them to real-world applications with a long-tailed distribution of predicates. In this paper, we focus on a new, promising task for SGG: few-shot SGG (FSSGG). FSSGG encourages models to quickly transfer previous knowledge and recognize novel predicates well with only a few examples. Although many advanced approaches have achieved great success on few-shot learning (FSL) tasks, straightforwardly extending them to FSSGG does not work well because of two intrinsic characteristics of predicate concepts: 1) each predicate category commonly has multiple semantic meanings under different contexts, and 2) the visual appearance of relation triplets with the same predicate differs greatly across different subject-object pairs. Both issues make it hard to model conventional latent representations for predicate categories with state-of-the-art FSL methods. To this end, we propose a novel Decomposed Prototype Learning (DPL) method. Specifically, we first construct a decomposable prototype space to capture intrinsic visual patterns of subjects and objects for predicates, and enhance their feature representations with these decomposed prototypes. Then, we devise an intelligent metric learner that assigns adaptive weights to each support sample by considering the relevance of its subject-object pair. We further re-split the VG dataset and compare DPL with various FSL methods to benchmark this task. Extensive results show that DPL achieves excellent performance in both base and novel categories.
https://arxiv.org/abs/2303.10863
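A minimal sketch of decomposed prototypes: separate subject and object prototypes per predicate are built from (optionally weighted) support features and compared to a query pair by cosine similarity; the weighting shown is an assumption standing in for the paper's learned metric.

```python
import torch
import torch.nn.functional as F

def decomposed_prototypes(sub_feats, obj_feats, weights=None):
    """Build separate subject/object prototypes for one predicate class.

    sub_feats, obj_feats : (K, D) features of the K support triplets' subjects/objects.
    weights              : optional (K,) adaptive per-sample weights (uniform if None),
                           standing in for the paper's learned metric weights.
    """
    if weights is None:
        weights = torch.full((sub_feats.size(0),), 1.0 / sub_feats.size(0))
    w = (weights / weights.sum()).unsqueeze(-1)
    return (w * sub_feats).sum(0), (w * obj_feats).sum(0)

def predicate_score(q_sub, q_obj, proto_sub, proto_obj):
    """Score a query subject-object pair against a predicate's decomposed prototypes."""
    return 0.5 * (F.cosine_similarity(q_sub, proto_sub, dim=-1)
                  + F.cosine_similarity(q_obj, proto_obj, dim=-1))

# Toy usage: 5 support triplets, one query pair, D = 32.
K, D = 5, 32
ps, po = decomposed_prototypes(torch.randn(K, D), torch.randn(K, D),
                               weights=torch.rand(K))     # illustrative adaptive weights
print(float(predicate_score(torch.randn(D), torch.randn(D), ps, po)))
```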
In real-world scenarios, it may not always be possible to collect hundreds of labeled samples per class for training deep learning-based SAR Automatic Target Recognition (ATR) models. This work specifically tackles the few-shot SAR ATR problem, where only a handful of labeled samples may be available to support the task of interest. Our approach is composed of two stages. In the first, a global representation model is trained via self-supervised learning on a large pool of diverse and unlabeled SAR data. In the second stage, the global model is used as a fixed feature extractor and a classifier is trained to partition the feature space given the few-shot support samples, while simultaneously being calibrated to detect anomalous inputs. Unlike competing approaches which require a pristine labeled dataset for pretraining via meta-learning, our approach learns highly transferable features from unlabeled data that have little-to-no relation to the downstream task. We evaluate our method in standard and extended MSTAR operating conditions and find it to achieve high accuracy and robust out-of-distribution detection in many different few-shot settings. Our results are particularly significant because they show the merit of a global model approach to SAR ATR, which makes minimal assumptions, and provides many axes for extendability.
https://arxiv.org/abs/2303.10800
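A generic sketch of the second stage: a nearest-prototype classifier over frozen self-supervised features whose rejection threshold is calibrated on the support set for out-of-distribution detection; the quantile rule is an assumption.

```python
import numpy as np

class FewShotOpenSetClassifier:
    """Nearest-prototype classifier over frozen self-supervised features, with a
    distance threshold calibrated on the support set to flag out-of-distribution inputs.
    This is a generic sketch of the second stage, not the paper's exact classifier."""

    def __init__(self, quantile=0.95):
        self.quantile = quantile
        self.protos = None
        self.classes = None
        self.threshold = None

    def fit(self, support_feats, support_labels):
        feats = np.asarray(support_feats)
        self.classes = sorted(set(support_labels))
        self.protos = np.stack([feats[np.array(support_labels) == c].mean(0)
                                for c in self.classes])
        # Calibrate the rejection threshold from support-to-prototype distances.
        d = np.linalg.norm(feats[:, None] - self.protos[None], axis=-1).min(axis=1)
        self.threshold = np.quantile(d, self.quantile)

    def predict(self, feats):
        d = np.linalg.norm(np.asarray(feats)[:, None] - self.protos[None], axis=-1)
        labels = [self.classes[i] for i in d.argmin(axis=1)]
        is_ood = d.min(axis=1) > self.threshold
        return labels, is_ood

# Toy usage with random features standing in for the frozen global model's outputs.
clf = FewShotOpenSetClassifier()
clf.fit(np.random.randn(10, 16), [0] * 5 + [1] * 5)
print(clf.predict(np.random.randn(3, 16)))
```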
Deep neural networks (DNNs) are often trained on the premise that the complete training data set is provided ahead of time. However, in real-world scenarios, data often arrive in chunks over time. This raises important questions about the optimal strategy for training DNNs, such as whether to fine-tune them with each chunk of incoming data (warm-start) or to retrain them from scratch on the entire corpus of data whenever a new chunk is available. While the latter can be resource-intensive, recent work has pointed out the lack of generalization in warm-start models. Therefore, to strike a balance between efficiency and generalization, we introduce Learn, Unlearn, and Relearn (LURE), an online learning paradigm for DNNs. LURE alternates between an unlearning phase, which selectively forgets undesirable information in the model through weight reinitialization in a data-dependent manner, and a relearning phase, which emphasizes learning generalizable features. We show that our training paradigm provides consistent performance gains across datasets in both classification and few-shot settings. We further show that it leads to more robust and well-calibrated models.
https://arxiv.org/abs/2303.10455
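A hedged sketch of the unlearning phase: re-initialize the least-important weights of each layer according to a data-dependent score; the gradient-magnitude-style scoring and the reset fraction are assumptions, not LURE's exact criterion.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def unlearn_step(model, importance, reset_fraction=0.2):
    """Selectively re-initialize the least-important weights of each layer.

    `importance` maps parameter name -> tensor of per-weight scores (e.g. accumulated
    gradient magnitude on the new data chunk); the scoring rule here is an assumption.
    The surviving weights are kept, which is what distinguishes this from retraining
    from scratch."""
    for name, param in model.named_parameters():
        if name not in importance or param.dim() < 2:
            continue
        scores = importance[name].flatten()
        k = int(reset_fraction * scores.numel())
        if k == 0:
            continue
        idx = scores.topk(k, largest=False).indices   # least-important weights
        fresh = torch.empty_like(param)
        nn.init.kaiming_uniform_(fresh)               # re-draw from the initializer
        flat = param.flatten().clone()
        flat[idx] = fresh.flatten()[idx]
        param.copy_(flat.view_as(param))

# Toy usage on a small MLP with random importance scores.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
imp = {name: torch.rand_like(p) for name, p in model.named_parameters()}
unlearn_step(model, imp)
```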
For video models to be transferred and applied seamlessly across video tasks in varied environments, Video Unsupervised Domain Adaptation (VUDA) has been introduced to improve the robustness and transferability of video models. However, current VUDA methods rely on a vast amount of high-quality unlabeled target data, which may not be available in real-world cases. We thus consider a more realistic Few-Shot Video-based Domain Adaptation (FSVDA) scenario, where we adapt video models with only a few target video samples. While a few methods have touched upon Few-Shot Domain Adaptation (FSDA) in images and on FSVDA, they rely primarily on spatial augmentation for target-domain expansion, with alignment performed statistically at the instance level. However, videos contain richer temporal and semantic information, which should be fully exploited when augmenting target domains and performing alignment in FSVDA. We propose a novel method, SSA2lign, that addresses FSVDA at the snippet level: the target domain is expanded through a simple snippet-level augmentation, followed by attentive alignment of snippets both semantically and statistically, where semantic alignment is conducted from multiple perspectives. Empirical results demonstrate state-of-the-art performance of SSA2lign across multiple cross-domain action recognition benchmarks.
https://arxiv.org/abs/2303.10451
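A minimal sketch of snippet-level target-domain expansion plus a basic statistical (moment-matching) alignment; the paper additionally performs attentive semantic alignment, which is not reproduced here.

```python
import torch

def snippet_augment(video, num_snippets=4, snippet_len=8):
    """Expand one target video (T, C, H, W) into several randomly-cropped snippets,
    a simple stand-in for the paper's snippet-level target-domain augmentation."""
    T = video.size(0)
    starts = torch.randint(0, max(T - snippet_len, 1), (num_snippets,)).tolist()
    return torch.stack([video[s:s + snippet_len] for s in starts])  # (N, L, C, H, W)

def statistical_alignment(source_feats, target_feats):
    """Match first and second moments of source and target snippet features (a basic
    statistical alignment; SSA2lign also aligns snippets semantically)."""
    mean_gap = (source_feats.mean(0) - target_feats.mean(0)).pow(2).sum()
    cov_s = torch.cov(source_feats.t())
    cov_t = torch.cov(target_feats.t())
    return mean_gap + (cov_s - cov_t).pow(2).sum()

# Toy usage with random snippet-level features.
print(float(statistical_alignment(torch.randn(32, 64), torch.randn(6, 64))))
```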
Conventional training of deep neural networks requires a large number of annotated images, which are laborious and time-consuming to collect, particularly for rare objects. Few-shot object detection (FSOD) methods offer a remedy by realizing robust object detection using only a few training samples per class. An unexplored challenge for FSOD is that instances of unlabeled novel classes, which do not belong to the fixed set of training classes, appear in the background. These objects behave similarly to label noise, leading to FSOD performance degradation. We develop a semi-supervised algorithm to detect and then utilize these unlabeled novel objects as positive samples during training to improve FSOD performance. Specifically, we propose a hierarchical ternary classification region proposal network (HTRPN) to localize potential unlabeled novel objects and assign them new objectness labels. Our improved hierarchical sampling strategy for the region proposal network (RPN) also boosts the perception ability of the object detection model for large objects. Our experimental results indicate that our method is effective and outperforms existing state-of-the-art (SOTA) FSOD methods.
https://arxiv.org/abs/2303.10422
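A simplified sketch of the ternary-objectness idea: a three-way RPN head (background / base-class object / potential novel object) and a label-assignment rule that marks confident, unmatched proposals as potential novel objects; the thresholds and the assignment rule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryObjectnessHead(nn.Module):
    """RPN objectness head with three classes: background, base-class object, and
    potential unlabeled novel object -- a simplified sketch of the ternary idea."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, 3, 1)    # 3 logits per anchor location

    def forward(self, feat):
        return self.cls(F.relu(self.conv(feat)))   # (B, 3, H, W)

def ternary_targets(objectness_score, matched_to_base_gt, high_thresh=0.7):
    """Assign labels: 1 = matches base ground truth, 2 = unmatched but confidently
    object-like (treated as a potential novel object), 0 = background. The threshold
    and the exact assignment rule are assumptions."""
    labels = torch.zeros_like(objectness_score, dtype=torch.long)
    labels[matched_to_base_gt] = 1
    labels[(~matched_to_base_gt) & (objectness_score > high_thresh)] = 2
    return labels

# Toy usage for a handful of anchors.
scores = torch.tensor([0.1, 0.9, 0.8, 0.3])
matched = torch.tensor([False, True, False, False])
print(ternary_targets(scores, matched))   # -> tensor([0, 1, 2, 0])
```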
GPT series models, such as GPT-3, CodeX, InstructGPT, ChatGPT, and so on, have gained considerable attention due to their exceptional natural language processing capabilities. However, despite the abundance of research on the difference in capabilities between GPT series models and fine-tuned models, there has been limited attention given to the evolution of GPT series models' capabilities over time. To conduct a comprehensive analysis of the capabilities of GPT series models, we select six representative models, comprising two GPT-3 series models (i.e., davinci and text-davinci-001) and four GPT-3.5 series models (i.e., code-davinci-002, text-davinci-002, text-davinci-003, and gpt-3.5-turbo). We evaluate their performance on nine natural language understanding (NLU) tasks using 21 datasets. In particular, we compare the performance and robustness of different models for each task under zero-shot and few-shot scenarios. Our extensive experiments reveal that the overall ability of GPT series models on NLU tasks does not increase gradually as the models evolve, especially with the introduction of the RLHF training strategy. While this strategy enhances the models' ability to generate human-like responses, it also compromises their ability to solve some tasks. Furthermore, our findings indicate that there is still room for improvement in areas such as model robustness.
https://arxiv.org/abs/2303.10420
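For reference, a small helper showing the zero-shot vs. few-shot prompt construction used in this kind of evaluation; the template is generic and not necessarily the one used in the paper's experiments.

```python
def build_prompt(instruction, demonstrations, query, k=0):
    """Construct a zero-shot (k=0) or few-shot (k>0) prompt for an NLU task.

    `demonstrations` is a list of (input_text, label) pairs; the template below is a
    generic one, not necessarily the format used in the paper."""
    parts = [instruction.strip()]
    for text, label in demonstrations[:k]:
        parts.append(f"Input: {text}\nLabel: {label}")
    parts.append(f"Input: {query}\nLabel:")
    return "\n\n".join(parts)

# Toy usage: a two-shot sentiment prompt.
demos = [("the movie was wonderful", "positive"),
         ("a tedious, overlong mess", "negative")]
print(build_prompt("Classify the sentiment as positive or negative.",
                   demos, "an unexpectedly moving finale", k=2))
```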
Radiology AI models have made significant progress, approaching or surpassing human performance. However, the partnership between AI models and human radiologists remains an unexplored challenge due to the lack of health information standards, contextual and workflow differences, and data labeling variations. To overcome these challenges, we integrated an AI model service that uses DICOM standard SR annotations into the OHIF viewer in the open-source LibreHealth Radiology Information Systems (RIS). In this paper, we describe the novel human-AI partnership capabilities of the platform, including few-shot learning and swarm learning approaches to retrain the AI models continuously. Building on the concept of machine teaching, we developed an active learning strategy within the RIS, so that the human radiologist can enable/disable AI annotations as well as "fix"/relabel the AI annotations. These annotations are then used to retrain the models. This helps establish a partnership between the radiologist user and a user-specific AI model. The weights of these user-specific models are finally shared across multiple models in a swarm learning approach.
https://arxiv.org/abs/2303.10338
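A minimal stand-in for the swarm-style weight sharing mentioned above: averaging the parameters of several user-specific models into one shared set of weights. Real swarm learning exchanges weights peer-to-peer rather than through a central average, so this is only an illustrative sketch.

```python
import copy
import torch
import torch.nn as nn

def merge_user_models(user_models, weights=None):
    """Average the parameters of several user-specific models into a shared set of
    weights -- a minimal stand-in for swarm-style weight sharing."""
    n = len(user_models)
    weights = weights or [1.0 / n] * n
    merged = copy.deepcopy(user_models[0].state_dict())
    for key in merged:
        merged[key] = sum(w * m.state_dict()[key] for w, m in zip(weights, user_models))
    return merged

# Toy usage: three radiologist-specific copies of the same small classifier.
models = [nn.Linear(10, 2) for _ in range(3)]
shared = merge_user_models(models)
models[0].load_state_dict(shared)   # every participant can load the merged weights
```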
A robot operating in unstructured environments must be able to discriminate between different grasping styles depending on the prospective manipulation task. A system that allows learning from remote non-expert demonstrations can feasibly extend the cognitive skills of a robot for task-oriented grasping. We propose a novel two-step framework towards this aim. The first step involves grasp area estimation by segmentation. We receive grasp area demonstrations for a new task via interactive segmentation, and learn from these few demonstrations to estimate the required grasp area on an unseen scene for the given task. The second step is autonomous grasp estimation in the segmented region. To train the segmentation network for few-shot learning, we built a grasp area segmentation (GAS) dataset with 10089 images grouped into 1121 segmentation tasks. We use an efficient meta-learning algorithm to train for few-shot adaptation. Experimental evaluation shows that our method successfully detects the correct grasp area on the respective objects in unseen test scenes and effectively allows remote teaching of new grasp strategies by non-experts.
https://arxiv.org/abs/2303.10195
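A generic prototype-based few-shot segmentation sketch for the grasp-area step: build a grasp-area prototype from the demonstrated mask and classify query pixels by similarity. This is not the paper's specific meta-learning algorithm; the backbone features and threshold are assumed.

```python
import torch
import torch.nn.functional as F

def fewshot_grasp_area(support_feat, support_mask, query_feat, threshold=0.5):
    """Prototype-based few-shot segmentation of the grasp area (a generic sketch).

    support_feat : (D, H, W) features of a demonstration image (from any backbone).
    support_mask : (H, W) binary mask of the demonstrated grasp area.
    query_feat   : (D, H, W) features of an unseen scene.
    """
    D = support_feat.size(0)
    fg = support_mask.flatten().bool()
    proto = support_feat.flatten(1)[:, fg].mean(dim=1)            # (D,) grasp-area prototype
    sim = F.cosine_similarity(query_feat, proto.view(D, 1, 1), dim=0)
    return (sim > threshold).float()                               # predicted grasp-area mask

# Toy usage with random features standing in for a backbone's output.
D, H, W = 32, 16, 16
mask = (torch.rand(H, W) > 0.8).float()
pred = fewshot_grasp_area(torch.randn(D, H, W), mask, torch.randn(D, H, W))
print(pred.shape)
```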
Visual-Language Models (VLMs) have significantly advanced action video recognition. Supervised by the semantics of action labels, recent works adapt the visual branch of VLMs to learn video representations. Despite the effectiveness proved by these works, we believe that the potential of VLMs has yet to be fully harnessed. In light of this, we exploit the semantic units (SU) hiding behind the action labels and leverage their correlations with fine-grained items in frames for more accurate action recognition. SUs are entities extracted from the language descriptions of the entire action set, including body parts, objects, scenes, and motions. To further enhance the alignments between visual contents and the SUs, we introduce a multi-region module (MRA) to the visual branch of the VLM. The MRA allows the perception of region-aware visual features beyond the original global feature. Our method adaptively attends to and selects relevant SUs with visual features of frames. With a cross-modal decoder, the selected SUs serve to decode spatiotemporal video representations. In summary, the SUs as the medium can boost discriminative ability and transferability. Specifically, in fully-supervised learning, our method achieved 87.8% top-1 accuracy on Kinetics-400. In K=2 few-shot experiments, our method surpassed the previous state-of-the-art by +7.1% and +15.0% on HMDB-51 and UCF-101, respectively.
https://arxiv.org/abs/2303.09756
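A simplified sketch of selecting semantic units: score SU text embeddings against a clip's visual features and keep the most-attended ones. The averaging and top-k rule are assumptions, and the cross-modal decoder is not reproduced here.

```python
import torch
import torch.nn.functional as F

def attend_semantic_units(frame_feats, su_embeds, top_k=5, temperature=0.07):
    """Adaptively score semantic units (body parts, objects, scenes, motions) against
    the visual features of a clip and keep the most relevant ones -- a simplified
    sketch of the selection step, not the paper's full cross-modal decoder.

    frame_feats : (T, D) per-frame (or per-region) visual features.
    su_embeds   : (N, D) text embeddings of the semantic units.
    """
    v = F.normalize(frame_feats, dim=-1)
    s = F.normalize(su_embeds, dim=-1)
    sim = v @ s.t() / temperature                # (T, N) frame-to-SU affinities
    su_scores = sim.softmax(dim=-1).mean(dim=0)  # average attention each SU receives
    idx = su_scores.topk(min(top_k, s.size(0))).indices
    return idx, su_embeds[idx]                   # indices + embeddings of selected SUs

# Toy usage with 8 frames, 20 candidate semantic units, D = 64.
idx, selected = attend_semantic_units(torch.randn(8, 64), torch.randn(20, 64))
print(idx.tolist(), selected.shape)
```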
Data augmentation has become a crucial component for training state-of-the-art visual representation models. However, handcrafting combinations of transformations that lead to improved performance is a laborious task, and can result in visually unrealistic samples. To overcome these limitations, recent works have explored the use of generative models as learnable data augmentation tools, showing promising results in narrow application domains, e.g., few-shot learning and low-data medical imaging. In this paper, we introduce a data augmentation module, called DA_IC-GAN, which leverages instance-conditioned GAN generations and can be used off-the-shelf in conjunction with most state-of-the-art training recipes. We showcase the benefits of DA_IC-GAN by plugging it out of the box into the supervised training of ResNets and DeiT models on the ImageNet dataset, achieving accuracy boosts of between 1 and 2 percentage points with the highest-capacity models. Moreover, the learnt representations are shown to be more robust than the baselines when transferred to a handful of out-of-distribution datasets, and exhibit increased invariance to variations of instance and viewpoint. We additionally couple DA_IC-GAN with a self-supervised training recipe and show that we can also achieve an improvement of 1 percentage point in accuracy in some settings. With this work, we strengthen the evidence on the potential of learnable data augmentations to improve visual representation learning, paving the road towards non-handcrafted augmentations in model training.
https://arxiv.org/abs/2303.09677
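A minimal sketch of using a generative model as an off-the-shelf augmentation: a dataset wrapper that, with some probability, swaps a real image for a generator sample conditioned on that instance. The `generator` callable is a placeholder standing in for the IC-GAN.

```python
import random
import torch

class GenerativeAugment:
    """Dataset wrapper that, with probability p, swaps a real image for one sampled from
    an instance-conditioned generator -- a minimal sketch of a learnable augmentation.
    `generator(image)` is assumed to return an image-like tensor conditioned on the
    input instance (an IC-GAN in the paper's setting)."""

    def __init__(self, dataset, generator, p=0.5):
        self.dataset, self.generator, self.p = dataset, generator, p

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        image, label = self.dataset[i]
        if random.random() < self.p:
            with torch.no_grad():
                image = self.generator(image)   # label is kept: augmentation, not relabeling
        return image, label

# Toy usage: an identity-plus-noise "generator" over a random tensor dataset.
data = [(torch.randn(3, 32, 32), i % 10) for i in range(16)]
aug = GenerativeAugment(data, generator=lambda x: x + 0.1 * torch.randn_like(x))
print(aug[0][0].shape, len(aug))
```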
Generalized few-shot object detection aims to achieve precise detection on both base classes with abundant annotations and novel classes with limited training data. Existing approaches enhance few-shot generalization at the cost of base-class performance, or maintain high precision in base-class detection with limited improvement in novel-class adaptation. In this paper, we point out that the reason is insufficient discriminative feature learning for all of the classes. As such, we propose a new training framework, DiGeo, to learn geometry-aware features with inter-class separation and intra-class compactness. To guide the separation of feature clusters, we derive an offline simplex equiangular tight frame (ETF) classifier whose weights serve as class centers and are maximally and equally separated. To tighten the cluster for each class, we include adaptive class-specific margins in the classification loss and encourage the features to lie close to the class centers. Experimental studies on two few-shot benchmark datasets (VOC, COCO) and one long-tail dataset (LVIS) demonstrate that, with a single model, our method can effectively improve generalization on novel classes without hurting the detection of base classes.
https://arxiv.org/abs/2303.09674
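The simplex ETF classifier has a closed form, so the sketch below constructs it and applies a margin-augmented cross-entropy that pulls features toward the fixed class centers; the margin values and the temperature are taken as given here, whereas DiGeo adapts the margins per class.

```python
import numpy as np
import torch
import torch.nn.functional as F

def simplex_etf(num_classes, feat_dim, seed=0):
    """Fixed classifier weights forming a simplex equiangular tight frame: unit-norm
    class centers with identical pairwise angles, maximally and equally separated.
    Requires feat_dim >= num_classes."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((feat_dim, num_classes)))  # orthonormal columns
    C = num_classes
    M = np.sqrt(C / (C - 1)) * U @ (np.eye(C) - np.ones((C, C)) / C)
    return torch.tensor(M.T, dtype=torch.float32)                      # (C, feat_dim)

def margin_ce_loss(features, labels, etf_weights, margins):
    """Cross-entropy with class-specific margins subtracted from the target logit,
    pulling features toward their (fixed) ETF class center. The adaptive margin rule
    is part of DiGeo; here `margins` is simply taken as given."""
    logits = F.normalize(features, dim=-1) @ etf_weights.t()
    logits = logits - margins[labels].unsqueeze(1) * F.one_hot(labels, logits.size(1))
    return F.cross_entropy(logits / 0.05, labels)   # 0.05 is an illustrative temperature

# Toy usage: 20 classes, 128-d features.
W = simplex_etf(20, 128)
feats, labels = torch.randn(8, 128), torch.randint(0, 20, (8,))
print(float(margin_ce_loss(feats, labels, W, margins=torch.full((20,), 0.1))))
```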