Machine unlearning has emerged as a prominent and challenging area of interest, driven in large part by the rising regulatory demands for industries to delete user data upon request and the heightened awareness of privacy. Existing approaches either retrain models from scratch or use several finetuning steps for every deletion request, often constrained by computational resource limitations and restricted access to the original training data. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate an entire class or a group of classes from the learned model. To that end, our algorithm first estimates the Retain Space and the Forget Space, representing the feature or activation spaces for samples from classes to be retained and unlearned, respectively. To obtain these spaces, we propose a novel singular value decomposition-based technique that requires layer-wise collection of network activations from a few forward passes through the network. We then compute the shared information between these spaces and remove it from the forget space to isolate the class-discriminatory feature space for unlearning. Finally, we project the model weights in the orthogonal direction of the class-discriminatory space to obtain the unlearned model. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer with only a $\sim$1.5% drop in retain accuracy compared to the original model while maintaining under 1% accuracy on the unlearned class samples. Further, our algorithm consistently performs well when subjected to Membership Inference Attacks, showing a 7.8% improvement on average across a variety of image classification datasets and network architectures compared to other baselines, while being $\sim$6x more computationally efficient.
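A minimal NumPy sketch of the core projection idea described above: estimate activation subspaces with an SVD, remove the retain-space overlap from the forget space, and project the weights away from the remaining class-discriminatory directions. The energy threshold, the single-layer setting, and the uniform scaling `alpha` are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def space_from_activations(acts, energy=0.99):
    """Top left-singular vectors of layer activations (rows = features, cols = samples)."""
    U, S, _ = np.linalg.svd(acts, full_matrices=False)
    k = np.searchsorted(np.cumsum(S**2) / np.sum(S**2), energy) + 1
    return U[:, :k]

def unlearn_layer(W, acts_retain, acts_forget, alpha=1.0):
    """Project weights away from the class-discriminatory directions of the forget class.

    W      : (d_out, d_in) weight matrix of one layer
    acts_* : (d_in, n_samples) input activations collected from a few forward passes
    """
    U_r = space_from_activations(acts_retain)    # Retain Space
    U_f = space_from_activations(acts_forget)    # Forget Space
    shared = U_r @ (U_r.T @ U_f)                 # component of the forget space shared with retain
    discrim = U_f - shared                       # class-discriminatory directions
    Q, _ = np.linalg.qr(discrim)                 # orthonormalise the remaining directions
    P = np.eye(W.shape[1]) - alpha * (Q @ Q.T)   # projector onto their orthogonal complement
    return W @ P

# toy usage: 64-dim activations from a handful of forward passes
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 64))
W_unlearned = unlearn_layer(W, rng.standard_normal((64, 128)), rng.standard_normal((64, 32)))
```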
https://arxiv.org/abs/2312.00761
Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in which the token representations become identical when the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.
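The fidelity idea can be illustrated with a small PyTorch sketch: a standard self-attention block whose (smooth) output is pulled back toward the input tokens by a weighted difference term. The exact placement and weighting of this term in NeuTRENO may differ; `lam` and the module structure here are assumptions.

```python
import torch
import torch.nn as nn

class FidelityAttention(nn.Module):
    """Self-attention with an extra term that pulls the (smooth) attention output
    back toward the input tokens, discouraging token uniformity."""

    def __init__(self, dim, num_heads=8, lam=0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lam = lam

    def forward(self, x):
        out, _ = self.attn(x, x, x)          # smooth output tokens
        return out + self.lam * (x - out)    # fidelity correction toward the inputs

x = torch.randn(2, 16, 64)                   # (batch, tokens, dim)
y = FidelityAttention(64)(x)
print(y.shape)                               # torch.Size([2, 16, 64])
```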
https://arxiv.org/abs/2312.00751
We present GIFT (Generative Interpretable Fine-tuning Transformers) for fine-tuning pretrained (often large) Transformer models at downstream tasks in a parameter-efficient way with built-in interpretability. Our GIFT is a deep parameter-residual learning method, which addresses two problems in fine-tuning a pretrained Transformer model: where to apply parameter-efficient fine-tuning (PEFT) to be extremely lightweight yet sufficiently expressive, and how to learn the PEFT so as to better exploit the knowledge of the pretrained model in a direct way. For the former, we select the final projection (linear) layer in the multi-head self-attention of a Transformer model, and verify its effectiveness. For the latter, in contrast to prior art that directly introduces new model parameters (often in low-rank approximation form) to be learned in fine-tuning with downstream data, we propose a method for learning to generate the fine-tuning parameters. Our GIFT is a hyper-Transformer which takes as input the pretrained parameters of the projection layer and generates its fine-tuning parameters using a proposed Parameter-to-Cluster Attention (PaCa). The PaCa results in a simple clustering-based forward explainer that plays the role of semantic segmentation in testing. In experiments, our proposed GIFT is tested on the VTAB benchmark and the fine-grained visual classification (FGVC) benchmark. It obtains significantly better performance than the prior art. Our code is available at this https URL
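A rough, hypothetical PyTorch sketch of the generate-the-fine-tuning-parameters idea: a small hyper-network reads the frozen projection weights (rows as tokens), soft-assigns them to a handful of learnable cluster tokens, and emits a residual update. The exact PaCa formulation is not reproduced here; module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class WeightGenerator(nn.Module):
    """Hypothetical hyper-network: reads the pretrained projection weights (rows as
    tokens), attends to a small set of learnable cluster tokens, and emits a
    residual update for those weights."""

    def __init__(self, dim, num_clusters=16):
        super().__init__()
        self.clusters = nn.Parameter(torch.randn(num_clusters, dim) * 0.02)
        self.to_scores = nn.Linear(dim, num_clusters, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, W_pretrained):               # (d_out, d_in)
        scores = self.to_scores(W_pretrained)      # parameter-to-cluster assignment
        attn = scores.softmax(dim=-1)              # (d_out, num_clusters)
        mixed = attn @ self.proj(self.clusters)    # (d_out, d_in) residual update
        return W_pretrained + mixed                # fine-tuned weights

W = torch.randn(768, 768)                          # frozen projection weights
gift = WeightGenerator(768)
W_finetuned = gift(W)                              # only `gift` parameters are trained
```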
https://arxiv.org/abs/2312.00700
Guided by grammatical structure, words compose to form sentences, and guided by discourse structure, sentences compose to form dialogues and documents. The compositional aspect of sentence and discourse units is often overlooked by machine learning algorithms. A recent initiative called Quantum Natural Language Processing (QNLP) learns word meanings as points in a Hilbert space and acts on them via a translation of grammatical structure into Parametrised Quantum Circuits (PQCs). Previous work extended the QNLP translation to discourse structure using points in a closure of Hilbert spaces. In this paper, we evaluate this translation on a Winograd-style pronoun resolution task. We train a Variational Quantum Classifier (VQC) for binary classification and implement an end-to-end pronoun resolution system. The simulations executed on IBMQ software converged with an F1 score of 87.20%. The model outperformed two out of three classical coreference resolution systems and neared state-of-the-art SpanBERT. A mixed quantum-classical model further improved these results, with an F1 score increase of around 6%.
https://arxiv.org/abs/2312.00688
Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most existing CLIP-like works usually adopt relatively large image encoders like ResNet50 and ViT, while lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Secondly, a relaxed bipartite-matching-based token-level alignment objective is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of the CLIP model does not increase correspondingly as the parameters of the text encoder increase, an extra masked language modeling (MLM) objective is leveraged to maximize the potential of the shortened text encoder. In practice, an auxiliary fusion module that injects unmasked image embeddings into masked text embeddings at different network stages is proposed to enhance the MLM. Extensive experiments show that, without introducing additional computational cost during inference, the proposed method achieves higher performance on multiple downstream tasks.
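The first ingredient, softening the labels of negative samples in the instance-level alignment objective, can be sketched as an InfoNCE-style loss with smoothed targets; the progressive schedule for `epsilon` and the exact target construction used in the paper are assumptions here.

```python
import torch
import torch.nn.functional as F

def softened_clip_loss(img_emb, txt_emb, epsilon, temperature=0.07):
    """InfoNCE-style alignment loss where the one-hot targets are softened:
    each negative receives epsilon / (N - 1) probability mass instead of 0.
    `epsilon` can be scheduled (e.g. increased progressively during training)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (N, N) similarity matrix
    n = logits.size(0)
    targets = torch.full_like(logits, epsilon / (n - 1))
    targets.fill_diagonal_(1.0 - epsilon)                   # softened one-hot labels
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits, dim=0)).sum(dim=0).mean()
    return 0.5 * (loss_i2t + loss_t2i)

loss = softened_clip_loss(torch.randn(32, 512), torch.randn(32, 512), epsilon=0.1)
```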
https://arxiv.org/abs/2312.00674
Diffusion models have gained prominence in generating data for perception tasks such as image classification and object detection. However, their potential for generating high-quality tracking sequences, a crucial aspect of video perception, has not been fully investigated. To address this gap, we propose TrackDiffusion, a novel architecture designed to generate continuous video sequences from tracklets. TrackDiffusion represents a significant departure from traditional layout-to-image (L2I) generation and copy-paste synthesis, which focus on static image elements like bounding boxes: it empowers image diffusion models to encompass dynamic and continuous tracking trajectories, thereby capturing complex motion nuances and ensuring instance consistency among video frames. For the first time, we demonstrate that the generated video sequences can be utilized for training multi-object tracking (MOT) systems, leading to significant improvements in tracker performance. Experimental results show that our model significantly enhances instance consistency in generated video sequences, leading to improved perceptual metrics. Our approach achieves an improvement of 8.7 in TrackAP and 11.8 in TrackAP$_{50}$ on the YTVIS dataset, underscoring its potential to redefine the standards of video data generation for MOT tasks and beyond.
https://arxiv.org/abs/2312.00651
Normalization techniques have been widely used in the field of deep learning due to their ability to enable higher learning rates and reduce sensitivity to initialization. However, the effectiveness of popular normalization technologies is typically limited to specific areas. Unlike standard Batch Normalization (BN) and Layer Normalization (LN), where BN computes the mean and variance along the (N, H, W) dimensions and LN computes the mean and variance along the (C, H, W) dimensions (N, C, H and W are the batch, channel, spatial height and width dimensions, respectively), this paper presents a novel normalization technique called Batch Channel Normalization (BCN). To exploit both channel and batch dependence and adaptively combine the advantages of BN and LN based on specific datasets or tasks, BCN separately normalizes inputs along the (N, H, W) and (C, H, W) axes, then combines the normalized outputs based on adaptive parameters. As a basic block, BCN can be easily integrated into existing models for various applications in the field of computer vision. Empirical results show that the proposed technique can be seamlessly applied to various versions of CNN or Vision Transformer architectures. The code is publicly available at this https URL
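A minimal PyTorch sketch of the described combination, reusing BatchNorm2d for the (N, H, W) statistics and GroupNorm with a single group for the (C, H, W) statistics; how BCN parameterizes the adaptive combination (per channel, gated or not, and how the affine parameters are shared) is an assumption here.

```python
import torch
import torch.nn as nn

class BatchChannelNorm(nn.Module):
    """Sketch of Batch Channel Normalization: normalize along (N, H, W) as in BN and
    along (C, H, W) as in LN, then mix the two with a learnable gate."""

    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels, eps=eps, affine=False)
        self.ln = nn.GroupNorm(1, num_channels, eps=eps, affine=False)  # LN over (C, H, W)
        self.mix = nn.Parameter(torch.full((1, num_channels, 1, 1), 0.5))
        self.weight = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):                        # x: (N, C, H, W)
        gate = torch.sigmoid(self.mix)           # adaptive combination parameter
        y = gate * self.bn(x) + (1 - gate) * self.ln(x)
        return y * self.weight + self.bias

x = torch.randn(8, 32, 16, 16)
print(BatchChannelNorm(32)(x).shape)             # torch.Size([8, 32, 16, 16])
```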
https://arxiv.org/abs/2312.00596
There is a growing demand for explainable, transparent, and data-driven models within the domain of fraud detection. Decisions made by fraud detection models need to be explainable in the event of a customer dispute. Additionally, the decision-making process in the model must be transparent to win the trust of regulators and business stakeholders. At the same time, fraud detection solutions can benefit from data due to the noisy, dynamic nature of fraud and the availability of large historical data sets. Finally, fraud detection is notorious for its class imbalance: there are typically several orders of magnitude more legitimate transactions than fraudulent ones. In this paper, we present Deep Symbolic Classification (DSC), an extension of the Deep Symbolic Regression framework to classification problems. DSC casts classification as a search problem in the space of all analytic functions composed of a vocabulary of variables, constants, and operations, and optimizes for an arbitrary evaluation metric directly. The search is guided by a deep neural network trained with reinforcement learning. Because the functions are mathematical expressions in closed form and concise, the model is inherently explainable both at the level of a single classification decision and at the level of the model's decision process. Furthermore, the class imbalance problem is successfully addressed by optimizing for metrics that are robust to class imbalance, such as the F1 score. This eliminates the need for the oversampling and undersampling techniques that plague traditional approaches. Finally, the model allows one to explicitly balance prediction accuracy and explainability. An evaluation on the PaySim data set demonstrates competitive predictive performance with state-of-the-art models, while surpassing them in terms of explainability. This establishes DSC as a promising model for fraud detection systems.
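As a toy illustration of casting classification as a search over closed-form expressions scored directly by F1, the sketch below replaces the RL-guided search with naive random sampling of small expression trees; it is not the DSC algorithm itself.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
OPS = {"+": np.add, "-": np.subtract, "*": np.multiply}

def random_expression(n_features, depth=2):
    """Build a tiny random symbolic expression as (description, callable)."""
    if depth == 0 or rng.random() < 0.3:
        j = rng.integers(n_features)
        return f"x{j}", lambda X, j=j: X[:, j]
    op = rng.choice(list(OPS))
    ld, lf = random_expression(n_features, depth - 1)
    rd, rf = random_expression(n_features, depth - 1)
    return f"({ld} {op} {rd})", lambda X, o=OPS[op], lf=lf, rf=rf: o(lf(X), rf(X))

def search(X, y, n_candidates=2000):
    """Stand-in for the RL-guided search: sample expressions, keep the best F1."""
    best = (None, -1.0)
    for _ in range(n_candidates):
        desc, f = random_expression(X.shape[1])
        preds = (f(X) > 0).astype(int)           # threshold the expression at 0
        score = f1_score(y, preds, zero_division=0)
        if score > best[1]:
            best = (desc, score)
    return best

X = rng.standard_normal((500, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # toy labels
print(search(X, y))                              # e.g. ('(x0 * x1)', ~1.0)
```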
https://arxiv.org/abs/2312.00586
Unsupervised domain adaptation (UDA) seeks to bridge the domain gap between the target and source domains using unlabeled target data. Source-free UDA removes the requirement for labeled source data at the target to preserve data privacy and storage. However, work on source-free UDA assumes knowledge of the domain gap distribution, and hence is limited to either target-aware settings or classification tasks. To overcome this, we propose TASFAR, a novel target-agnostic source-free domain adaptation approach for regression tasks. Using prediction confidence, TASFAR estimates a label density map as the target label distribution, which is then used to calibrate the source model on the target domain. We have conducted extensive experiments on four regression tasks with various domain gaps, namely pedestrian dead reckoning for different users, image-based people counting in different scenes, housing-price prediction in different districts, and taxi-trip duration prediction from different departure points. TASFAR is shown to substantially outperform state-of-the-art source-free UDA approaches, reducing errors by 22% on average across the four tasks, and achieves accuracy notably comparable to source-based UDA without using source data.
https://arxiv.org/abs/2312.00540
Despite the prevalence of pretrained language models in natural language understanding tasks, understanding lengthy text such as documents is still challenging due to the data sparseness problem. Inspired by the fact that humans develop their ability to understand lengthy text by reading shorter text, we propose a simple yet effective summarization-based data augmentation, SUMMaug, for document classification. We first obtain easy-to-learn examples for the target document classification task by summarizing the input of the original training examples, while optionally merging the original labels to conform to the summarized input. We then use the generated pseudo examples to perform curriculum learning. Experimental results on two datasets confirm the advantage of our method over existing baseline methods in terms of robustness and accuracy. We release our code and data at this https URL.
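A hedged sketch of the augmentation step, assuming an off-the-shelf summarizer from the transformers library (the model name is just an example) and a two-stage curriculum that trains on the summaries before the original documents; label merging for summarized inputs is omitted.

```python
from transformers import pipeline

# Hypothetical setup: any off-the-shelf summarizer can serve as the augmenter.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_examples(examples, max_length=64):
    """Create easy-to-learn pseudo examples by summarizing each document,
    keeping the original label."""
    pseudo = []
    for text, label in examples:
        summary = summarizer(text, max_length=max_length)[0]["summary_text"]
        pseudo.append((summary, label))
    return pseudo

# Curriculum: train on the short summaries first, then on the original documents.
train_docs = [("A very long document about topic A ...", 0),
              ("Another lengthy report discussing topic B ...", 1)]
easy_stage = summarize_examples(train_docs)
curriculum = [easy_stage, train_docs]
for stage in curriculum:
    pass  # fine-tune the document classifier on `stage` here
```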
https://arxiv.org/abs/2312.00513
Objective: Despite the recent increase in research activity, deep-learning models have not yet been widely accepted in medicine. The shortage of high-quality annotated data often hinders the development of robust and generalizable models, which do not suffer from degraded effectiveness when presented with newly-collected, out-of-distribution (OOD) datasets. Methods: Contrastive Self-Supervised Learning (SSL) offers a potential solution to the scarcity of labeled data as it takes advantage of unlabeled data to increase model effectiveness and robustness. In this research, we propose applying contrastive SSL for detecting abnormalities in phonocardiogram (PCG) samples by learning a generalized representation of the signal. Specifically, we perform an extensive comparative evaluation of a wide range of audio-based augmentations and evaluate trained classifiers on multiple datasets across different downstream tasks. Results: We experimentally demonstrate that, depending on its training distribution, the effectiveness of a fully-supervised model can degrade up to 32% when evaluated on unseen data, while SSL models only lose up to 10% or even improve in some cases. Conclusions: Contrastive SSL pretraining can assist in providing robust classifiers which can generalize to unseen, OOD data, without relying on time- and labor-intensive annotation processes by medical experts. Furthermore, the proposed extensive evaluation protocol sheds light on the most promising and appropriate augmentations for robust PCG signal processing. Significance: We provide researchers and practitioners with a roadmap towards producing robust models for PCG classification, in addition to an open-source codebase for developing novel approaches.
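The pretraining objective can be illustrated with a SimCLR-style NT-Xent loss over two augmented views of each PCG snippet; the toy waveform augmentation and encoder below are placeholders for the audio augmentations and architecture actually evaluated in the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.2):
    """SimCLR-style contrastive loss between two augmented views of the same
    PCG recordings (z1, z2: (N, d) projections of the encoder outputs)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)      # (2N, d)
    sim = z @ z.t() / temperature
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])    # positive indices
    return F.cross_entropy(sim, targets)

def augment(waveform):
    """Placeholder audio augmentation: random gain plus additive noise."""
    gain = 0.8 + 0.4 * torch.rand(1)
    return gain * waveform + 0.01 * torch.randn_like(waveform)

batch = torch.randn(16, 4000)                 # 16 PCG snippets
encoder = torch.nn.Sequential(torch.nn.Linear(4000, 256), torch.nn.ReLU(),
                              torch.nn.Linear(256, 128))
loss = nt_xent(encoder(augment(batch)), encoder(augment(batch)))
```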
https://arxiv.org/abs/2312.00502
A prompt is a sequence of symbols or tokens, selected from a vocabulary according to some rule, which is prepended/concatenated to a textual query. A key problem is how to select the sequence of tokens: in this paper we formulate it as a combinatorial optimization problem. The high dimensionality of the token space, compounded by the length of the prompt sequence, requires a very efficient solution. We propose a Bayesian optimization method, executed in a continuous embedding of the combinatorial space. We focus on hard prompt tuning (HPT), which directly searches for discrete tokens to be added to the text input without requiring access to the large language model (LLM) and can be used even when the LLM is available only as a black box. This is critically important if LLMs are made available in the Model-as-a-Service (MaaS) manner, as with GPT-4. The current manuscript is focused on the optimization of discrete prompts for classification tasks. Discrete prompts give rise to a difficult combinatorial optimization problem which easily becomes intractable given the dimension of the token space in realistic applications. The optimization method considered in this paper is Bayesian optimization (BO), which has become the dominant approach in black-box optimization for its sample efficiency along with its modular structure and versatility. We use BoTorch, a library for Bayesian optimization research built on top of PyTorch. Albeit preliminary and obtained using a 'vanilla' version of BO, the experiments on RoBERTa on six benchmarks show good performance across a variety of tasks and enable an analysis of the tradeoff between the size of the search space, accuracy, and wall-clock time.
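A hedged sketch of the overall loop, assuming a recent BoTorch version: run BO in a continuous embedding whose dimension is prompt length times embedding size, and round each candidate back to the nearest token embeddings before querying the black-box scorer. The toy `score_prompt`, the rounding rule, and the bounds are illustrative, not the paper's exact setup.

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

torch.set_default_dtype(torch.double)
vocab, dim, prompt_len = 1000, 16, 4
embeddings = torch.randn(vocab, dim)              # stand-in for the LLM token embeddings

def score_prompt(token_ids):                      # stand-in for the black-box LLM evaluation
    return torch.sin(embeddings[token_ids].sum()).unsqueeze(0)

def round_to_tokens(x):
    """Map a continuous point back to the nearest token embeddings."""
    chunks = x.view(prompt_len, dim)
    return torch.cdist(chunks, embeddings).argmin(dim=-1)

bounds = torch.stack([-3 * torch.ones(prompt_len * dim), 3 * torch.ones(prompt_len * dim)])
X = torch.rand(8, prompt_len * dim) * 6 - 3                       # initial design
Y = torch.stack([score_prompt(round_to_tokens(x)) for x in X])

for _ in range(20):                                               # BO loop
    gp = SingleTaskGP(X, Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))
    acq = ExpectedImprovement(gp, best_f=Y.max())
    cand, _ = optimize_acqf(acq, bounds=bounds, q=1, num_restarts=5, raw_samples=64)
    y_new = score_prompt(round_to_tokens(cand.squeeze(0)))
    X, Y = torch.cat([X, cand]), torch.cat([Y, y_new.unsqueeze(0)])

best_prompt = round_to_tokens(X[Y.argmax()])      # discrete token ids of the best prompt
```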
https://arxiv.org/abs/2312.00471
Hyperdimensional Computing (HDC) is a brain-inspired and light-weight machine learning method. It has received significant attention in the literature as a candidate to be applied in the wearable internet of things, near-sensor artificial intelligence applications and on-device processing. HDC is computationally less complex than traditional deep learning algorithms and typically achieves moderate to good classification performance. A key aspect that determines the performance of HDC is the encoding of the input data to the hyperdimensional (HD) space. This article proposes a novel light-weight approach relying only on native HD arithmetic vector operations to encode binarized images that preserves similarity of patterns at nearby locations by using point of interest selection and local linear mapping. The method reaches an accuracy of 97.35% on the test set for the MNIST data set and 84.12% for the Fashion-MNIST data set. These results outperform other studies using baseline HDC with different encoding approaches and are on par with more complex hybrid HDC models. The proposed encoding approach also demonstrates a higher robustness to noise and blur compared to the baseline encoding.
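For orientation, here is a generic record-based binary HDC image encoder using only native HD operations (XOR binding, majority bundling, Hamming similarity); the paper's point-of-interest selection and local linear mapping are not reproduced in this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                        # hypervector dimensionality

def random_hv():
    return rng.integers(0, 2, D, dtype=np.uint8)

# Item memories: one hypervector per pixel position and one per pixel value (0/1)
H, W = 28, 28
pos_hv = {(r, c): random_hv() for r in range(H) for c in range(W)}
val_hv = {0: random_hv(), 1: random_hv()}

def encode(img_bin):
    """Bundle (majority vote) the bindings (XOR) of position and value hypervectors."""
    bound = np.stack([pos_hv[(r, c)] ^ val_hv[img_bin[r, c]]
                      for r in range(H) for c in range(W)])
    return (bound.sum(axis=0) > bound.shape[0] // 2).astype(np.uint8)

def hamming_sim(a, b):
    return 1.0 - np.mean(a ^ b)

img = (rng.random((H, W)) > 0.5).astype(np.uint8)
proto = encode(img)                               # e.g. a class prototype from one image
print(hamming_sim(proto, encode(img)))            # 1.0: identical inputs match exactly
```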
https://arxiv.org/abs/2312.00454
Vision Transformers have received significant attention due to their impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, the channel mixer or feature mixing block (FFN or MLP) has not been explored in depth, even though it accounts for the bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace the dense connections and confirm this with a block diagonal MLP structure that improves the accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure and thereby further improve the accuracy, a lightweight, parameter-free, channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. This design of CCA enables gradual feature mixing across channel groups during training, whose contribution decays to zero as the training progresses to convergence. This allows the CCA block to be discarded during inference, thus enabling enhanced performance with no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block diagonal structure size in the MLP. This is shown by the introduction of a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially under lower FLOPs regimes. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
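The block-diagonal channel mixer can be sketched with grouped 1x1 convolutions, which are exactly block-diagonal linear maps over channel groups; the parallel channel covariance attention branch used only during training is omitted, and the expansion/group sizes are illustrative.

```python
import torch
import torch.nn as nn

class BlockDiagonalMLP(nn.Module):
    """Channel mixer whose weight matrices are block diagonal: channels are split
    into `groups` disjoint clusters that are mixed independently, which frees
    parameters for a larger expansion ratio at the same cost."""

    def __init__(self, dim, expansion=8, groups=4):
        super().__init__()
        # Grouped 1x1 convolutions implement block-diagonal linear layers.
        self.fc1 = nn.Conv1d(dim, dim * expansion, 1, groups=groups)
        self.act = nn.GELU()
        self.fc2 = nn.Conv1d(dim * expansion, dim, 1, groups=groups)

    def forward(self, x):                      # x: (batch, tokens, dim)
        x = x.transpose(1, 2)                  # (batch, dim, tokens) for Conv1d
        x = self.fc2(self.act(self.fc1(x)))
        return x.transpose(1, 2)

x = torch.randn(2, 196, 384)
print(BlockDiagonalMLP(384)(x).shape)          # torch.Size([2, 196, 384])
```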
https://arxiv.org/abs/2312.00412
Active learning aims to enhance model performance by strategically labeling informative data points. While extensively studied, its effectiveness on large-scale, real-world datasets remains underexplored. Existing research primarily focuses on single-source data, ignoring the multi-domain nature of real-world data. We introduce a multi-domain active learning benchmark to bridge this gap. Our benchmark demonstrates that traditional single-domain active learning strategies are often less effective than random selection in multi-domain scenarios. We also introduce CLIP-GeoYFCC, a novel large-scale image dataset built around geographical domains, in contrast to existing genre-based domain datasets. Analysis on our benchmark shows that all multi-domain strategies exhibit significant tradeoffs, with no strategy outperforming across all datasets or all metrics, emphasizing the need for future research.
https://arxiv.org/abs/2312.00364
In recent years, classical Convolutional Neural Networks (CNNs) have been applied successfully to image recognition. Quantum Convolutional Neural Networks (QCNNs) are proposed as a novel generalization of CNNs using quantum mechanisms. The quantum mechanisms lead to an efficient training process in QCNNs by reducing the size of the input from $N$ to $\log_2 N$. This paper implements and compares both CNNs and QCNNs by testing losses and prediction accuracy on three commonly used datasets: the MNIST hand-written digits, Fashion MNIST, and cat/dog face images. Additionally, data augmentation (DA), a technique commonly used in CNNs to improve classification performance by generating similar images based on the original inputs, is also implemented in QCNNs. Surprisingly, the results showed that data augmentation did not improve QCNN performance. The reasons and logic behind this result are discussed, in the hope of expanding our understanding of quantum machine learning theory.
https://arxiv.org/abs/2312.00358
After pre-training by generating the next word conditional on previous words, the Language Model (LM) acquires the ability of In-Context Learning (ICL): it can learn a new task conditioned on the given in-context examples (ICEs). Similarly, visually-conditioned language modelling is also used to train Vision-Language Models (VLMs) with ICL ability. However, such VLMs typically exhibit weaker classification abilities compared to contrastive learning-based models like CLIP, since the language modelling objective does not directly contrast whether an object is paired with a text. To improve the ICL of classification, using more ICEs to provide more knowledge is a straightforward way. However, this may largely increase the selection time, and more importantly, the inclusion of additional in-context images tends to extend the length of the in-context sequence beyond the processing capacity of a VLM. To alleviate these limitations, we propose to manipulate the label space of each ICE to increase its knowledge density, allowing fewer ICEs to convey as much information as a larger set would. Specifically, we propose two strategies, Label Distribution Enhancement and Visual Descriptions Enhancement, to improve in-context classification performance on diverse datasets, including the classic ImageNet and more fine-grained datasets like CUB-200. On ImageNet, our approach increases accuracy from 74.70\% in a 4-shot setting to 76.21\% with just 2 shots, surpassing CLIP by 0.67\%. On CUB-200, our method raises 1-shot accuracy from 48.86\% to 69.05\%, 12.15\% higher than CLIP. The code is given in https://anonymous.4open.science/r/MLS_ICC.
https://arxiv.org/abs/2312.00351
The recent advances in artificial intelligence and deep learning facilitate automation in various applications including home automation, smart surveillance systems, and healthcare, among others. Human Activity Recognition is one of its emerging applications, which can be implemented in a classroom environment to enhance safety, efficiency, and overall educational quality. This paper proposes a system for detecting and recognizing the activities of students in a classroom environment. The dataset was structured and recorded by the authors, since a standard dataset for this task was not available at the time of this study. Transfer learning, a widely adopted method within the field of deep learning, has proven to be helpful in complex tasks like image and video processing. Pretrained models including VGG-16, ResNet-50, InceptionV3, and Xception are used for feature extraction and classification tasks. Xception achieved an accuracy of 93% on the novel classroom dataset, outperforming the other three models in consideration. The system proposed in this study aims to introduce a safer and more productive learning environment for students and educators.
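A minimal transfer-learning sketch with one of the evaluated backbones (ResNet-50 from torchvision, since Xception is not available there): freeze the pretrained feature extractor and train a new classification head. The number of activity classes is hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 5                                   # hypothetical number of classroom activities

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in backbone.parameters():                   # freeze the pretrained feature extractor
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)   # new classification head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):                   # images: (B, 3, 224, 224)
    optimizer.zero_grad()
    loss = criterion(backbone(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(4, 3, 224, 224), torch.randint(0, num_classes, (4,))))
```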
https://arxiv.org/abs/2312.00348
Stein's paradox holds considerable sway in high-dimensional statistics, highlighting that the sample mean, traditionally considered the de facto estimator, might not be the most efficacious in higher dimensions. To address this, the James-Stein estimator proposes an enhancement by steering the sample means toward a more centralized mean vector. In this paper, first, we establish that normalization layers in deep learning use inadmissible estimators for mean and variance. Next, we introduce a novel method to employ the James-Stein estimator to improve the estimation of mean and variance within normalization layers. We evaluate our method on different computer vision tasks: image classification, semantic segmentation, and 3D object classification. Through these evaluations, it is evident that our improved normalization layers consistently yield superior accuracy across all tasks without extra computational burden. Moreover, recognizing that a plethora of shrinkage estimators surpass the traditional estimator in performance, we study two other prominent shrinkage estimators: Ridge and LASSO. Additionally, we provide visual representations to intuitively demonstrate the impact of shrinkage on the estimated layer statistics. Finally, we study the effect of regularization and batch size on our modified batch normalization. The studies show that our method is less sensitive to batch size and regularization, improving accuracy under various setups.
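A sketch of the idea for the mean estimate only, assuming positive-part James-Stein shrinkage of the per-channel batch means toward their grand mean; the paper's exact shrinkage target, its treatment of the variance estimate, and the noise-variance estimate used here are assumptions.

```python
import torch
import torch.nn as nn

def james_stein(est, sigma2, target):
    """Shrink a d-dimensional estimate toward `target` (positive-part James-Stein)."""
    d = est.numel()
    diff = est - target
    shrink = 1.0 - (d - 2) * sigma2 / diff.pow(2).sum().clamp_min(1e-12)
    return target + shrink.clamp_min(0.0) * diff

class JSBatchNorm2d(nn.Module):
    """BatchNorm variant whose per-channel batch means are replaced by a
    James-Stein estimate shrunk toward the grand mean across channels."""

    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):                                    # (N, C, H, W)
        mean = x.mean(dim=(0, 2, 3))                         # per-channel sample means
        var = x.var(dim=(0, 2, 3), unbiased=False)
        n = x.numel() / x.size(1)                            # samples per channel
        sigma2 = (var / n).mean()                            # variance of the mean estimate
        js_mean = james_stein(mean, sigma2, mean.mean())
        x_hat = (x - js_mean.view(1, -1, 1, 1)) / (var.view(1, -1, 1, 1) + self.eps).sqrt()
        return x_hat * self.weight + self.bias

print(JSBatchNorm2d(32)(torch.randn(8, 32, 16, 16)).shape)   # torch.Size([8, 32, 16, 16])
```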
https://arxiv.org/abs/2312.00313
Against the backdrop of the increasing data requirements of deep neural networks for object recognition, which are growing more untenable by the day, we present Developmental PreTraining (DPT) as a possible solution. DPT is a curriculum-based pre-training approach designed to rival traditional pre-training techniques that are data-hungry. Such training approaches also introduce unnecessary features that could be misleading when the network is employed in a downstream classification task where the data is sufficiently different from the pre-training data and is scarce. We design the curriculum for DPT by drawing inspiration from human infant visual development. DPT employs a phased approach in which carefully selected primitive and universal features, such as edges and shapes, are taught to the network participating in our pre-training regime. A model that underwent the DPT regime is tested against models with randomised weights to evaluate the viability of DPT.
https://arxiv.org/abs/2312.00304