This paper reveals that every image can be understood as a first-order norm+linear autoregressive process, referred to as FINOLA, where norm+linear denotes the use of normalization before the linear model. We demonstrate that images of size 256$\times$256 can be reconstructed from a compressed vector using autoregression up to a 16$\times$16 feature map, followed by upsampling and convolution. This discovery sheds light on the underlying partial differential equations (PDEs) governing the latent feature space. Additionally, we investigate the application of FINOLA for self-supervised learning through a simple masked prediction technique. By encoding a single unmasked quadrant block, we can autoregressively predict the surrounding masked region. Remarkably, this pre-trained representation proves effective for image classification and object detection tasks, even in lightweight networks, without requiring fine-tuning. The code will be made publicly available.
https://arxiv.org/abs/2305.16319
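The recurrence named in the FINOLA abstract above (normalize, then apply a linear map, one step at a time) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation. It assumes that the normalization is a per-position standardization across channels, that two hypothetical matrices `a` and `b` drive the horizontal and vertical steps, and that the grid is filled down the first column and then along each row. None of these details are stated in the abstract.

```python
import math


def normalize(q, eps=1e-6):
    """Standardize one feature vector across its channels (assumed form of 'norm')."""
    mean = sum(q) / len(q)
    var = sum((c - mean) ** 2 for c in q) / len(q)
    std = math.sqrt(var + eps)
    return [(c - mean) / std for c in q]


def matvec(m, v):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(w * c for w, c in zip(row, v)) for row in m]


def step(q, m):
    """One first-order norm+linear step: q_next = q + m @ normalize(q)."""
    delta = matvec(m, normalize(q))
    return [qi + di for qi, di in zip(q, delta)]


def finola_grid(q0, a, b, height, width):
    """Expand a single vector q0 into a height x width map of feature vectors.

    a: hypothetical matrix for horizontal steps, b: for vertical steps.
    The fill order (first column, then rows) is an assumption of this sketch.
    """
    grid = [[None] * width for _ in range(height)]
    grid[0][0] = list(q0)
    for y in range(1, height):
        grid[y][0] = step(grid[y - 1][0], b)
    for y in range(height):
        for x in range(1, width):
            grid[y][x] = step(grid[y][x - 1], a)
    return grid
```

Calling `finola_grid(q0, a, b, 16, 16)` produces a 16x16 map of vectors from one starting vector, which is the stage the abstract places before upsampling and convolution.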
For computer vision tasks, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs remain sensitive to small shifts in the input image. To address this, we introduce novel designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve truly shift-equivariant ViTs on four well-established models, namely, Swin, SwinV2, MViTv2, and CvT, both in theory and practice. Empirically, we tested these models on image classification and semantic segmentation, achieving competitive performance across three different datasets while maintaining 100% shift consistency.
https://arxiv.org/abs/2305.16316
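The "shift consistency" figure quoted in the abstract above can be made concrete with a small measurement routine. The sketch below is one plausible reading, not the paper's evaluation code: it assumes circular shifts and counts how often a classifier's prediction on a shifted image matches its prediction on the unshifted one. The function names and the `predict` callable are hypothetical.

```python
def circular_shift(image, dy, dx):
    """Circularly shift a 2D list-of-rows image down by dy and right by dx."""
    h, w = len(image), len(image[0])
    return [[image[(r - dy) % h][(c - dx) % w] for c in range(w)]
            for r in range(h)]


def shift_consistency(predict, images, shifts):
    """Fraction of (image, shift) pairs whose prediction matches the unshifted one.

    predict: any callable mapping an image to a label (hypothetical stand-in
    for a trained model).
    """
    agree = 0
    total = 0
    for image in images:
        base = predict(image)
        for dy, dx in shifts:
            if predict(circular_shift(image, dy, dx)) == base:
                agree += 1
            total += 1
    return agree / total
```

A predictor that only looks at a global statistic, such as the sum of all pixels, scores 1.0 under this measure, while one that reads a fixed pixel location generally does not.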
Many fine-grained classification tasks, like rare animal identification, have limited training data and consequently classifiers trained on these datasets often fail to generalize to variations in the domain like changes in weather or location. As such, we explore how natural language descriptions of the domains seen in training data can be used with large vision models trained on diverse pretraining datasets to generate useful variations of the training data. We introduce ALIA (Automated Language-guided Image Augmentation), a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains and augment the training data via language-guided image editing. To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information. The resulting dataset is visually consistent with the original training data and offers significantly enhanced diversity. On fine-grained and cluttered datasets for classification and detection, ALIA surpasses traditional data augmentation and text-to-image generated data by up to 15\%, often even outperforming equivalent additions of real data. Code is available at this https URL.
https://arxiv.org/abs/2305.16289
The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NLP) during the past decade. However, the demands of long-document analysis are quite different from those of shorter texts, and the ever-increasing size of documents uploaded online makes NLP on long documents a critical area of research. This paper surveys the current state of the art in the domain, overviewing the relevant neural building blocks and subsequently focusing on two main NLP tasks, Document Classification and Summarization, while also mentioning uses in Sentiment Analysis. We detail the challenges, issues and current solutions related to long-document NLP. We also list publicly available, labelled, long-document datasets used in current research.
https://arxiv.org/abs/2305.16259
This paper studies the online node classification problem under a transductive learning setting. Current methods either invert a graph kernel matrix with $\mathcal{O}(n^3)$ runtime and $\mathcal{O}(n^2)$ space complexity or sample a large volume of random spanning trees, and are thus difficult to scale to large graphs. In this work, we propose an improvement based on the \textit{online relaxation} technique introduced by a series of works (Rakhlin et al., 2012; Rakhlin and Sridharan, 2015; 2017). We first prove an effective regret $\mathcal{O}(\sqrt{n^{1+\gamma}})$ when suitably parameterized graph kernels are chosen, then propose an approximate algorithm FastONL enjoying $\mathcal{O}(k\sqrt{n^{1+\gamma}})$ regret based on this relaxation. The key to FastONL is a \textit{generalized local push} method that effectively approximates inverse matrix columns and applies to a series of popular kernels. Furthermore, the per-prediction cost is $\mathcal{O}(\text{vol}({\mathcal{S}})\log 1/\epsilon)$, locally dependent on the graph, with linear memory cost. Experiments show that our scalable method enjoys a better tradeoff between local and global consistency.
https://arxiv.org/abs/2305.16257
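To illustrate how a local push can approximate one column of an inverse matrix without inverting anything, the sketch below implements the classic approximate personalized PageRank push. It is a stand-in for intuition only and is not the generalized local push of FastONL, whose details the abstract does not give. It assumes an undirected graph stored as an adjacency dict in which every node has at least one neighbour; the names `local_push`, `alpha` and `eps` are choices of this sketch.

```python
from collections import deque


def local_push(adj, source, alpha=0.15, eps=1e-4):
    """Approximate the `source` column of alpha * (I - (1 - alpha) * A * D^-1)^-1.

    adj: dict mapping each node to a list of neighbours (undirected graph,
    every node with degree >= 1). Returns (p, r): the estimate and the
    residual. The loop only touches nodes whose residual is large, so the
    work stays local to the source.
    """
    p = {}
    r = {source: 1.0}
    queue = deque([source])
    queued = {source}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        deg = len(adj[u])
        ru = r.get(u, 0.0)
        if ru <= eps * deg:
            continue
        # Keep an alpha fraction at u, spread the rest evenly over neighbours.
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = 0.0
        share = (1.0 - alpha) * ru / deg
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] > eps * len(adj[v]) and v not in queued:
                queue.append(v)
                queued.add(v)
    return p, r
```

On termination every residual satisfies `r[u] <= eps * deg(u)`, and the total mass `sum(p) + sum(r)` stays equal to 1, so `eps` directly trades accuracy against the number of nodes visited.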
Real-life multilingual systems should be able to efficiently incorporate new languages as data distributions fed to the system evolve and shift over time. To do this, systems need to handle the issue of catastrophic forgetting, where the model performance drops for languages or tasks seen further in its past. In this paper, we study catastrophic forgetting, as well as methods to minimize this, in a massively multilingual continual learning framework involving up to 51 languages and covering both classification and sequence labeling tasks. We present LR ADJUST, a learning rate scheduling method that is simple, yet effective in preserving new information without strongly overwriting past knowledge. Furthermore, we show that this method is effective across multiple continual learning approaches. Finally, we provide further insights into the dynamics of catastrophic forgetting in this massively multilingual setup.
https://arxiv.org/abs/2305.16252
Face swapping combines one face's identity with another face's non-appearance attributes (expression, head pose, lighting) to generate a synthetic face. This technology is rapidly improving, but falls flat when reconstructing some attributes, particularly gaze. Image-based loss metrics that consider the full face do not effectively capture the perceptually important, yet spatially small, eye regions. Improving gaze in face swaps can improve naturalness and realism, benefiting applications in entertainment, human computer interaction, and more. Improved gaze will also directly improve Deepfake detection efforts, serving as ideal training data for classifiers that rely on gaze for classification. We propose a novel loss function that leverages gaze prediction to inform the face swap model during training and compare against existing methods. We find all methods to significantly benefit gaze in resulting face swaps.
https://arxiv.org/abs/2305.16138
Generative modeling has experienced substantial progress in recent years, particularly in text-to-image and text-to-video synthesis. However, the medical field has not yet fully exploited the potential of large-scale foundational models for synthetic data generation. In this paper, we introduce GenerateCT, the first method for text-conditional computed tomography (CT) generation, addressing the limitations in 3D medical imaging research and making our entire framework open-source. GenerateCT consists of a pre-trained large language model, a transformer-based text-conditional 3D chest CT generation architecture, and a text-conditional spatial super-resolution diffusion model. We also propose CT-ViT, which efficiently compresses CT volumes while preserving auto-regressiveness in-depth, enabling the generation of 3D CT volumes with variable numbers of axial slices. Our experiments demonstrate that GenerateCT can produce realistic, high-resolution, and high-fidelity 3D chest CT volumes consistent with medical language text prompts. We further investigate the potential of GenerateCT by training a model using generated CT volumes for multi-abnormality classification of chest CT volumes. Our contributions provide a valuable foundation for future research in text-conditional 3D medical image generation and have the potential to accelerate advancements in medical imaging research. Our code, pre-trained models, and generated data are available at this https URL.
https://arxiv.org/abs/2305.16037
Learning quality document embeddings is a fundamental problem in natural language processing (NLP), information retrieval (IR), recommendation systems, and search engines. Despite recent advances in the development of transformer-based models that produce sentence embeddings with self-contrastive learning, the encoding of long documents (thousands of words) is still challenging with respect to both efficiency and quality considerations. Therefore, we train Longformer-based document encoders using a state-of-the-art unsupervised contrastive learning method (SimCSE). Furthermore, we complement the baseline method (a siamese neural network) with additional convex neural networks based on functional Bregman divergence, aiming to enhance the quality of the output document representations. We show that overall the combination of a self-contrastive siamese network and our proposed neural Bregman network outperforms the baselines in two linear classification settings on three long document topic classification tasks from the legal and biomedical domains.
https://arxiv.org/abs/2305.16031
We propose SING (StabIlized and Normalized Gradient), a plug-and-play technique that improves the stability and generalization of the Adam(W) optimizer. SING is straightforward to implement and has minimal computational overhead, requiring only a layer-wise standardization of the gradients fed to Adam(W) without introducing additional hyper-parameters. We support the effectiveness and practicality of the proposed approach by showing improved results on a wide range of architectures, problems (such as image classification, depth estimation, and natural language processing), and in combination with other optimizers. We provide a theoretical analysis of the convergence of the method, and we show that by virtue of the standardization, SING can escape local minima narrower than a threshold that is inversely proportional to the network's depth.
https://arxiv.org/abs/2305.15997
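The SING abstract above describes a layer-wise standardization of gradients before they reach Adam(W). The sketch below shows one way that idea could look, assuming "standardization" means subtracting the layer's mean gradient and dividing by its standard deviation. The exact formula used by the authors is not given in the abstract, and `standardize`, `new_state` and `adam_step` are hypothetical names. The Adam update itself is the standard one, without weight decay.

```python
import math


def standardize(grad, eps=1e-8):
    """Zero-mean, unit-variance version of one layer's flattened gradient.

    A layer whose gradient entries are all equal maps to all zeros.
    """
    n = len(grad)
    mean = sum(grad) / n
    std = math.sqrt(sum((g - mean) ** 2 for g in grad) / n)
    return [(g - mean) / (std + eps) for g in grad]


def new_state():
    """Empty optimizer state: step counter plus first and second moments."""
    return {"t": 0, "m": {}, "v": {}}


def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update in which each layer's gradient is standardized first.

    params, grads: dicts mapping a layer name to a flat list of floats.
    """
    state["t"] += 1
    t = state["t"]
    for name, grad in grads.items():
        g = standardize(grad)  # the only change relative to plain Adam
        m = state["m"].setdefault(name, [0.0] * len(g))
        v = state["v"].setdefault(name, [0.0] * len(g))
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * gi
            v[i] = beta2 * v[i] + (1.0 - beta2) * gi * gi
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            params[name][i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

Because the standardized gradient is unchanged when the raw gradient is rescaled, a layer with very large or very small raw gradients feeds Adam values of the same magnitude, and no extra hyper-parameter is introduced beyond a numerical `eps`.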
In Knowledge Distillation, the teacher is generally much larger than the student, making the solution of the teacher likely to be difficult for the student to learn. To ease the mimicking difficulty, we introduce a triplet knowledge distillation mechanism named TriKD. Besides teacher and student, TriKD employs a third role called anchor model. Before distillation begins, the pre-trained anchor model delimits a subspace within the full solution space of the target problem. Solutions within the subspace are expected to be easy targets that the student could mimic well. Distillation then begins in an online manner, and the teacher is only allowed to express solutions within the aforementioned subspace. Surprisingly, benefiting from accurate but easy-to-mimic hints, the student can finally perform well. After the student is well trained, it can be used as the new anchor for new students, forming a curriculum learning strategy. Our experiments on image classification and face recognition with various models clearly demonstrate the effectiveness of our method. Furthermore, the proposed TriKD is also effective in dealing with the overfitting issue. Moreover, our theoretical analysis supports the rationality of our triplet distillation.
https://arxiv.org/abs/2305.15975
Despite the increasing relevance of explainable AI, assessing the quality of explanations remains a challenging issue. Due to the high costs associated with human-subject experiments, various proxy metrics are often used to approximately quantify explanation quality. Generally, one possible interpretation of the quality of an explanation is its inherent value for teaching a related concept to a student. In this work, we extend artificial simulatability studies to the domain of graph neural networks. Instead of costly human trials, we use explanation-supervisable graph neural networks to perform simulatability studies to quantify the inherent usefulness of attributional graph explanations. We perform an extensive ablation study to investigate the conditions under which the proposed analyses are most meaningful. We additionally validate our method's applicability on real-world graph classification and regression datasets. We find that relevant explanations can significantly boost the sample efficiency of graph neural networks and analyze the robustness towards noise and bias in the explanations. We believe that the notion of usefulness obtained from our proposed simulatability analysis provides a dimension of explanation quality that is largely orthogonal to the common practice of faithfulness and has great potential to expand the toolbox of explanation quality assessments, specifically for graph explanations.
https://arxiv.org/abs/2305.15961
Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited due to the domain gap between depth maps from 3D projection and training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using stable diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2\% for zero-shot classification on OBJ\_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6\% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance.
https://arxiv.org/abs/2305.15957
Detecting 3D mask attacks to a face recognition system is challenging. Although genuine faces and 3D face masks show significantly different remote photoplethysmography (rPPG) signals, rPPG-based face anti-spoofing methods often suffer from performance degradation due to unstable face alignment in the video sequence and weak rPPG signals. To enhance the rPPG signal in a motion-robust way, a landmark-anchored face stitching method is proposed to align the faces robustly and precisely at the pixel-wise level by using both SIFT keypoints and facial landmarks. To better encode the rPPG signal, a weighted spatial-temporal representation is proposed, which emphasizes the face regions with rich blood vessels. In addition, characteristics of rPPG signals in different color spaces are jointly utilized. To improve the generalization capability, a lightweight EfficientNet with a Gated Recurrent Unit (GRU) is designed to extract both spatial and temporal features from the rPPG spatial-temporal representation for classification. The proposed method is compared with the state-of-the-art methods on five benchmark datasets under both intra-dataset and cross-dataset evaluations. The proposed method shows a significant and consistent improvement in performance over other state-of-the-art rPPG-based methods for face spoofing detection.
https://arxiv.org/abs/2305.15940
Unsupervised commonsense reasoning (UCR) is becoming increasingly popular as the construction of commonsense reasoning datasets is expensive, and they are inevitably limited in their scope. A popular approach to UCR is to fine-tune language models with external knowledge (e.g., knowledge graphs), but this usually requires a large number of training examples. In this paper, we propose to transform the downstream multiple choice question answering task into a simpler binary classification task by ranking all candidate answers according to their reasonableness. To this end, for training the model, we convert the knowledge graph triples into reasonable and unreasonable texts. Extensive experimental results show the effectiveness of our approach on various multiple choice question answering benchmarks. Furthermore, compared with existing UCR approaches using KGs, ours is less data hungry. Our code is available at this https URL.
Our code is available at https://github.com/lihaoyi21/UCR.
https://arxiv.org/abs/2305.15932
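The two steps in the abstract above, turning knowledge graph triples into reasonable and unreasonable texts and then answering a multiple choice question by ranking candidates, can be sketched as follows. The abstract does not say how unreasonable texts are built, so this sketch assumes they come from swapping in a tail taken from another triple. The plain "head relation tail" template and the `score` callable are likewise hypothetical.

```python
import random


def triples_to_examples(triples, seed=0):
    """Build (text, label) pairs: label 1 for a real triple, 0 for a corrupted one.

    Corruption replaces the tail with a different tail from the same set.
    A corrupted text can still happen to be true; this sketch does not filter
    such cases.
    """
    rng = random.Random(seed)
    tails = sorted({tail for _, _, tail in triples})
    examples = []
    for head, relation, tail in triples:
        examples.append((f"{head} {relation} {tail}", 1))
        wrong = [t for t in tails if t != tail]
        if wrong:
            examples.append((f"{head} {relation} {rng.choice(wrong)}", 0))
    return examples


def pick_answer(question, candidates, score):
    """Answer a multiple choice question by ranking candidates.

    score: callable (question, candidate) -> reasonableness, standing in for
    a binary classifier trained on the examples above.
    """
    return max(candidates, key=lambda c: score(question, c))
```

With this framing the model never sees the downstream question format during training. It only learns to tell reasonable statements from unreasonable ones, and the choice among answers falls out of the ranking.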
Attention mechanisms have greatly improved the performance of deep-learning models on visual, NLP, and multimodal tasks while also providing tools to aid in the model's interpretability. In particular, attention scores over input regions or concrete image features can be used to measure how much the attended elements contribute to the model inference. The recently proposed Concept Transformer (CT) generalizes the Transformer attention mechanism from such low-level input features to more abstract, intermediate-level latent concepts that better allow human analysts to more directly assess an explanation for the reasoning of the model about any particular output classification. However, the concept learning employed by CT implicitly assumes that across every image in a class, each image patch makes the same contribution to concepts that characterize membership in that class. Instead of using the CT's image-patch-centric concepts, object-centric concepts could lead to better classification performance as well as better explainability. Thus, we propose Concept-Centric Transformers (CCT), a new family of concept transformers that provides more robust explanations and performance by integrating a novel concept-extraction module based on object-centric learning. We test our proposed CCT against the CT and several other existing approaches on classification problems for MNIST (odd/even), CIFAR100 (super-classes), and CUB-200-2011 (bird species). Our experiments demonstrate that CCT not only achieves significantly better classification accuracy than all selected benchmark classifiers across all three of our test problems, but it generates more consistent concept-based explanations of classification output when compared to CT.
https://arxiv.org/abs/2305.15775
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific. In this paper, we state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models. Our approach is based on the observation that student features typically contain more noise than teacher features due to the smaller capacity of the student model. To address this, we propose to denoise student features using a diffusion model trained on teacher features. This allows us to perform better distillation between the refined clean feature and the teacher feature. Additionally, we introduce a light-weight diffusion model with a linear autoencoder to reduce the computation cost and an adaptive noise matching module to improve the denoising performance. Extensive experiments demonstrate that DiffKD is effective across various types of features and achieves state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks. Code will be available at this https URL.
https://arxiv.org/abs/2305.15712
Monocular 3D detection is a challenging task due to the lack of accurate 3D information. Existing approaches typically rely on geometry constraints and dense depth estimates to facilitate the learning, but often fail to fully exploit the benefits of three-dimensional feature extraction in frustum and 3D space. In this paper, we propose \textbf{OccupancyM3D}, a method of learning occupancy for monocular 3D detection. It directly learns occupancy in frustum and 3D space, leading to more discriminative and informative 3D features and representations. Specifically, by using synchronized raw sparse LiDAR point clouds, we define the space status and generate voxel-based occupancy labels. We formulate occupancy prediction as a simple classification problem and design associated occupancy losses. Resulting occupancy estimates are employed to enhance original frustum/3D features. As a result, experiments on KITTI and Waymo open datasets demonstrate that the proposed method achieves a new state of the art and surpasses other methods by a significant margin. Codes and pre-trained models will be available at \url{this https URL}.
https://arxiv.org/abs/2305.15694
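The voxel-based occupancy labels mentioned in the abstract above can be illustrated with a simple voxelization of a point cloud. This sketch marks a voxel as occupied when at least one LiDAR point falls inside it. The paper's actual definition of space status is not spelled out in the abstract, so the binary rule, the axis-aligned grid, and the parameter names `origin`, `voxel_size` and `grid_shape` are all assumptions.

```python
import math


def occupancy_labels(points, origin, voxel_size, grid_shape):
    """Return the set of (i, j, k) voxel indices containing at least one point.

    points: iterable of (x, y, z) coordinates.
    origin: (x, y, z) of the grid's minimum corner.
    voxel_size: edge length of each cubic voxel.
    grid_shape: (nx, ny, nz) number of voxels per axis.
    Points outside the grid are ignored; voxels not returned count as empty.
    """
    nx, ny, nz = grid_shape
    occupied = set()
    for x, y, z in points:
        i = math.floor((x - origin[0]) / voxel_size)
        j = math.floor((y - origin[1]) / voxel_size)
        k = math.floor((z - origin[2]) / voxel_size)
        if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
            occupied.add((i, j, k))
    return occupied
```

Treating each voxel as a binary target in this way is what lets occupancy prediction be posed as a classification problem. A sparse set of indices is returned here because LiDAR point clouds typically fill only a small fraction of the grid.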
Recent studies have demonstrated that natural-language prompts can help to leverage the knowledge learned by pre-trained language models for the binary sentence-level sentiment classification task. Specifically, these methods utilize few-shot learning settings to fine-tune the sentiment classification model using manual or automatically generated prompts. However, the performance of these methods is sensitive to the perturbations of the utilized prompts. Furthermore, these methods depend on a few labeled instances for automatic prompt generation and prompt ranking. This study aims to find high-quality prompts for the given task in a zero-shot setting. Given a base prompt, our proposed approach automatically generates multiple prompts similar to the base prompt employing positional, reasoning, and paraphrasing techniques and then ranks the prompts using a novel metric. We empirically demonstrate that the top-ranked prompts are high-quality and significantly outperform the base prompt and the prompts generated using few-shot learning for the binary sentence-level sentiment classification task.
https://arxiv.org/abs/2305.15689
In text classification, the traditional attention mechanisms usually focus too much on frequent words, and need extensive labeled data in order to learn. This paper proposes a perturbation-based self-supervised attention approach to guide attention learning without any annotation overhead. Specifically, we add as much noise as possible to all the words in the sentence without changing their semantics and predictions. We hypothesize that words that tolerate more noise are less significant, and we can use this information to refine the attention distribution. Experimental results on three text classification tasks show that our approach can significantly improve the performance of current attention-based models, and is more effective than existing self-supervised methods. We also provide a visualization analysis to verify the effectiveness of our approach.
https://arxiv.org/abs/2305.15684