Counterfactual Explanations (CEs) have emerged as a major paradigm in explainable AI research, providing recourse recommendations for users affected by the decisions of machine learning models. However, when slight changes occur in the parameters of the underlying model, CEs found by existing methods often become invalid for the updated models. The literature lacks a way to certify deterministic robustness guarantees for CEs under model changes: existing methods for improving CEs' robustness are heuristic, and their robustness is evaluated empirically using only a limited number of retrained models. To bridge this gap, we propose a novel interval abstraction technique for parametric machine learning models, which allows us to obtain provable robustness guarantees for CEs under the possibly infinite set of plausible model changes $\Delta$. We formalise our robustness notion as $\Delta$-robustness for CEs, in both binary and multi-class classification settings. We formulate procedures to verify $\Delta$-robustness based on Mixed Integer Linear Programming, using which we further propose two algorithms to generate CEs that are $\Delta$-robust. In an extensive empirical study, we demonstrate how our approach can be used in practice by discussing two strategies for determining the appropriate hyperparameter in our method, and we quantitatively benchmark the CEs generated by eleven methods, highlighting the effectiveness of our algorithms in finding robust CEs.
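The interval idea behind $\Delta$-robustness can be sketched for a linear binary classifier (a toy illustration, not the paper's MILP procedure; the uniform $\pm\delta$ perturbation of weights and bias, and all function names below, are assumptions):

```python
# Hedged sketch: interval abstraction for a linear binary classifier.
# A counterfactual x counts as "Delta-robust" here if every model whose
# weights and bias lie within +/-delta of the original still scores x
# positively, i.e. the worst-case score over the interval stays > 0.

def worst_case_score(x, w, b, delta):
    """Lower bound on w'.x + b' over all w' in [w-delta, w+delta]
    and b' in [b-delta, b+delta], via interval arithmetic."""
    lo = b - delta
    for xi, wi in zip(x, w):
        w_lo, w_hi = wi - delta, wi + delta
        lo += min(w_lo * xi, w_hi * xi)  # sign of xi picks the bound
    return lo

def is_delta_robust(x, w, b, delta):
    return worst_case_score(x, w, b, delta) > 0

x, w, b = [1.0, 2.0], [0.5, 0.3], -0.2
print(is_delta_robust(x, w, b, 0.05))  # small delta: guarantee holds
print(is_delta_robust(x, w, b, 0.5))   # large delta: guarantee lost
```

For deeper networks this worst-case check no longer reduces to simple interval arithmetic, which is where the paper's MILP-based verification plays the corresponding role.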
https://arxiv.org/abs/2404.13736
Despite the rapid evolution of semantic segmentation for land cover classification in high-resolution remote sensing imagery, integrating multiple data modalities such as Digital Surface Model (DSM), RGB, and Near-infrared (NIR) remains a challenge. Current methods often process only two types of data, missing out on the rich information that additional modalities can provide. Addressing this gap, we propose a novel \textbf{L}ightweight \textbf{M}ultimodal data \textbf{F}usion \textbf{Net}work (LMFNet) to accomplish the tasks of fusion and semantic segmentation of multimodal remote sensing images. LMFNet uniquely accommodates various data types simultaneously, including RGB, NirRG, and DSM, through a weight-sharing, multi-branch vision transformer that minimizes parameter count while ensuring robust feature extraction. Our proposed multimodal fusion module integrates a \textit{Multimodal Feature Fusion Reconstruction Layer} and a \textit{Multimodal Feature Self-Attention Fusion Layer}, which can reconstruct and fuse multimodal features. Extensive testing on public datasets such as US3D, ISPRS Potsdam, and ISPRS Vaihingen demonstrates the effectiveness of LMFNet. Specifically, it achieves a mean Intersection over Union ($mIoU$) of 85.09\% on the US3D dataset, marking a significant improvement over existing methods. Compared to unimodal approaches, LMFNet shows a 10\% enhancement in $mIoU$ with only a 0.5M increase in parameter count. Furthermore, against bimodal methods, our approach with trimodal inputs enhances $mIoU$ by 0.46 percentage points.
https://arxiv.org/abs/2404.13659
Graph neural networks (GNNs) have revolutionized the field of machine learning on non-Euclidean data such as graphs and networks. GNNs effectively implement node representation learning through neighborhood aggregation and achieve impressive results in many graph-related tasks. However, most neighborhood aggregation approaches are summation-based, which can be problematic as they may not be sufficiently expressive to encode informative graph structures. Furthermore, though the graph pooling module is also of vital importance for graph learning, especially for the task of graph classification, research on graph down-sampling mechanisms is rather limited. To address the above challenges, we propose a concatenation-based graph convolution mechanism that injectively updates node representations to maximize the discriminative power in distinguishing non-isomorphic subgraphs. In addition, we design a novel graph pooling module, called WL-SortPool, to learn important subgraph patterns in a deep-learning manner. WL-SortPool layer-wise sorts node representations (i.e. continuous WL colors) to separately learn the relative importance of subtrees with different depths for the purpose of classification, thus better characterizing the complex graph topology and rich information encoded in the graph. We propose a novel Subgraph Pattern GNN (SPGNN) architecture that incorporates these enhancements. We test the proposed SPGNN architecture on many graph classification benchmarks. Experimental results show that our method can achieve highly competitive results with state-of-the-art graph kernels and other GNN approaches.
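The expressiveness gap that motivates the concatenation-based convolution can be seen on a tiny example (a sketch with assumed scalar node features; not SPGNN's actual update rule):

```python
# Hedged sketch: summation-based aggregation can collide on distinct
# neighborhoods, while a concatenation-style update over the sorted
# neighbor multiset is injective and keeps them apart.

def sum_update(node_feat, neighbor_feats):
    return node_feat + sum(neighbor_feats)

def concat_update(node_feat, neighbor_feats):
    # sorting the multiset before concatenating makes the map injective
    return (node_feat,) + tuple(sorted(neighbor_feats))

a = {"node": 1, "neighbors": [1, 3]}
b = {"node": 1, "neighbors": [2, 2]}

# summation cannot tell the two neighborhoods apart...
print(sum_update(a["node"], a["neighbors"]) == sum_update(b["node"], b["neighbors"]))      # True
# ...while concatenation distinguishes them
print(concat_update(a["node"], a["neighbors"]) == concat_update(b["node"], b["neighbors"]))  # False
```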
https://arxiv.org/abs/2404.13655
Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the "local" attention weights are compared at a "global" level, which may cause some "locally" important weights to be pruned due to their relatively small magnitude "globally". The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels. To solve the aforementioned issues, we have developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that "local" attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. We validate the usefulness and strengths of our method on Swin Transformers of different sizes for ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% of the FLOPs and 52.7% of the parameters of Swin-B. When we reduce the FLOPs and parameters of Swin-S each by 33.2%, we can even achieve a 0.8% higher relative top-5 accuracy than the original model. Code is available at: this https URL
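The "local vs global" drawback of magnitude pruning can be reproduced in a few lines (an illustrative sketch with random weights and quantile thresholds; DIMAP's actual criterion is based on module-level information distortion, not quantiles):

```python
import numpy as np

# Hedged sketch of the problem DIMAP targets: module A has uniformly
# small weights, so a single global magnitude threshold wipes it out,
# while a per-module threshold keeps its locally important weights.

rng = np.random.default_rng(0)
module_a = rng.normal(0, 0.1, 100)   # small-magnitude module
module_b = rng.normal(0, 1.0, 100)   # large-magnitude module

def keep_mask_global(mods, sparsity):
    allw = np.abs(np.concatenate(mods))
    thr = np.quantile(allw, sparsity)          # one global threshold
    return [np.abs(m) >= thr for m in mods]

def keep_mask_per_module(mods, sparsity):
    # each module is thresholded against its own weight distribution
    return [np.abs(m) >= np.quantile(np.abs(m), sparsity) for m in mods]

g = keep_mask_global([module_a, module_b], 0.5)
p = keep_mask_per_module([module_a, module_b], 0.5)
print(g[0].mean(), p[0].mean())  # global keeps far less of module A than per-module does
```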
https://arxiv.org/abs/2404.13648
In this work, we propose a novel tree-based explanation technique, PEACH (Pretrained-embedding Explanation Across Contextual and Hierarchical Structure), that can explain how text-based documents are classified by using any pretrained contextual embeddings in a tree-based human-interpretable manner. Note that PEACH can adopt any contextual embeddings of pretrained language models (PLMs) as a training input for the decision tree. Using the proposed PEACH, we perform a comprehensive analysis of several contextual embeddings on nine different NLP text classification benchmarks. This analysis demonstrates the flexibility of the model by applying several PLM contextual embeddings, its attribute selections, scaling, and clustering methods. Furthermore, we show the utility of explanations by visualising the feature selection and important trend of text classification via human-interpretable word-cloud-based trees, which clearly identify model mistakes and assist in dataset debugging. Besides interpretability, PEACH's classification performance matches or exceeds that of the pretrained models.
https://arxiv.org/abs/2404.13645
Arbitrary style transfer has attracted widespread attention in research and boasts numerous practical applications. The existing methods, which either employ cross-attention to incorporate deep style attributes into content attributes or use adaptive normalization to adjust content features, fail to generate high-quality stylized images. In this paper, we introduce an innovative technique to improve the quality of stylized images. Firstly, we propose Style Consistency Instance Normalization (SCIN), a method to refine the alignment between content and style features. In addition, we have developed an Instance-based Contrastive Learning (ICL) approach designed to understand the relationships among various styles, thereby enhancing the quality of the resulting stylized images. Recognizing that VGG networks are more adept at extracting classification features and less well suited to capturing style features, we have also introduced the Perception Encoder (PE) to capture style features. Extensive experiments demonstrate that our proposed method generates high-quality stylized images and effectively prevents artifacts compared with the existing state-of-the-art methods.
https://arxiv.org/abs/2404.13584
Online task-free continual learning (OTFCL) is a more challenging variant of continual learning which emphasizes the gradual shift of task boundaries and learns in an online mode. Existing methods rely on a memory buffer composed of old samples to prevent forgetting. However, the use of memory buffers not only raises privacy concerns but also hinders the efficient learning of new samples. To address this problem, we propose a novel framework called I2CANSAY that gets rid of the dependence on memory buffers and efficiently learns the knowledge of new data from one-shot samples. Concretely, our framework comprises two main modules. Firstly, the Inter-Class Analogical Augmentation (ICAN) module generates diverse pseudo-features for old classes based on the inter-class analogy of feature distributions for different new classes, serving as a substitute for the memory buffer. Secondly, the Intra-Class Significance Analysis (ISAY) module analyzes the significance of attributes for each class via its distribution standard deviation, and generates the importance vector as a correction bias for the linear classifier, thereby enhancing the capability of learning from new samples. We run our experiments on four popular image classification datasets: CoRe50, CIFAR-10, CIFAR-100, and CUB-200; our approach outperforms the prior state of the art by a large margin.
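A minimal sketch of the ISAY intuition, assuming importance is taken inversely proportional to the per-class attribute standard deviation (the exact construction of the paper's correction bias may differ, and all names below are assumptions):

```python
import numpy as np

# Hedged sketch: attributes whose values vary little within a class are
# treated as more significant for that class, so importance is the
# inverse per-class standard deviation, normalised to sum to one.

def importance_vectors(features, labels, eps=1e-6):
    imp = {}
    for c in np.unique(labels):
        std = features[labels == c].std(axis=0)
        v = 1.0 / (std + eps)        # low variance -> high importance
        imp[c] = v / v.sum()         # normalise to a distribution
    return imp

rng = np.random.default_rng(1)
# class 0: feature 0 stable, feature 1 noisy; class 1 reversed
f0 = np.column_stack([rng.normal(5, 0.1, 50), rng.normal(0, 2.0, 50)])
f1 = np.column_stack([rng.normal(0, 2.0, 50), rng.normal(5, 0.1, 50)])
X = np.vstack([f0, f1])
y = np.array([0] * 50 + [1] * 50)

imp = importance_vectors(X, y)
print(imp[0])  # importance concentrated on feature 0 for class 0
```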
https://arxiv.org/abs/2404.13576
This study introduces an innovative approach to classifying various types of Persian rice using image-based deep learning techniques, highlighting the practical application of everyday technology in food categorization. Recognizing the diversity of Persian rice and its culinary significance, we leveraged the capabilities of convolutional neural networks (CNNs), specifically by fine-tuning a ResNet model for accurate identification of different rice varieties and employing a U-Net architecture for precise segmentation of rice grains in bulk images. This dual-methodology framework allows for both individual grain classification and comprehensive analysis of bulk rice samples, addressing two crucial aspects of rice quality assessment. Utilizing images captured with consumer-grade cell phones reflects a realistic scenario in which individuals can leverage this technology for assistance with grocery shopping and meal preparation. The dataset, comprising various rice types photographed under natural conditions without professional lighting or equipment, presents a challenging yet practical classification problem. Our findings demonstrate the feasibility of using non-professional images for food classification and the potential of deep learning models, like ResNet and U-Net, to adapt to the nuances of everyday objects and textures. This study contributes to the field by providing insights into the applicability of image-based deep learning in daily life, specifically for enhancing consumer experiences and knowledge in food selection. Furthermore, it opens avenues for extending this approach to other food categories and practical applications, emphasizing the role of accessible technology in bridging the gap between sophisticated computational methods and everyday tasks.
https://arxiv.org/abs/2404.13555
Machine learning models have made incredible progress, but they still struggle when applied to examples from unseen domains. This study focuses on a specific problem of domain generalization, where a model is trained on one source domain and tested on multiple target domains that are unseen during training. We propose IMO: Invariant features Masks for Out-of-Distribution text classification, to achieve OOD generalization by learning invariant features. During training, IMO learns sparse mask layers that remove features irrelevant to prediction, while the remaining features stay invariant. Additionally, IMO has an attention module at the token level to focus on tokens that are useful for prediction. Our comprehensive experiments show that IMO substantially outperforms strong baselines across various evaluation metrics and settings.
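The sparse mask layer can be sketched as a thresholded sigmoid gate (illustrative only; IMO learns the mask logits end-to-end together with the classifier, and the names below are assumptions):

```python
import numpy as np

# Hedged sketch of a sparse feature mask: a sigmoid gate over learned
# mask logits is binarised at a threshold, zeroing out spurious
# (domain-specific) feature dimensions and keeping invariant ones.

def apply_mask(features, mask_logits, threshold=0.5):
    gate = 1.0 / (1.0 + np.exp(-mask_logits))     # sigmoid in [0, 1]
    mask = (gate > threshold).astype(features.dtype)
    return features * mask, mask

x = np.array([0.7, -1.2, 3.0, 0.1])
logits = np.array([4.0, -3.0, 5.0, -2.0])  # "learned": keep dims 0 and 2
masked, mask = apply_mask(x, logits)
print(masked)  # dims 1 and 3 are zeroed out
```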
https://arxiv.org/abs/2404.13504
Transformer has been applied in the field of computer vision due to its excellent performance in natural language processing, surpassing traditional convolutional neural networks and achieving new state-of-the-art results. ViT divides an image into several local patches, known as "visual sentences". However, the information contained in the image is vast and complex, and focusing only on the features at the "visual sentence" level is not enough. The features between local patches should also be taken into consideration. In order to achieve further improvement, the TNT model was proposed, whose algorithm further divides the image into smaller patches, namely "visual words," achieving more accurate results. The core of Transformer is the Multi-Head Attention mechanism, and traditional attention mechanisms ignore interactions across different attention heads. In order to reduce redundancy and improve utilization, we introduce the nested algorithm and apply the Nested-TNT to image classification tasks. The experiment confirms that the proposed model has achieved better classification performance than ViT and TNT, exceeding them by 2.25% and 1.1% on CIFAR10 and by 2.78% and 0.25% on FLOWERS102, respectively.
https://arxiv.org/abs/2404.13434
The majority of existing hyperspectral anomaly detection (HAD) methods use the low-rank representation (LRR) model to separate the background and anomaly components, where the anomaly component is optimized by handcrafted sparse priors (e.g., $\ell_{2,1}$-norm). However, this may not be ideal since they overlook the spatial structure present in anomalies and make the detection result largely dependent on manually set sparsity. To tackle these problems, we redefine the optimization criterion for the anomaly component in the LRR model with a self-supervised network called self-supervised anomaly prior (SAP). This prior is obtained by the pretext task of self-supervised learning, which is customized to learn the characteristics of hyperspectral anomalies. Specifically, this pretext task is a classification task to distinguish the original hyperspectral image (HSI) and the pseudo-anomaly HSI, where the pseudo-anomaly is generated from the original HSI and designed as a prism with arbitrary polygon bases and arbitrary spectral bands. In addition, a dual-purified strategy is proposed to provide a more refined background representation with an enriched background dictionary, facilitating the separation of anomalies from complex backgrounds. Extensive experiments on various hyperspectral datasets demonstrate that the proposed SAP offers a more accurate and interpretable solution than other advanced HAD methods.
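The pseudo-anomaly generation step can be approximated as follows (a simplified sketch: a rectangular patch over a random band range stands in for the paper's arbitrary-polygon prism, and all names are assumptions):

```python
import numpy as np

# Hedged sketch of SAP-style pseudo-anomaly generation: a spectrum
# sampled elsewhere in the image is implanted over a random spatial
# region and a random subset of bands, giving the pretext classifier
# "anomalous" training examples derived from the original HSI alone.

def implant_pseudo_anomaly(hsi, rng):
    h, w, bands = hsi.shape
    out = hsi.copy()
    r0, c0 = rng.integers(0, h - 4), rng.integers(0, w - 4)
    b0, b1 = sorted(rng.integers(0, bands, size=2))
    # donor spectrum from a random pixel, restricted to the band range
    src = hsi[rng.integers(0, h), rng.integers(0, w), b0:b1 + 1]
    out[r0:r0 + 4, c0:c0 + 4, b0:b1 + 1] = src   # broadcast over the patch
    return out

rng = np.random.default_rng(3)
hsi = rng.random((32, 32, 20))        # toy hyperspectral cube
pseudo = implant_pseudo_anomaly(hsi, rng)
print((pseudo != hsi).any())  # True: a pseudo-anomaly was inserted
```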
https://arxiv.org/abs/2404.13342
The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison is still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien. Extrinsic evaluation, in turn, is performed via the Out-of-Vocabulary Generalization Challenge 1.0 benchmark, which consists of three newly specified downstream text classification tasks. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalizations compared to morphological tokenization for semantic compositionality of word meanings.
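The morphological/alien distinction can be sketched as a boundary-alignment check (a simplification; UniMorph Labeller's actual rules are richer, and the helper names below are assumptions):

```python
# Hedged sketch: a subword tokenization is labelled "morphological"
# when every token boundary coincides with a morpheme boundary of the
# word, and "alien" otherwise.

def boundary_set(pieces):
    cuts, pos = set(), 0
    for p in pieces[:-1]:           # the final piece adds no cut
        pos += len(p)
        cuts.add(pos)
    return cuts

def label(tokens, morphemes):
    ok = boundary_set(tokens) <= boundary_set(morphemes)
    return "morphological" if ok else "alien"

print(label(["un", "happi", "ness"], ["un", "happi", "ness"]))  # morphological
print(label(["unh", "appin", "ess"], ["un", "happi", "ness"]))  # alien
```

A single whole-word token has no internal boundaries and is trivially morphological under this check.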
https://arxiv.org/abs/2404.13292
Ultrasonic metal welding (UMW) is a key joining technology with widespread industrial applications. Condition monitoring (CM) capabilities are critically needed in UMW applications because process anomalies significantly deteriorate the joining quality. Recently, machine learning models emerged as a promising tool for CM in many manufacturing applications due to their ability to learn complex patterns. Yet, the successful deployment of these models requires substantial training data that may be expensive and time-consuming to collect. Additionally, many existing machine learning models lack generalizability and cannot be directly applied to new process configurations (i.e., domains). Such issues may be potentially alleviated by pooling data across manufacturers, but data sharing raises critical data privacy concerns. To address these challenges, this paper presents a Federated Transfer Learning with Task Personalization (FTL-TP) framework that provides domain generalization capabilities in distributed learning while ensuring data privacy. By effectively learning a unified representation from feature space, FTL-TP can adapt CM models for clients working on similar tasks, thereby enhancing their overall adaptability and performance jointly. To demonstrate the effectiveness of FTL-TP, we investigate two distinct UMW CM tasks, tool condition monitoring and workpiece surface condition classification. Compared with state-of-the-art FL algorithms, FTL-TP achieves a 5.35%--8.08% improvement of accuracy in CM in new target domains. FTL-TP is also shown to perform excellently in challenging scenarios involving unbalanced data distributions and limited client fractions. Furthermore, by implementing the FTL-TP method on an edge-cloud architecture, we show that this method is both viable and efficient in practice. The FTL-TP framework is readily extensible to various other manufacturing applications.
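The federated aggregation step underlying such frameworks can be sketched as plain FedAvg (generic; FTL-TP adds transfer learning and task personalization on top of this, which is not shown):

```python
import numpy as np

# Hedged sketch of federated averaging: clients never share raw data,
# only model weights, which the server combines with a weighted average
# proportional to each client's local dataset size.

def fedavg(client_weights, client_sizes):
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

w1 = np.array([1.0, 2.0])   # client with 100 local samples
w2 = np.array([3.0, 4.0])   # client with 300 local samples
print(fedavg([w1, w2], [100, 300]))  # [2.5 3.5]
```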
https://arxiv.org/abs/2404.13278
Seamless integration of physical objects as interactive digital entities remains a challenge for spatial computing. This paper introduces Augmented Object Intelligence (AOI), a novel XR interaction paradigm designed to blur the lines between digital and physical by endowing real-world objects with the ability to interact as if they were digital, where every object has the potential to serve as a portal to vast digital functionalities. Our approach utilizes object segmentation and classification, combined with the power of Multimodal Large Language Models (MLLMs), to facilitate these interactions. We implement the AOI concept in the form of XR-Objects, an open-source prototype system that provides a platform for users to engage with their physical environment in rich and contextually relevant ways. This system enables analog objects to not only convey information but also to initiate digital actions, such as querying for details or executing tasks. Our contributions are threefold: (1) we define the AOI concept and detail its advantages over traditional AI assistants, (2) detail the XR-Objects system's open-source design and implementation, and (3) show its versatility through a variety of use cases and a user study.
https://arxiv.org/abs/2404.13274
Advancements in deep learning are revolutionizing the classification of remote-sensing images. Transformer-based architectures, utilizing self-attention mechanisms, have emerged as alternatives to conventional convolution methods, enabling the capture of long-range dependencies along with global relationships in the image. Motivated by these advancements, this paper presents StrideNET, a novel dual-branch architecture designed for terrain recognition and implicit properties estimation. The terrain recognition branch utilizes the Swin Transformer, leveraging its hierarchical representation and low computational cost to efficiently capture both local and global features. The terrain properties branch focuses on the extraction of surface properties such as roughness and slipperiness using a statistical texture analysis method. By computing surface terrain properties, an enhanced environmental perception can be obtained. The StrideNET model is trained on a dataset comprising four target terrain classes: Grassy, Marshy, Sandy, and Rocky. StrideNET attains competitive performance compared to contemporary methods. The implications of this work extend to various applications, including environmental monitoring, land use and land cover (LULC) classification, disaster response, precision agriculture, and much more.
https://arxiv.org/abs/2404.13270
In recent years, Vision Transformers (ViTs) have shown promising classification performance over Convolutional Neural Networks (CNNs) due to their self-attention mechanism. Many researchers have incorporated ViTs for Hyperspectral Image (HSI) classification. HSIs are characterised by narrow contiguous spectral bands, providing rich spectral data. Although ViTs excel with sequential data, they cannot extract spectral-spatial information like CNNs. Furthermore, to have high classification performance, there should be a strong interaction between the HSI token and the class (CLS) token. To solve these issues, we propose a 3D-Convolution guided Spectral-Spatial Transformer (3D-ConvSST) for HSI classification that utilizes a 3D-Convolution Guided Residual Module (CGRM) in-between encoders to "fuse" the local spatial and spectral information and to enhance the feature propagation. Furthermore, we forego the class token and instead apply Global Average Pooling, which effectively encodes more discriminative and pertinent high-level features for classification. Extensive experiments have been conducted on three public HSI datasets to show the superiority of the proposed model over state-of-the-art traditional, convolutional, and Transformer models. The code is available at this https URL.
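Replacing the CLS token with global average pooling amounts to a one-line change before the classification head (a sketch with a plain linear head; the token count and dimensions below are assumptions):

```python
import numpy as np

# Hedged sketch: instead of reading the class token, average all output
# token embeddings and feed the pooled vector to a linear classifier.

def gap_head(tokens, w, b):
    pooled = tokens.mean(axis=0)   # global average over the token axis
    return pooled @ w + b          # classification logits

rng = np.random.default_rng(2)
tokens = rng.normal(size=(196, 64))   # 196 patch tokens, 64-dim each
w = rng.normal(size=(64, 10))
b = np.zeros(10)
logits = gap_head(tokens, w, b)
print(logits.shape)  # (10,)
```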
https://arxiv.org/abs/2404.13252
Representation learning from Gigapixel Whole Slide Images (WSI) poses a significant challenge in computational pathology due to the complicated nature of tissue structures and the scarcity of labeled data. Multiple-instance learning (MIL) methods have addressed this challenge by classifying slides from image patches, using feature encoders pretrained with Self-Supervised Learning (SSL). The performance of both SSL and MIL methods relies on the architecture of the feature encoder. This paper proposes leveraging the Vision Mamba (Vim) architecture, inspired by state space models, within the DINO framework for representation learning in computational pathology. We evaluate the performance of Vim against Vision Transformers (ViT) on the Camelyon16 dataset for both patch-level and slide-level classification. Our findings highlight Vim's enhanced performance compared to ViT, particularly at smaller scales, where Vim achieves an 8.21 increase in ROC AUC for models of similar size. An explainability analysis further highlights Vim's capabilities, revealing that Vim, unlike ViT, uniquely emulates the pathologist's workflow. This alignment with human expert analysis highlights Vim's potential in practical diagnostic settings and contributes significantly to developing effective representation-learning algorithms in computational pathology. We release the codes and pretrained weights at \url{this https URL}.
https://arxiv.org/abs/2404.13222
Semantic segmentation plays a crucial role in enabling comprehensive scene understanding for robotic systems. However, generating annotations is challenging, requiring labels for every pixel in an image. In scenarios like autonomous driving, there's a need to progressively incorporate new classes as the operating environment of the deployed agent becomes more complex. For enhanced annotation efficiency, ideally, only pixels belonging to new classes would be annotated. This approach is known as Continual Semantic Segmentation (CSS). Besides the common problem of classical catastrophic forgetting in the continual learning setting, CSS suffers from the inherent ambiguity of the background, a phenomenon we refer to as the "background shift", since pixels labeled as background could correspond to future classes (forward background shift) or previous classes (backward background shift). As a result, continual learning approaches tend to fail. This paper proposes a Backward Background Shift Detector (BACS) to detect previously observed classes based on their distance in the latent space from the foreground centroids of previous steps. Moreover, we propose a modified version of the cross-entropy loss function, incorporating the BACS detector to down-weight background pixels associated with formerly observed classes. To combat catastrophic forgetting, we employ masked feature distillation alongside dark experience replay. Additionally, our approach includes a transformer decoder capable of adjusting to new classes without necessitating an additional classification head. We validate BACS's superior performance over existing state-of-the-art methods on standard CSS benchmarks.
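The centroid-distance test at the core of BACS can be sketched as follows (illustrative; the paper operates in the model's latent space, and the 2-D embeddings and threshold below are assumptions):

```python
import numpy as np

# Hedged sketch of backward-background-shift detection: a pixel
# embedding labelled "background" is flagged as a previously observed
# class when it lies within distance tau of a stored foreground
# centroid from an earlier learning step.

def detect_old_class(embedding, centroids, tau):
    d = np.linalg.norm(centroids - embedding, axis=1)
    k = int(d.argmin())
    return (k, d[k]) if d[k] < tau else (None, d[k])

centroids = np.array([[0.0, 0.0], [5.0, 5.0]])  # foreground classes from step 1
near_old = np.array([0.2, -0.1])                 # close to old class 0
truly_new = np.array([10.0, -8.0])               # genuinely background/new

print(detect_old_class(near_old, centroids, tau=1.0))   # flagged as class 0
print(detect_old_class(truly_new, centroids, tau=1.0))  # (None, ...): not flagged
```

Pixels flagged this way would then be down-weighted in the modified cross-entropy loss.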
https://arxiv.org/abs/2404.13148
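The core mechanism above is a distance test: a pixel labeled as background is down-weighted in the cross-entropy loss when its latent embedding sits close to the foreground centroid of a previously learned class. A toy pure-Python sketch of that idea, assuming a simple exponential weighting; the centroids, embeddings, and temperature `tau` are made up, and this is not the paper's implementation:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def background_weight(pixel_emb, old_centroids, tau=1.0):
    """Weight for a background-labeled pixel (backward-shift detection).

    Close to an old foreground centroid -> weight near 0 (likely an old
    class mislabeled as background); far from all centroids -> weight
    near 1 (genuine background).
    """
    d = min(euclidean(pixel_emb, c) for c in old_centroids.values())
    return 1.0 - math.exp(-d / tau)

def weighted_ce(probs, target_idx, weight):
    """Cross-entropy for one pixel, scaled by the detector weight."""
    return -weight * math.log(probs[target_idx])

centroids = {"car": [1.0, 0.0], "road": [0.0, 1.0]}  # from earlier steps
bg_pixel_near_car = [0.9, 0.1]
bg_pixel_far = [5.0, 5.0]
w_near = background_weight(bg_pixel_near_car, centroids)
w_far = background_weight(bg_pixel_far, centroids)
print(round(w_near, 3), round(w_far, 3))  # small vs. near-one weight
```

In training, the background term of the loss for each pixel would be multiplied by this weight, so formerly observed classes hiding in the background contribute little gradient.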
Underwater images taken from autonomous underwater vehicles (AUVs) often suffer from low light, high turbidity, poor contrast, motion blur, and excessive light scattering, and hence require image enhancement techniques for object recognition. Machine learning methods are increasingly used for object recognition under such adverse conditions. Enhanced object recognition on images taken from AUVs has potential applications in underwater pipeline and optical fibre surveillance, ocean bed resource extraction, ocean floor mapping, underwater species exploration, etc. While classical machine learning methods are very efficient in terms of accuracy, they require large datasets and high computational time for image classification. In the current work, we use quantum-classical hybrid machine learning methods for real-time underwater object recognition on board an AUV for the first time. We use real-time motion-blurred and low-light images taken from the on-board camera of an AUV built in-house and apply existing hybrid machine learning methods for object recognition. Our hybrid methods consist of quantum encoding and flattening of classical images using quantum circuits, which are then sent to classical neural networks for image classification. The results of the hybrid methods, carried out using PennyLane-based quantum simulators on GPU and using pre-trained models on an on-board NVIDIA GPU chipset, are compared with results from the corresponding classical machine learning methods. We observe that the hybrid quantum machine learning methods achieve an efficiency greater than 65\%, reduce run-time by one-third, and require 50\% smaller datasets for training compared to classical machine learning methods. We hope that our work opens up further possibilities for quantum-enhanced real-time computer vision in autonomous vehicles.
https://arxiv.org/abs/2404.13130
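A common way such hybrid pipelines encode classical pixels is angle encoding: each intensity becomes a qubit rotation, and the measured expectation values serve as features for a classical network. A pure-Python toy of single-qubit angle encoding (not the authors' PennyLane circuits); for RY(x) applied to |0>, the expectation <Z> reduces analytically to cos(x), so no simulator is needed for this sketch:

```python
import math

def angle_encode(pixels):
    """Encode each pixel intensity in [0, 1] as an RY(pi * x) rotation on
    its own qubit and return the <Z> expectation of each qubit.

    RY(t)|0> = cos(t/2)|0> + sin(t/2)|1>, so <Z> = cos^2(t/2) - sin^2(t/2)
    = cos(t) -- the single-qubit case has a closed form.
    """
    return [math.cos(x * math.pi) for x in pixels]

def classical_head(features, weights, bias):
    """Tiny classical layer on top of the quantum features (logistic unit)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

image_patch = [0.0, 0.25, 0.5, 1.0]       # normalised intensities
features = angle_encode(image_patch)       # approx [1.0, 0.707, 0.0, -1.0]
prob = classical_head(features, weights=[0.5, 0.5, 0.5, 0.5], bias=0.0)
print(round(prob, 3))
```

The real pipeline would use multi-qubit circuits on a simulator or device, but the shape of the data flow, classical image in, expectation-value features out, classical classifier last, is the same.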
AI in dermatology is evolving at a rapid pace, but the major limitation to training trustworthy classifiers is the scarcity of data with ground-truth concept-level labels, which are meta-labels semantically meaningful to humans. Foundation models like CLIP, which provide zero-shot capabilities, can help alleviate this challenge by leveraging vast amounts of image-caption pairs available on the internet. CLIP can be fine-tuned using domain-specific image-caption pairs to improve classification performance. However, CLIP's pre-training data is not well aligned with the medical jargon that clinicians use to perform diagnoses. The development of large language models (LLMs) in recent years has opened up the possibility of leveraging the expressive nature of these models to generate rich text. Our goal is to use these models to generate caption text that aligns well with both the clinical lexicon and the natural human language used in CLIP's pre-training data. Starting with captions used for images in PubMed articles, we extend them by passing the raw captions through an LLM fine-tuned on several of the field's textbooks. We find that using captions generated by an expressive fine-tuned LLM like GPT-3.5 improves downstream zero-shot concept classification performance.
https://arxiv.org/abs/2404.13043
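Zero-shot concept classification in the CLIP style scores an image embedding against the text embedding of a caption for each candidate concept and picks the most similar one; richer LLM-generated captions move the text embeddings closer to the clinical image content. A minimal pure-Python sketch; the concept names and all embedding vectors below are hypothetical stand-ins for encoder outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, concept_embs):
    """Return the concept whose caption embedding is most similar to the
    image embedding -- the standard CLIP zero-shot recipe."""
    return max(concept_embs, key=lambda c: cosine(image_emb, concept_embs[c]))

# Text embeddings that would come from encoding LLM-written concept captions.
concepts = {
    "papule":  [0.9, 0.1, 0.0],
    "plaque":  [0.1, 0.8, 0.1],
    "vesicle": [0.0, 0.2, 0.9],
}
image = [0.85, 0.2, 0.05]  # would come from the CLIP image encoder
print(zero_shot_classify(image, concepts))  # -> papule
```

The paper's contribution sits upstream of this step: better captions yield better concept text embeddings, which this similarity search then exploits unchanged.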