Backdoor attacks pose a significant threat to deep neural networks, particularly as recent advancements have led to increasingly subtle implantation, making the defense more challenging. Existing defense mechanisms typically rely on an additional clean dataset as a standard reference and involve retraining an auxiliary model or fine-tuning the entire victim model. However, these approaches are often computationally expensive and not always feasible in practical applications. In this paper, we propose a novel and lightweight defense mechanism, termed PAD-FT, that does not require an additional clean dataset and fine-tunes only a very small part of the model to disinfect the victim model. To achieve this, our approach first introduces a simple data purification process to identify and select the most-likely clean data from the poisoned training dataset. The self-purified clean dataset is then used for activation clipping and fine-tuning only the last classification layer of the victim model. By integrating data purification, activation clipping, and classifier fine-tuning, our mechanism PAD-FT demonstrates superior effectiveness across multiple backdoor attack methods and datasets, as confirmed through extensive experimental evaluation.
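A minimal sketch of the two disinfection steps named above (activation clipping plus last-layer fine-tuning), assuming a torchvision-style model with a `features` extractor and a final `fc` classifier; the clipping bound, hyperparameters, and helper names are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

def clip_activations(x: torch.Tensor, bound: float = 3.0) -> torch.Tensor:
    """Clamp penultimate activations to suppress trigger-driven spikes."""
    return torch.clamp(x, max=bound)

def fine_tune_classifier(model, clean_loader, epochs=5, lr=1e-3, bound=3.0):
    # Freeze everything except the last classification layer.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True

    opt = torch.optim.Adam(model.fc.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in clean_loader:  # self-purified "most-likely clean" subset
            feats = clip_activations(model.features(x).flatten(1), bound)
            loss = loss_fn(model.fc(feats), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```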
https://arxiv.org/abs/2409.12072
The task of determining whether two texts are paraphrases has long been a challenge in NLP. However, the prevailing notion of paraphrase is often quite simplistic, offering only a limited view of the vast spectrum of paraphrase phenomena. Indeed, we find that evaluating models on a single paraphrase dataset can leave uncertainty about their true semantic understanding. To alleviate this, we release paraphrasus, a benchmark designed for multi-dimensional assessment of paraphrase detection models and finer model selection. We find that paraphrase detection models under a fine-grained evaluation lens exhibit trade-offs that cannot be captured through a single classification dataset.
https://arxiv.org/abs/2409.12060
Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor due to the complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on conventional machine learning techniques with hand-crafted features. While Convolutional Neural Networks (CNNs) significantly advanced automated classification in this domain, they often fall short when dealing with diverse seafloor textures, such as rocky or ripple sand bottoms, where false positive rates may increase. Recently, Vision Transformers (ViTs) have shown potential in addressing these limitations by utilizing a self-attention mechanism to capture global information in image patches, offering more flexibility in processing spatial hierarchies. This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures, such as ResNet and ConvNeXt, for binary classification tasks in SSS imagery. The dataset encompasses diverse geographical seafloor types and is balanced between the presence and absence of man-made objects. ViT-based models exhibit superior classification performance across F1-score, precision, recall, and accuracy metrics, although at the cost of greater computational resources. CNNs, with their inductive biases, demonstrate better computational efficiency, making them suitable for deployment in resource-constrained environments like underwater vehicles. Future research directions include exploring self-supervised learning for ViTs and multi-modal fusion to further enhance performance in challenging underwater environments.
https://arxiv.org/abs/2409.12026
Understanding the relationships between geometric structures and semantic concepts is crucial for building accurate models of complex environments. In indoor environments, certain spatial constraints, such as the relative positioning of planes, remain consistent despite variations in layout. This paper explores how these invariant relationships can be captured in a graph SLAM framework by representing high-level concepts like rooms and walls and linking them to geometric elements like planes through an optimizable factor graph. Several efforts have tackled this issue with ad-hoc solutions for generating each concept and with manually defined factors. This paper proposes a novel method for metric-semantic factor graph generation which includes defining a semantic scene graph, integrating geometric information, and learning the interconnecting factors, all based on Graph Neural Networks (GNNs). An edge classification network (G-GNN) sorts the edges between planes into same-room, same-wall, or none types. The resulting relations are clustered, generating a room or wall for each cluster. A second family of networks (F-GNN) infers the geometric origin of the new nodes. The definition of the factors employs the same F-GNN used for the metric attribute of the generated nodes. Furthermore, we share the new factor graph with the S-Graphs+ algorithm, extending its graph expressiveness and scene representation with the ultimate goal of improving SLAM performance. By training the networks on L-shaped rooms, the framework scales to environments with N-plane rooms. The framework is evaluated in synthetic and simulated scenarios, as no real datasets of the required complex layouts are available.
https://arxiv.org/abs/2409.11972
Large language models (LLMs) have enabled a range of applications in zero-shot and few-shot learning settings, including the generation of synthetic datasets for training and testing. However, to reliably use these synthetic datasets, it is essential to understand how representative they are of real-world data. We investigate this by assessing the effectiveness of generating synthetic data through LLMs and using it as a benchmark for various NLP tasks. Our experiments across six datasets and three different tasks show that while synthetic data can effectively capture the performance of various methods on simpler tasks, such as intent classification, it falls short for more complex tasks like named entity recognition. Additionally, we propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used both to generate benchmarking data and to perform the tasks. We find that smaller LLMs exhibit biases towards their own generated data, whereas larger models do not. Overall, our findings suggest that the effectiveness of synthetic data as a benchmark varies depending on the task, and that practitioners should rely on data generated from multiple larger models whenever possible.
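The abstract does not spell out the bias-factor formula, so the sketch below assumes one natural reading: a model's score on the benchmark it generated itself, divided by its mean score on benchmarks generated by other LLMs.

```python
# `scores[g][m]` holds the accuracy of model m on data generated by model g.

def bias_factor(scores: dict, model: str) -> float:
    own = scores[model][model]
    others = [scores[g][model] for g in scores if g != model]
    return own / (sum(others) / len(others))

scores = {
    "small-llm": {"small-llm": 0.92, "large-llm": 0.88},
    "large-llm": {"small-llm": 0.81, "large-llm": 0.86},
}
# A value above 1.0 suggests the model favors its own generated data.
print(bias_factor(scores, "small-llm"))  # 0.92 / 0.81 ≈ 1.14
```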
https://arxiv.org/abs/2409.11968
In this technical report, we describe the SNTL-NTU team's submission for Task 1 (Data-Efficient Low-Complexity Acoustic Scene Classification) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For small training splits, we explore reducing the complexity of the provided baseline model by reducing the number of base channels, and introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing-class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained on the original sampling rate of 44.1 kHz. We use knowledge distillation to distill the ensemble model into the baseline student model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile development dataset yielded the highest average testing accuracies of (62.21, 59.82, 56.81, 53.03, 47.97)% on the (100, 50, 25, 10, 5)% splits, respectively, across the three systems.
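A minimal sketch of the two ingredients described above, mixup augmentation and ensemble-to-student knowledge distillation; the alpha, temperature, and loss weight are illustrative defaults, not the submission's settings.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup(x, y, alpha=0.3):
    """Blend pairs of training examples to diversify small training splits."""
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], y, y[idx], lam

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, w=0.5):
    """KL term against the (ensemble) teacher plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * T * T
    hard = F.cross_entropy(student_logits, labels)
    return w * soft + (1 - w) * hard
```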
https://arxiv.org/abs/2409.11964
We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
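A minimal sketch of bottom-up token merging in the spirit of ATC, using scikit-learn's agglomerative clustering: token embeddings are clustered hierarchically and each cluster is averaged into a single token, with no learnable parameters. The distance metric and linkage are assumptions, not the paper's exact configuration.

```python
import torch
from sklearn.cluster import AgglomerativeClustering

def merge_tokens(tokens: torch.Tensor, keep_rate: float = 0.25) -> torch.Tensor:
    """tokens: (n, dim) embeddings; returns (n * keep_rate, dim) merged tokens."""
    n = tokens.size(0)
    n_keep = max(1, int(n * keep_rate))
    labels = AgglomerativeClustering(
        n_clusters=n_keep, metric="cosine", linkage="average"
    ).fit_predict(tokens.detach().cpu().numpy())
    labels = torch.as_tensor(labels, device=tokens.device)
    # Average each cluster into one token: no extra learnable parameters.
    return torch.stack([tokens[labels == c].mean(dim=0) for c in range(n_keep)])

tokens = torch.randn(196, 384)           # e.g. ViT patch tokens
print(merge_tokens(tokens, 0.25).shape)  # torch.Size([49, 384])
```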
https://arxiv.org/abs/2409.11923
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.
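A minimal sketch of the interleaving idea: alternate SSM blocks with standard self-attention blocks in a single stack. `MambaBlock` stands in for any S6-style implementation (e.g. from the `mamba-ssm` package) and is an assumption, not the paper's exact block.

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h, need_weights=False)[0]

class InterleavedBackbone(nn.Module):
    def __init__(self, dim, depth, mamba_block_cls):
        super().__init__()
        self.blocks = nn.ModuleList(
            # Even layers: SSM block; odd layers: self-attention block.
            [mamba_block_cls(dim) if i % 2 == 0 else AttentionBlock(dim)
             for i in range(depth)]
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        for blk in self.blocks:
            x = blk(x)
        return x
```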
https://arxiv.org/abs/2409.11867
Deep trackers have proven successful in visual tracking. Typically, these trackers employ optimally pre-trained deep networks to represent all diverse objects with multi-channel features from some fixed layers. The deep networks employed are usually trained to extract rich knowledge from the massive data used in object classification, so they are capable of representing generic objects very well. However, these networks are too complex to represent a specific moving object, leading to poor generalization as well as high computational and memory costs. This paper presents a novel and general framework termed channel distillation to facilitate deep trackers. To validate the effectiveness of channel distillation, we take the discriminative correlation filter (DCF) and ECO as examples. We demonstrate that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem that adaptively selects informative feature channels on the fly, improving the efficacy of tracking moving objects. Channel distillation can accurately extract good channels, alleviating the influence of noisy channels and generally reducing the number of channels, while adaptively generalizing to different channels and networks. The resulting deep tracker is accurate, fast, and has low memory requirements. Extensive experimental evaluations on popular benchmarks clearly demonstrate the effectiveness and generalizability of our framework.
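Reproducing the unified energy-minimization formulation would require the full DCF machinery, so the sketch below only illustrates the end effect: rank feature channels by a simple energy score and keep the top-k, shrinking the representation the tracker operates on.

```python
import torch

def select_channels(feat: torch.Tensor, k: int) -> torch.Tensor:
    """feat: (channels, H, W) multi-channel deep features for one frame."""
    energy = feat.pow(2).sum(dim=(1, 2))  # per-channel response energy
    top = torch.topk(energy, k).indices   # indices of informative channels
    return feat[top]                      # (k, H, W) compressed features

feat = torch.randn(512, 31, 31)          # e.g. conv features from a fixed layer
print(select_channels(feat, 64).shape)   # torch.Size([64, 31, 31])
```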
https://arxiv.org/abs/2409.11785
Current audio-visual representation learning can capture rough object categories (e.g., "animals" and "instruments"), but it lacks the ability to recognize fine-grained details, such as specific categories like "dogs" and "flutes" within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss into the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.
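A minimal sketch of attaching an audio-visual label prediction loss on top of a masked-autoencoder backbone, as described above; the head shapes and pooling are assumptions, with `labels` standing for the multi-hot object labels mined by the language-audio model and object detector.

```python
import torch
import torch.nn as nn

class LabelPredictionHead(nn.Module):
    def __init__(self, dim: int, num_objects: int):
        super().__init__()
        self.audio_head = nn.Linear(dim, num_objects)
        self.visual_head = nn.Linear(dim, num_objects)
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, audio_emb, visual_emb, labels):
        # audio_emb / visual_emb: pooled encoder outputs, (batch, dim)
        # labels: multi-hot object labels, (batch, num_objects)
        loss_a = self.bce(self.audio_head(audio_emb), labels)
        loss_v = self.bce(self.visual_head(visual_emb), labels)
        return loss_a + loss_v  # added to the MAE + contrastive objectives
```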
https://arxiv.org/abs/2409.11729
As Large Language Models (LLMs) advance in natural language processing, there is growing interest in leveraging their capabilities to simplify software interactions. In this paper, we propose a novel system that integrates LLMs for both classifying natural language inputs into corresponding API calls and automating the creation of sample datasets tailored to specific API functions. By classifying natural language commands, our system allows users to invoke complex software functionalities through simple inputs, improving interaction efficiency and lowering the barrier to software utilization. Our dataset generation approach also enables the efficient and systematic evaluation of different LLMs in classifying API calls, offering a practical tool for developers or business owners to assess the suitability of LLMs for customized API management. We conduct experiments on several prominent LLMs using generated sample datasets for various API functions. The results show that GPT-4 achieves a high classification accuracy of 0.996, while LLaMA-3-8B performs much worse at 0.759. These findings highlight the potential of LLMs to transform API management and validate the effectiveness of our system in guiding model testing and selection across diverse applications.
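A minimal sketch of the classification step: prompt an LLM to map a natural-language command onto one of a fixed set of API functions. The function list and prompt wording are illustrative, not the paper's.

```python
from openai import OpenAI

API_FUNCTIONS = ["create_invoice", "list_orders", "refund_payment"]

def classify_command(command: str, model: str = "gpt-4o") -> str:
    """Return the name of the API function matching a natural language command."""
    client = OpenAI()
    prompt = (
        "Classify the user command into exactly one of these API functions: "
        f"{', '.join(API_FUNCTIONS)}. Reply with the function name only.\n"
        f"Command: {command}"
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

# e.g. classify_command("give the customer their money back")
# -> "refund_payment"
```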
https://arxiv.org/abs/2409.11703
Histopathology analysis is the gold standard for medical diagnosis. Accurate classification of whole slide images (WSIs) and region-of-interest (ROI) localization can assist pathologists in diagnosis. The gigapixel resolution of WSIs and the absence of fine-grained annotations make direct classification and analysis challenging. In weakly supervised learning, multiple instance learning (MIL) presents a promising approach for WSI classification. The prevailing strategy is to use attention mechanisms to measure instance importance for classification. However, attention mechanisms fail to capture inter-instance information, and self-attention causes quadratic computational complexity. To address these challenges, we propose AMD-MIL, an agent aggregator with a mask denoise mechanism. The agent token acts as an intermediate variable between the query and key for computing instance importance. Mask and denoising matrices, mapped from the agent-aggregated values, dynamically mask low-contribution representations and eliminate noise. AMD-MIL achieves better attention allocation by adjusting feature representations, capturing micro-metastases in cancer, and improving interpretability. Extensive experiments on CAMELYON-16, CAMELYON-17, TCGA-KIDNEY, and TCGA-LUNG show AMD-MIL's superiority over state-of-the-art methods.
https://arxiv.org/abs/2409.11664
Eye movement biometrics is a secure and innovative identification method. Deep learning methods have shown good performance, but their network architectures rely on manual design and combined prior knowledge. To address these issues, we introduce neural architecture search (NAS) algorithms to the field of eye movement recognition and present Relax DARTS, an improvement of Differentiable Architecture Search (DARTS) that realizes more efficient network search and training. The key idea is to circumvent the issue of weight sharing by independently training the architecture parameters $\alpha$ to achieve a more precise target architecture. Moreover, the introduction of module input weights $\beta$ gives cells the flexibility to select their inputs, alleviating overfitting and improving model performance. Results on four public databases demonstrate that Relax DARTS achieves state-of-the-art recognition performance. Notably, Relax DARTS exhibits adaptability to other multi-feature temporal classification tasks.
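For context, a minimal sketch of the DARTS-style continuous relaxation that Relax DARTS builds on: each edge mixes candidate operations weighted by a softmax over architecture parameters $\alpha$. The candidate set is illustrative, and the module input weights $\beta$ described above would gate cell inputs in an analogous way.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, 3, padding=1),
            nn.Conv1d(channels, channels, 5, padding=2),
            nn.MaxPool1d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # Architecture parameters alpha, trained separately from the weights.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

x = torch.randn(8, 32, 100)   # (batch, channels, time) eye-movement features
print(MixedOp(32)(x).shape)   # torch.Size([8, 32, 100])
```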
https://arxiv.org/abs/2409.11652
Tuberculosis (TB) is caused by the bacterium Mycobacterium tuberculosis, primarily affecting the lungs. Early detection is crucial for improving treatment effectiveness and reducing transmission risk. Artificial intelligence (AI), particularly through image classification of chest X-rays, can assist in TB detection. However, class imbalance in TB chest X-ray datasets presents a challenge for accurate classification. In this paper, we propose a few-shot learning (FSL) approach using the Prototypical Network algorithm to address this issue. We compare the performance of ResNet-18, ResNet-50, and VGG16 in feature extraction from the TBX11K Chest X-ray dataset. Experimental results demonstrate classification accuracies of 98.93% for ResNet-18, 98.60% for ResNet-50, and 33.33% for VGG16. These findings indicate that the proposed method outperforms others in mitigating data imbalance, which is particularly beneficial for disease classification applications.
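A minimal sketch of the Prototypical Network classification rule used above: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. The backbone (e.g. ResNet-18 features) is abstracted away as precomputed embeddings.

```python
import torch

def prototypes(support: torch.Tensor, labels: torch.Tensor, n_cls: int):
    """Mean embedding per class: (n_cls, dim)."""
    return torch.stack([support[labels == c].mean(dim=0) for c in range(n_cls)])

def classify(query: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    dists = torch.cdist(query, protos)  # Euclidean distance to each prototype
    return dists.argmin(dim=1)          # predicted class per query

support = torch.randn(10, 512)          # 5 images x 2 classes, embedded
labels = torch.tensor([0] * 5 + [1] * 5)
preds = classify(torch.randn(4, 512), prototypes(support, labels, 2))
```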
https://arxiv.org/abs/2409.11644
Convolutional neural networks (CNNs) perform well in hyperspectral image (HSI) classification tasks, but their high energy consumption and complex network structure make them difficult to apply directly on edge computing devices. At present, spiking neural networks (SNNs) are developing rapidly in HSI classification tasks due to their low energy consumption and event-driven characteristics, but they usually require a longer time step to achieve optimal accuracy. In response to these problems, this paper builds a spiking neural network (SNN-SWMR) based on the leaky integrate-and-fire (LIF) neuron model for HSI classification tasks. The network uses the spiking width mixed residual (SWMR) module as the basic unit to perform feature extraction. The SWMR module is composed of spiking mixed convolution (SMC), which can effectively extract spatial-spectral features. Secondly, this paper designs a simple and efficient arcsine approximate derivative (AAD), which solves the non-differentiability of spike firing by fitting the Dirac function. Through AAD, we can directly train supervised spiking neural networks. Finally, this paper conducts comparative experiments against multiple advanced SNN-based HSI classification algorithms on six public hyperspectral datasets. Experimental results show that the AAD function has strong robustness and a good fitting effect. Meanwhile, compared with other algorithms, SNN-SWMR reduces the time step by about 84% and the training and testing time by about 63% and 70%, respectively, at the same accuracy. This study solves the key problem of SNN-based HSI classification algorithms, which has important practical significance for promoting their application in edge devices such as spaceborne and airborne platforms.
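A minimal sketch of surrogate-gradient training with an arcsine-style approximate derivative. The exact AAD form is the paper's; the sketch assumes a smooth approximation h(v) = 1/2 + (1/pi) * arcsin(clamp(a*v, -1, 1)) of the Heaviside step and uses its derivative as the backward pass for the otherwise non-differentiable spike.

```python
import torch

class SpikeAAD(torch.autograd.Function):
    @staticmethod
    def forward(ctx, v: torch.Tensor, a: float = 2.0):
        ctx.save_for_backward(v)
        ctx.a = a
        return (v >= 0).float()                 # hard spike in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        a = ctx.a
        inside = (a * v).abs() < 1.0
        # d/dv [1/2 + (1/pi) * arcsin(a*v)] = a / (pi * sqrt(1 - (a*v)^2))
        surrogate = torch.where(
            inside,
            a / (torch.pi * torch.sqrt(1.0 - (a * v).clamp(-0.999, 0.999) ** 2)),
            torch.zeros_like(v),
        )
        return grad_out * surrogate, None

spike = SpikeAAD.apply  # usable in place of a non-differentiable threshold
```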
https://arxiv.org/abs/2409.11619
The Forward-Forward (FF) algorithm is a recent, purely forward-mode learning method that updates weights locally and layer-wise and supports supervised as well as unsupervised learning. These features make it ideal for applications such as brain-inspired learning, low-power hardware neural networks, and distributed learning in large models. However, while FF has shown promise on handwritten digit recognition tasks, its performance on natural images and time series remains a challenge. A key limitation is the need to generate high-quality negative examples for contrastive learning, especially in unsupervised tasks, where versatile solutions are currently lacking. To address this, we introduce the Self-Contrastive Forward-Forward (SCFF) method, inspired by self-supervised contrastive learning. SCFF generates positive and negative examples applicable across different datasets, surpassing existing local forward algorithms in unsupervised classification accuracy on MNIST (MLP: 98.7%), CIFAR-10 (CNN: 80.75%), and STL-10 (CNN: 77.3%). Additionally, SCFF is the first to enable FF training of recurrent neural networks, opening the door to more complex tasks and continuous-time video and text processing.
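A minimal sketch of a Forward-Forward layer update: the layer is trained locally so that positive examples get high "goodness" (mean squared activation) and negative examples low goodness. How SCFF constructs the positive/negative pairs is the paper's contribution and is abstracted here as `x_pos`/`x_neg`.

```python
import torch
import torch.nn as nn

class FFLayer(nn.Module):
    def __init__(self, d_in, d_out, threshold=2.0, lr=1e-3):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def goodness(self, x):
        return self.fc(x).relu().pow(2).mean(dim=1)

    def local_step(self, x_pos, x_neg):
        g_pos, g_neg = self.goodness(x_pos), self.goodness(x_neg)
        # Push positive goodness above the threshold, negative below it.
        loss = torch.log1p(torch.exp(
            torch.cat([self.threshold - g_pos, g_neg - self.threshold])
        )).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        # Pass normalized, detached activations to the next layer (no backprop).
        with torch.no_grad():
            h_pos = nn.functional.normalize(self.fc(x_pos).relu(), dim=1)
            h_neg = nn.functional.normalize(self.fc(x_neg).relu(), dim=1)
        return h_pos, h_neg
```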
https://arxiv.org/abs/2409.11593
In the medical domain, acquiring large datasets poses significant challenges due to privacy concerns. Nonetheless, the development of a robust deep-learning model for retinal disease diagnosis necessitates a substantial dataset for training. The capacity to generalize effectively on smaller datasets remains a persistent challenge, and the scarcity of data presents a significant barrier to the practical implementation of scalable medical AI solutions. To address this issue, we combine a wide range of data sources to improve performance and generalization to new data, and develop a self-supervised framework based on the SwinV2 architecture to gain a deeper understanding of multi-modal dataset representations, enhancing the model's ability to extrapolate to new data for the detection of eye diseases using optical coherence tomography (OCT) images. We adopt a two-phase training methodology: self-supervised pre-training, followed by fine-tuning of a downstream supervised classifier. An ablation study conducted across three datasets, covering various encoder backbones, the absence of data fusion, low data availability, and the absence of self-supervised pre-training, highlights the robustness of our method. Our findings demonstrate consistent performance across these diverse conditions, showcasing superior generalization capabilities compared to the baseline model, ResNet-50.
https://arxiv.org/abs/2409.11375
When employing deep neural networks (DNNs) for semantic segmentation in safety-critical applications like automotive perception or medical imaging, it is important to estimate their performance at runtime, e.g. via uncertainty estimates or prediction quality estimates. Previous works mostly performed uncertainty estimation at the pixel level. In a line of research, a connected-component-wise (segment-wise) perspective was taken, approaching uncertainty estimation at the object level by performing so-called meta classification and regression to estimate uncertainty and prediction quality, respectively. In those works, each predicted segment is considered individually to estimate its uncertainty or prediction quality. However, the neighboring segments may provide additional hints on whether a given predicted segment is of high quality, which we study in the present work. On the basis of uncertainty-indicating metrics on the segment level, we use graph neural networks (GNNs) to model a given segment's quality as a function of that segment's metrics as well as those of its neighboring segments. We compare different GNN architectures and achieve a notable performance improvement.
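A minimal sketch of the idea, using PyTorch Geometric: nodes are predicted segments with their uncertainty metrics as features, edges connect neighboring segments, and a GNN regresses each segment's prediction quality. Layer sizes are illustrative.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class SegmentQualityGNN(nn.Module):
    def __init__(self, n_metrics: int, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(n_metrics, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = nn.Linear(hidden, 1)   # meta regression: quality estimate

    def forward(self, x, edge_index):
        # x: (n_segments, n_metrics) uncertainty metrics per predicted segment
        # edge_index: (2, n_edges) adjacency between neighboring segments
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        return self.head(h).squeeze(-1)    # one quality score per segment
```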
https://arxiv.org/abs/2409.11373
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.
https://arxiv.org/abs/2409.11353
Numerous methods have been proposed to adapt a pre-trained foundational CLIP model for few-shot classification. As CLIP is trained on a large corpus, it generalises well through adaptation to few-shot classification. In this work, we analyse the intra-modal overlap in image space in terms of embedding representation. Our analysis shows that, due to contrastive learning, embeddings from the CLIP model exhibit high overlap in their cosine-similarity distributions between paired and unpaired examples in the image space, affecting the performance of few-shot training-free classification methods which rely on similarity in the image space for their predictions. To tackle intra-modal overlap, we propose to train a lightweight adapter on a generic set of samples from the Google Open Images dataset, demonstrating that this improves accuracy for few-shot training-free classification. We validate our contribution through extensive empirical analysis and demonstrate that reducing the intra-modal overlap leads to a) improved performance on a number of standard datasets, b) increased robustness to distribution shift and c) higher feature variance, rendering the features more discriminative for downstream tasks.
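A minimal sketch of the two ingredients above: measuring the cosine-similarity overlap between paired and unpaired image embeddings, and a lightweight residual adapter over frozen CLIP image features. The adapter shape and residual ratio are assumptions.

```python
import torch
import torch.nn as nn

def cosine_sims(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Row-wise cosine similarity between two batches of embeddings."""
    a = nn.functional.normalize(a, dim=-1)
    b = nn.functional.normalize(b, dim=-1)
    return (a * b).sum(dim=-1)

class ImageAdapter(nn.Module):
    def __init__(self, dim: int = 512, ratio: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                 nn.Linear(dim // 4, dim))
        self.ratio = ratio

    def forward(self, feat):
        # Blend adapted features with the frozen CLIP embedding.
        return self.ratio * self.net(feat) + (1 - self.ratio) * feat

paired_sims = cosine_sims(torch.randn(100, 512), torch.randn(100, 512))
# Comparing paired vs. unpaired similarity histograms exposes the overlap.
```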
https://arxiv.org/abs/2409.11338