This study presents a hybrid model for classifying handwritten digits in the MNIST dataset, combining convolutional neural networks (CNNs) with a multi-well Hopfield network. The approach employs a CNN to extract high-dimensional features from input images, which are then clustered into class-specific prototypes using k-means clustering. These prototypes serve as attractors in a multi-well energy landscape, where a Hopfield network performs classification by minimizing an energy function that balances feature similarity and class membership. The model's design enables robust handling of intraclass variability, such as diverse handwriting styles, while providing an interpretable framework through its energy-based decision process. Through systematic optimization of the CNN architecture and the number of wells, the model achieves a high test accuracy of 99.2% on 10,000 MNIST images, demonstrating its effectiveness for image classification tasks. The findings highlight the critical role of deep feature extraction and sufficient prototype coverage in achieving high performance, with potential for broader applications in pattern recognition.
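At inference time, the decision rule described above reduces to finding the lowest-energy class prototype for a CNN feature vector. The sketch below illustrates that idea with k-means prototypes and a simple quadratic energy; the CNN feature extractor, the number of wells, and the paper's exact energy function are not reproduced here, and all names are placeholders.

```python
# Minimal sketch of the prototype-attractor decision rule, assuming features
# have already been extracted by a CNN. The quadratic "energy" is an
# illustrative stand-in for the paper's multi-well energy function.
import numpy as np
from sklearn.cluster import KMeans

def build_prototypes(features, labels, wells_per_class=5, n_classes=10):
    """Cluster each class's CNN features into `wells_per_class` prototypes."""
    prototypes, proto_labels = [], []
    for c in range(n_classes):
        class_feats = features[labels == c]
        km = KMeans(n_clusters=wells_per_class, n_init=10).fit(class_feats)
        prototypes.append(km.cluster_centers_)
        proto_labels.extend([c] * wells_per_class)
    return np.vstack(prototypes), np.array(proto_labels)

def classify(feature, prototypes, proto_labels):
    """Assign the class whose prototype (well) gives the lowest energy."""
    energies = np.sum((prototypes - feature) ** 2, axis=1)
    return proto_labels[np.argmin(energies)]
```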
https://arxiv.org/abs/2507.08766
It is common knowledge that constructing image datasets usually depends on time-intensive and inefficient manual collection and annotation. Large models offer a solution via data generation. Nonetheless, real-world data are clearly more valuable than artificially generated data, particularly for constructing image datasets. For this reason, we propose a novel method for automatically constructing datasets from real-world images with a multi-agent collaborative system, named DatasetAgent. By coordinating four different agents equipped with Multi-modal Large Language Models (MLLMs), as well as a tool package for image optimization, DatasetAgent is able to construct high-quality image datasets according to user-specified requirements. In particular, two types of experiments are conducted on a variety of open-source datasets: expanding existing datasets and creating new ones from scratch. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.
https://arxiv.org/abs/2507.08648
Deep learning has driven significant advances in medical image analysis, yet its adoption in clinical practice remains constrained by the large size and lack of transparency in modern models. Advances in interpretability techniques such as DL-Backtrace, Layer-wise Relevance Propagation, and Integrated Gradients make it possible to assess the contribution of individual components within neural networks trained on medical imaging tasks. In this work, we introduce an interpretability-guided pruning framework that reduces model complexity while preserving both predictive performance and transparency. By selectively retaining only the most relevant parts of each layer, our method enables targeted compression that maintains clinically meaningful representations. Experiments across multiple medical image classification benchmarks demonstrate that this approach achieves high compression rates with minimal loss in accuracy, paving the way for lightweight, interpretable models suited for real-world deployment in healthcare settings.
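A hedged sketch of how per-channel relevance scores (e.g., LRP or Integrated Gradients attributions aggregated over a calibration set) might drive the structured pruning described above is given below; computing the scores, the retention ratio, and the handling of downstream layers are outside this sketch and are illustrative assumptions.

```python
import torch
import torch.nn as nn

def prune_conv_by_relevance(conv: nn.Conv2d, channel_relevance: torch.Tensor,
                            keep_ratio: float = 0.5) -> nn.Conv2d:
    """Keep only the most relevant output channels of a Conv2d layer.

    `channel_relevance` is assumed to be a per-output-channel score obtained by
    aggregating an attribution method (LRP, Integrated Gradients, DL-Backtrace)
    over a calibration set; computing it is not shown here.
    """
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    keep = torch.topk(channel_relevance, n_keep).indices.sort().values
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep])
    return pruned
```

In practice the input channels of the following layer (and any normalization statistics) would have to be pruned consistently for the compressed model to remain valid.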
https://arxiv.org/abs/2507.08330
We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yields the best-calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are most robust under shift, (iii) recent calibration methods specifically aimed at increasing calibration under shifts do not necessarily offer significant improvements over simpler post-hoc calibration methods, (iv) improving calibration under shifts often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers, as well as for those finetuned from foundation models, the latter being consistently better calibrated than models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shifts, (ii) for ensembles, OOD exposure deteriorates the in-distribution versus shifted calibration trade-off, (iii) ensembling remains one of the most effective methods to improve calibration robustness and, combined with finetuning from foundation models, yields the best calibration results overall.
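As a concrete illustration of finding (i), below is a minimal sketch of an in-training objective combining label smoothing with an entropy bonus (the confidence-penalty form of entropy regularisation); the smoothing value and penalty weight are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def calibrated_training_loss(logits, targets, smoothing=0.1, entropy_beta=0.1):
    """Cross-entropy with label smoothing plus an entropy bonus.

    Subtracting the predictive entropy rewards less over-confident outputs;
    the hyperparameter values here are illustrative, not the paper's.
    """
    ce = F.cross_entropy(logits, targets, label_smoothing=smoothing)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return ce - entropy_beta * entropy
```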
https://arxiv.org/abs/2507.07780
Adversarial Training (AT) is a widely adopted defense against adversarial examples. However, existing approaches typically apply a uniform training objective across all classes, overlooking disparities in class-wise vulnerability. This results in adversarial unfairness: classes with well-distinguishable features (strong classes) tend to become more robust, while classes with overlapping or shared features (weak classes) remain disproportionately susceptible to adversarial attacks. We observe that strong classes do not require strong adversaries during training, as their non-robust features are quickly suppressed. In contrast, weak classes benefit from stronger adversaries to effectively reduce their vulnerabilities. Motivated by this, we introduce TRIX, a feature-aware adversarial training framework that adaptively assigns weaker targeted adversaries to strong classes, promoting feature diversity via uniformly sampled targets, and stronger untargeted adversaries to weak classes, enhancing their focused robustness. TRIX further incorporates per-class loss weighting and perturbation strength adjustments, building on prior work, to emphasize weak classes during optimization. Comprehensive experiments on standard image classification benchmarks, including evaluations under strong attacks such as PGD and AutoAttack, demonstrate that TRIX significantly improves worst-case class accuracy on both clean and adversarial data, reduces inter-class robustness disparities, and preserves overall accuracy. Our results highlight TRIX as a practical step toward fair and effective adversarial defense.
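A hedged sketch of the class-adaptive attack assignment is given below: strong classes receive weaker, targeted PGD attacks with uniformly sampled target labels, while weak classes receive stronger, untargeted PGD. The strong/weak split, the budgets, and the step counts are placeholders, and the per-class loss weighting is omitted.

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, steps, targeted=False):
    """Basic L-infinity PGD; in targeted mode the attack moves towards label y."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        step = -alpha * grad.sign() if targeted else alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv + step, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def trix_like_examples(model, x, y, strong_mask, n_classes=10,
                       eps_strong=4 / 255, eps_weak=8 / 255):
    """Weaker targeted attacks (uniform random targets) for strong classes,
    stronger untargeted attacks for weak classes; all budgets are illustrative."""
    targets = torch.randint(0, n_classes, y.shape, device=y.device)
    x_adv = x.clone()
    if strong_mask.any():
        x_adv[strong_mask] = pgd(model, x[strong_mask], targets[strong_mask],
                                 eps_strong, eps_strong / 4, steps=10, targeted=True)
    if (~strong_mask).any():
        x_adv[~strong_mask] = pgd(model, x[~strong_mask], y[~strong_mask],
                                  eps_weak, eps_weak / 4, steps=10, targeted=False)
    return x_adv
```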
https://arxiv.org/abs/2507.07768
Machine unlearning seeks to remove the influence of particular data or classes from trained models to meet privacy, legal, or ethical requirements. Existing unlearning methods tend to forget shallowly: the unlearned model merely pretends to forget by adjusting only its responses, while its internal representations retain enough information to restore the forgotten data or behavior. We empirically confirm this widespread shallowness by reverting the forgetting effect of various unlearning methods via a training-free performance recovery attack and a gradient-inversion-based data reconstruction attack. To address this vulnerability fundamentally, we define a theoretical criterion of "deep forgetting" based on one-point contraction of the feature representations of the data to forget. We also propose an efficient approximation algorithm and use it to construct a novel general-purpose unlearning algorithm: One-Point-Contraction (OPC). Empirical evaluations on image classification unlearning benchmarks show that OPC achieves not only effective unlearning performance but also superior resilience against both performance recovery and gradient-inversion attacks. The distinctive unlearning performance of OPC arises from the deep feature forgetting enforced by its theoretical foundation, and underscores the need for improved robustness of machine unlearning methods.
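The sketch below shows one plausible reading of a one-point-contraction objective: features of the forget set are pulled toward a single point while a standard retain-set term preserves utility. The choice of contraction point, the weighting, and the paper's actual approximation algorithm are assumptions, not details taken from the abstract.

```python
import torch
import torch.nn.functional as F

def opc_style_loss(feats_forget, feats_retain_logits, labels_retain,
                   contraction_point=None, lam=1.0):
    """Sketch of a one-point-contraction objective for unlearning.

    Pulls forget-set features to a single point (the origin by default, an
    assumption) while keeping retain-set cross-entropy; `lam` is illustrative.
    """
    if contraction_point is None:
        contraction_point = torch.zeros_like(feats_forget[0])
    contraction = ((feats_forget - contraction_point) ** 2).sum(dim=-1).mean()
    retain_ce = F.cross_entropy(feats_retain_logits, labels_retain)
    return retain_ce + lam * contraction
```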
https://arxiv.org/abs/2507.07754
With the rise of social media, vast amounts of user-uploaded videos (e.g., on YouTube) are utilized as training data for Visual Object Tracking (VOT). However, the VOT community has largely overlooked video data-privacy issues, as many private videos have been collected and used to train commercial models without authorization. To alleviate these issues, this paper presents the first investigation into preventing personal video data from unauthorized exploitation by deep trackers. Existing methods for preventing unauthorized data use primarily focus on image-based tasks (e.g., image classification); directly applying them to videos reveals several limitations, including inefficiency, limited effectiveness, and poor generalizability. To address these issues, we propose a novel generative framework for producing Temporal Unlearnable Examples (TUEs), whose efficient computation makes it scalable to large-scale video datasets. Trackers trained with TUEs rely heavily on the unlearnable noise for temporal matching, ignoring the original data structure and thus ensuring the privacy of the training videos. To enhance the effectiveness of TUEs, we introduce a temporal contrastive loss, which further corrupts the learning of existing trackers when our TUEs are used for training. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in video data-privacy protection, with strong transferability across VOT models, datasets, and temporal matching tasks.
https://arxiv.org/abs/2507.07483
The Multilayer Extreme Learning Machine (ML-ELM) and its variants have proven to be effective techniques for classifying different natural signals such as audio, video, acoustics, and images. In this paper, a Hybrid Multilayer Extreme Learning Machine (HML-ELM), based on the ELM autoencoder (ELM-AE) and interval type-2 fuzzy logic theory, is proposed for active image classification and applied to Unmanned Aerial Vehicles (UAVs). The proposed methodology is a hierarchical ELM learning framework consisting of two main phases: 1) self-taught feature extraction and 2) supervised feature classification. First, unsupervised multilayer feature encoding is achieved by stacking a number of ELM-AEs, in which the input data are projected into a series of high-level representations. In the second phase, the final features are classified using a novel Simplified Interval Type-2 Fuzzy ELM (SIT2-FELM) with a fast output-reduction layer based on the SC algorithm, an improved version of the Center of Sets Type Reducer Without Sorting Requirement (COSTRWSR). To validate the efficiency of the HML-ELM, two types of image-classification experiments are conducted. First, the HML-ELM is applied to a number of benchmark image-classification problems. Second, real experiments on the active classification and transport of four different objects between two predefined locations using a UAV are carried out. The experiments demonstrate that the proposed HML-ELM delivers superior efficiency compared to similar methodologies such as ML-ELM, the Multilayer Fuzzy Extreme Learning Machine (ML-FELM), and ELM.
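As a sketch of the unsupervised encoding phase, a single ELM-AE stage can be written as below: a random hidden mapping followed by a closed-form least-squares solution for the output weights, whose transpose then projects the data for the next stage. The fuzzy SIT2-FELM classifier on top is not reproduced, and the hidden sizes in the usage comment are illustrative.

```python
import numpy as np

def elm_autoencoder(X, n_hidden, rng=None):
    """One ELM-AE stage: random hidden mapping, closed-form output weights.

    Returns the transformed representation used as input to the next stage,
    following the standard ML-ELM construction.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)            # random feature mapping
    beta = np.linalg.pinv(H) @ X      # closed-form (least-squares) output weights
    return X @ beta.T                 # project data with the learned weights

# Stacking a few ELM-AE stages gives the self-taught feature-extraction phase:
# feats = elm_autoencoder(elm_autoencoder(X, 512), 128)
```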
https://arxiv.org/abs/2507.08047
In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in multimodal tasks such as text-image classification. The growing size of LMMs, however, results in a significant computational cost when fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategies have been studied to align modalities more efficiently. In this context, we propose a novel, efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers and integrate the different modalities with similarity-based prompt interaction, to leverage sufficient information exchange between modalities. With this approach, our method achieves reduced computational resource consumption and fewer trainable parameters (about 1% of the foundation model) compared to other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.
https://arxiv.org/abs/2507.07415
Foundation models are pre-trained on large-scale datasets and subsequently fine-tuned on small-scale datasets using parameter-efficient fine-tuning (PEFT) techniques such as low-rank adapters (LoRA). In most previous works, LoRA weight matrices are randomly initialized with a fixed rank across all attachment points. In this paper, we improve the convergence and final performance of LoRA fine-tuning using our proposed data-driven weight initialization method, ConsNoTrainLoRA (CNTLoRA). We express LoRA initialization as a domain-shift problem in which we use multiple constraints relating the pre-training and fine-tuning activations. By reformulating these constraints, we obtain a closed-form estimate of the LoRA weights that depends on the pre-training weights and fine-tuning activation vectors, and hence requires no training during initialization. This weight estimate is decomposed to initialize the up and down matrices, with the flexibility of variable ranks across attachment points. With the proposed initialization method, we fine-tune on downstream tasks such as image generation, image classification, and image understanding. Both quantitative and qualitative results demonstrate that CNTLoRA outperforms standard and data-driven weight initialization methods. Extensive analyses and ablations further elucidate the design choices of our framework, providing an optimal recipe for faster convergence and enhanced performance.
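The final step described above, splitting a weight-update estimate into up and down matrices, can be sketched with a truncated SVD as below. How CNTLoRA actually derives the closed-form update from pre-training weights and fine-tuning activations is paper-specific and not reproduced; `delta_W` and `rank` are placeholders.

```python
import torch

def init_lora_from_update(delta_W: torch.Tensor, rank: int):
    """Decompose an estimated weight update into LoRA down/up factors.

    `delta_W` stands in for the closed-form update estimate; a truncated SVD
    then yields initial down (A) and up (B) matrices with a per-layer rank.
    """
    U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
    sqrt_s = S[:rank].sqrt()
    B = U[:, :rank] * sqrt_s           # up-projection,   shape (out, r)
    A = sqrt_s[:, None] * Vh[:rank]    # down-projection, shape (r, in)
    return A, B                        # delta_W is approximately B @ A
```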
https://arxiv.org/abs/2507.08044
Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSIs face challenges such as redundant patches and unknown patch positions due to subjective pathologist captures. Moreover, generating automatic pathology captions remains a significant challenge. To address these issues, we introduce a novel GNN-ViTCap framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings using deep embedded clustering and selecting representative patches via a scalar dot attention mechanism. We build a graph by connecting each node to its nearest neighbors in the similarity matrix and apply a graph neural network to capture both local and global context. The aggregated image embeddings are projected into the language model's input space through a linear layer and combined with caption tokens to fine-tune a large language model. We validate our method on the BreakHis and PatchGastric datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning. Experimental results demonstrate that GNN-ViTCap outperforms state-of-the-art approaches, offering a reliable and efficient solution for microscopy-based patient diagnosis.
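A minimal sketch of the graph-construction step (connecting each selected patch to its nearest neighbours in the similarity matrix) is shown below; the neighbourhood size is an illustrative choice, and the clustering, attention, and GNN components are not included.

```python
import numpy as np

def knn_graph_from_embeddings(embeddings: np.ndarray, k: int = 8) -> np.ndarray:
    """Connect each patch embedding to its k nearest neighbours by cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)              # exclude self-loops from top-k
    adj = np.zeros_like(sim)
    topk = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(sim.shape[0]), k)
    adj[rows, topk.ravel()] = 1.0
    return np.maximum(adj, adj.T)               # symmetrise the adjacency
```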
https://arxiv.org/abs/2507.07006
Hyperspectral image (HSI) classification presents inherent challenges due to high spectral dimensionality, significant domain shifts, and limited availability of labeled data. To address these issues, we propose a novel Active Transfer Learning (ATL) framework built upon a Spatial-Spectral Transformer (SST) backbone. The framework integrates multistage transfer learning with an uncertainty-diversity-driven active learning mechanism that strategically selects highly informative and diverse samples for annotation, thereby significantly reducing labeling costs and mitigating sample redundancy. A dynamic layer freezing strategy is introduced to enhance transferability and computational efficiency, enabling selective adaptation of model layers based on domain shift characteristics. Furthermore, we incorporate a self-calibrated attention mechanism that dynamically refines spatial and spectral weights during adaptation, guided by uncertainty-aware feedback. A diversity-promoting sampling strategy ensures broad spectral coverage among selected samples, preventing overfitting to specific classes. Extensive experiments on benchmark cross-domain HSI datasets demonstrate that the proposed SST-ATL framework achieves superior classification performance compared to conventional approaches. The source code is publicly available at this https URL.
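As an illustration of the uncertainty-diversity-driven selection idea, the sketch below ranks unlabeled candidates by predictive entropy and enforces diversity by picking one top candidate per feature cluster. The function name and inputs are hypothetical, and the actual SST-ATL criterion, dynamic layer freezing, and self-calibrated attention are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_batch(probs: np.ndarray, feats: np.ndarray, budget: int) -> np.ndarray:
    """Uncertainty-diversity selection sketch: high-entropy samples, one per cluster."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    clusters = KMeans(n_clusters=budget, n_init=10).fit_predict(feats)
    chosen = []
    for c in range(budget):
        members = np.where(clusters == c)[0]
        if members.size:
            chosen.append(members[np.argmax(entropy[members])])
    return np.array(chosen)
```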
https://arxiv.org/abs/2411.18115
This study conducts a comprehensive comparison of four neural network architectures: Convolutional Neural Network, Capsule Network, Convolutional Kolmogorov-Arnold Network, and the newly proposed Capsule-Convolutional Kolmogorov-Arnold Network. The proposed Capsule-ConvKAN architecture combines the dynamic routing and spatial hierarchy capabilities of Capsule Network with the flexible and interpretable function approximation of Convolutional Kolmogorov-Arnold Networks. This novel hybrid model was developed to improve feature representation and classification accuracy, particularly in challenging real-world biomedical image data. The architectures were evaluated on a histopathological image dataset, where Capsule-ConvKAN achieved the highest classification performance with an accuracy of 91.21%. The results demonstrate the potential of the newly introduced Capsule-ConvKAN in capturing spatial patterns, managing complex features, and addressing the limitations of traditional convolutional models in medical image classification.
https://arxiv.org/abs/2507.06417
Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large-scale vision-language representation learning, UDA methods for segmentation have not taken advantage of the domain-agnostic properties of text. To address this, we present a novel Covariance-based Pixel-Text loss, CoPT, that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, where an LLM is used to generate source and target domain descriptions that are fed to a frozen CLIP model and combined. In experiments on four benchmarks we show that a model trained using CoPT achieves the new state of the art performance on UDA for segmentation. The code can be found at this https URL.
https://arxiv.org/abs/2507.07125
In this study, SoftReMish, a new activation function designed to improve the performance of convolutional neural networks (CNNs) in image classification tasks, is proposed. Using the MNIST dataset, a standard CNN architecture consisting of two convolutional layers, max pooling, and fully connected layers was implemented. SoftReMish was evaluated against popular activation functions, including ReLU, Tanh, and Mish, by replacing the activation function in all trainable layers. Model performance was assessed in terms of minimum training loss and maximum validation accuracy. The results showed that SoftReMish achieved the lowest minimum training loss (3.14e-8) and the highest validation accuracy (99.41%), outperforming all the other functions tested. These findings demonstrate that SoftReMish offers better convergence behavior and generalization capability, making it a promising candidate for visual recognition tasks.
https://arxiv.org/abs/2507.06148
Diffusion models have demonstrated exceptional performance across various domains due to their ability to model and generate complicated data distributions. However, when applied to PolSAR data, traditional real-valued diffusion models face challenges in capturing complex-valued phase information. Moreover, these models often struggle to preserve fine structural details. To address these limitations, we leverage the Contourlet transform, which provides rich multiscale and multidirectional representations well-suited to PolSAR imagery. We propose a structural-knowledge-guided complex diffusion model for PolSAR image classification in the Contourlet domain. Specifically, the complex Contourlet transform is first applied to decompose the data into low- and high-frequency subbands, enabling the extraction of statistical and boundary features. A knowledge-guided complex diffusion network is then designed to model the statistical properties of the low-frequency components. During this process, structural information from the high-frequency coefficients is used to guide the diffusion process, improving edge preservation. Furthermore, multiscale and multidirectional high-frequency features are jointly learned to further boost classification accuracy. Experimental results on three real-world PolSAR datasets demonstrate that our approach surpasses state-of-the-art methods, particularly in preserving edge details and maintaining region homogeneity in complex terrain.
https://arxiv.org/abs/2507.05666
Kolmogorov-Arnold Networks (KANs) have garnered attention for replacing fixed activation functions with learnable univariate functions, but they exhibit practical limitations, including high computational costs and performance deficits in general classification tasks. In this paper, we propose the Modulation Joint KAN (MJKAN), a novel neural network layer designed to overcome these challenges. MJKAN integrates a FiLM (Feature-wise Linear Modulation)-like mechanism with Radial Basis Function (RBF) activations, creating a hybrid architecture that combines the non-linear expressive power of KANs with the efficiency of Multilayer Perceptrons (MLPs). We empirically validated MJKAN's performance across a diverse set of benchmarks, including function regression, image classification (MNIST, CIFAR-10/100), and natural language processing (AG News, SMS Spam). The results demonstrate that MJKAN achieves superior approximation capabilities in function regression tasks, significantly outperforming MLPs, with performance improving as the number of basis functions increases. Conversely, in image and text classification, its performance was competitive with MLPs but revealed a critical dependency on the number of basis functions. We found that a smaller basis size was crucial for better generalization, highlighting that the model's capacity must be carefully tuned to the complexity of the data to prevent overfitting. In conclusion, MJKAN offers a flexible architecture that inherits the theoretical advantages of KANs while improving computational efficiency and practical viability.
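One plausible reading of the described layer, combining per-feature Gaussian RBF expansions with FiLM-style scale-and-shift modulation of a linear projection, is sketched below. The actual MJKAN design may differ; every architectural detail here (how the modulation is produced and applied, the basis size) is an assumption.

```python
import torch
import torch.nn as nn

class MJKANLikeLayer(nn.Module):
    """Plausible RBF + FiLM-style layer; not the authors' exact design."""
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-1, 1, n_basis).repeat(in_dim, 1))
        self.log_width = nn.Parameter(torch.zeros(in_dim, n_basis))
        self.linear = nn.Linear(in_dim, out_dim)
        # FiLM-style scale and shift generated from the RBF features
        self.to_gamma = nn.Linear(in_dim * n_basis, out_dim)
        self.to_beta = nn.Linear(in_dim * n_basis, out_dim)

    def forward(self, x):                       # x: (batch, in_dim)
        d = x.unsqueeze(-1) - self.centers      # (batch, in_dim, n_basis)
        rbf = torch.exp(-(d ** 2) * torch.exp(self.log_width))
        rbf = rbf.flatten(1)                    # (batch, in_dim * n_basis)
        h = self.linear(x)
        return self.to_gamma(rbf) * h + self.to_beta(rbf)
```

Keeping `n_basis` small is consistent with the paper's observation that smaller basis sizes generalize better on classification tasks.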
https://arxiv.org/abs/2507.04690
Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy, which often lead to overfitting and insufficient generalization capability. This paper proposes a novel MVNet architecture that integrates the local feature extraction of 3D-CNNs, the global modeling of Transformers, and the linear-complexity sequence modeling of Mamba, achieving efficient spatial-spectral feature extraction and fusion. MVNet features a redesigned dual-branch Mamba module comprising a State Space Model (SSM) branch and a non-SSM branch that employs 1D convolution with SiLU activation, enhancing the modeling of both short-range and long-range dependencies while reducing the computational latency of traditional Mamba. The optimized HSI-MambaVision Mixer module overcomes the unidirectional limitation of causal convolution, capturing bidirectional spatial-spectral dependencies in a single forward pass through decoupled attention that focuses on high-value features, alleviating parameter redundancy and the curse of dimensionality. On the IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency, demonstrating robust capability in processing complex HSI data.
https://arxiv.org/abs/2507.04409
In scenarios requiring both prediction and explanation efficiency for image classification, self-explaining models that perform both tasks in a single inference are effective. However, their training incurs substantial labeling and computational costs. This study aims to tackle the issue by proposing a method to transfer the visual explainability of self-explaining models, learned in a source domain, to a target domain based on a task arithmetic framework. Specifically, we construct a self-explaining model by extending image classifiers based on a vision-language pretrained model. We then define an explainability vector as the difference between model parameters trained on the source domain with and without explanation supervision. Based on the task arithmetic framework, we impart explainability to a model trained only on the prediction task in the target domain by applying the explainability vector. Experimental results on various image classification datasets demonstrate that, except for transfers between some less-related domains, visual explainability can be successfully transferred from source to target domains, improving explanation quality in the target domain without sacrificing classification accuracy. Furthermore, we show that the explainability vector learned on a large and diverse dataset like ImageNet, extended with explanation supervision, exhibits universality and robustness, improving explanation quality on nine out of ten different target datasets. We also find that the explanation quality achieved with a single model inference is comparable to that of Kernel SHAP, which requires 150 model inferences.
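The parameter arithmetic described above can be sketched directly on model state dicts, as below; the scaling coefficient is an illustrative knob not claimed by the abstract, and the state dicts are assumed to share the same architecture and keys.

```python
import torch

def apply_explainability_vector(theta_target, theta_src_expl, theta_src_plain, scale=1.0):
    """Task-arithmetic transfer sketch: add (source-with-explanation minus
    source-without-explanation) to a target-domain classifier's parameters."""
    return {
        name: theta_target[name]
              + scale * (theta_src_expl[name] - theta_src_plain[name])
        for name in theta_target
    }

# Usage sketch:
# target_model.load_state_dict(apply_explainability_vector(
#     target_model.state_dict(),
#     src_with_expl.state_dict(),
#     src_without_expl.state_dict()))
```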
https://arxiv.org/abs/2507.04380
Effective data curation is essential for optimizing neural network training. In this paper, we present the Guided Spectrally Tuned Data Selection (GSTDS) algorithm, which dynamically adjusts the subset of data points used for training using an off-the-shelf pre-trained reference model. Based on a pre-scheduled filtering ratio, GSTDS effectively reduces the number of data points processed per batch. The proposed method ensures an efficient selection of the most informative data points for training while avoiding redundant or less beneficial computations. The data points preserved in each batch are chosen via spectral analysis: a Fiedler-vector-based scoring mechanism removes the filtered portion of the batch, lightening the resource requirements of learning. The proposed data selection approach not only streamlines the training process but also promotes improved generalization and accuracy. Extensive experiments on standard image classification benchmarks, including CIFAR-10, Oxford-IIIT Pet, and Oxford-Flowers, demonstrate that GSTDS outperforms standard training scenarios and JEST, a recent state-of-the-art data curation method, on several key factors. GSTDS achieves notable reductions in computational requirements, up to a factor of four, without compromising performance. Under limited computational budgets, GSTDS also exhibits considerably higher accuracy than the other methodologies. These promising results underscore the potential of spectral data selection as a scalable solution for resource-efficient deep learning and motivate further exploration of adaptive data curation strategies. You can find the code at this https URL.
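A minimal sketch of the Fiedler-vector scoring step is given below: build a Gaussian-affinity graph over the batch, form the graph Laplacian, and score each point by its entry in the second-smallest eigenvector. How GSTDS maps these scores to the scheduled filtering ratio, and how it uses the reference model's features, are not reproduced; `sigma` and the keep fraction in the usage line are illustrative.

```python
import numpy as np

def fiedler_scores(feats: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Score a batch via the Fiedler vector of its similarity graph."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian affinity matrix
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W                   # unnormalised graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)        # ascending eigenvalues
    fiedler = eigvecs[:, 1]                     # second-smallest eigenvector
    return np.abs(fiedler)

# Keep, e.g., the top 75% of the batch by score (fraction is illustrative):
# keep = np.argsort(-fiedler_scores(batch_feats))[:int(0.75 * len(batch_feats))]
```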
https://arxiv.org/abs/2507.04269