Open-set domain generalization (OSDG) for hyperspectral image classification presents significant challenges due to the presence of unknown classes in target domains and the need for models to generalize across multiple unseen domains without target-specific adaptation. Existing domain adaptation methods assume access to target domain data during training and fail to address the fundamental issue of domain shift when unknown classes are present, leading to negative transfer and reduced classification performance. To address these limitations, we propose a novel open-set domain generalization framework that combines four key components: Spectrum-Invariant Frequency Disentanglement (SIFD) for domain-agnostic feature extraction, Dual-Channel Residual Network (DCRN) for robust spectral-spatial feature learning, Evidential Deep Learning (EDL) for uncertainty quantification, and Spectral-Spatial Uncertainty Disentanglement (SSUD) for reliable open-set classification. The SIFD module extracts domain-invariant spectral features in the frequency domain through attention-weighted frequency analysis and domain-agnostic regularization, while DCRN captures complementary spectral and spatial information via parallel pathways with adaptive fusion. EDL provides principled uncertainty estimation using Dirichlet distributions, enabling the SSUD module to make reliable open-set decisions through uncertainty-aware pathway weighting and adaptive rejection thresholding. Experimental results on three cross-scene hyperspectral classification tasks show that our approach achieves performance comparable to state-of-the-art domain adaptation methods while requiring no access to the target domain during training. The implementation will be made available at this https URL upon acceptance.
开放集领域泛化(OSDG)在高光谱图像分类中面临重大挑战,这是因为目标域中存在未知类别,并且需要模型能够跨多个未见领域进行泛化而无需针对特定领域的适应。现有的领域自适应方法假设可以在训练期间访问目标域数据,并且无法解决当出现未知类别时的领域偏移基本问题,从而导致负向迁移和分类性能下降。为了解决这些局限性,我们提出了一种新颖的开放集领域泛化框架,该框架结合了四个关键组成部分:光谱不变频域解耦(SIFD),用于领域不可知特征提取;双通道残差网络(DCRN),用于鲁棒的光谱-空间特征学习;证据深度学习(EDL),用于不确定性量化;以及光谱-空间不确定性解耦(SSUD),以实现可靠的开放集分类。 SIFD模块通过注意力加权频域分析和领域不可知正则化在频率域中提取领域不变的光谱特征。DCRN通过具有自适应融合的并行路径捕获互补的光谱和空间信息。EDL使用狄利克雷分布提供原理性的不确定性估计,使SSUD模块能够通过对不确定性感知路径权重以及自适应拒绝阈值进行可靠开放集决策。 在三个跨场景高光谱分类任务上的实验结果表明,我们的方法实现了与最新领域自适应方法相当的性能,同时无需在训练期间访问目标域数据。接受后,该实现将在此URL上提供:[此https URL]。
https://arxiv.org/abs/2506.09460
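The Dirichlet-based uncertainty estimate mentioned in the OSDG abstract above can be illustrated with the standard evidential deep learning recipe. This is a minimal sketch, not the authors' SSUD module: the softplus evidence function, the function names, and the fixed rejection threshold of 0.5 are assumptions made for illustration (the abstract describes an adaptive threshold whose form is not given).

```python
import numpy as np


def dirichlet_uncertainty(logits):
    """Turn raw class scores into Dirichlet belief masses and an uncertainty mass.

    Assumption: evidence is obtained with a softplus, a common EDL choice.
    """
    logits = np.asarray(logits, dtype=float)
    evidence = np.logaddexp(0.0, logits)            # numerically stable softplus, >= 0
    alpha = evidence + 1.0                           # Dirichlet concentration parameters
    strength = alpha.sum(axis=-1, keepdims=True)     # total Dirichlet strength S
    belief = evidence / strength                     # per-class belief mass
    uncertainty = logits.shape[-1] / strength.squeeze(-1)  # u = K / S
    expected_prob = alpha / strength                 # mean of the Dirichlet
    return belief, uncertainty, expected_prob


def predict_open_set(logits, threshold=0.5):
    """Return the predicted class, or -1 ("unknown") when uncertainty is too high.

    The fixed threshold is a hypothetical stand-in for an adaptive one.
    """
    _, uncertainty, prob = dirichlet_uncertainty(logits)
    labels = prob.argmax(axis=-1)
    return np.where(uncertainty > threshold, -1, labels)
```

By construction the belief masses and the uncertainty mass sum to one, so a sample with little evidence for any known class is pushed toward the "unknown" label.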
Feature augmentation generates novel samples in the feature space, providing an effective way to enhance the generalization ability of learning algorithms with hyperbolic geometry. Most hyperbolic feature augmentation is confined to closed environments, assuming the number of classes is fixed (i.e., seen classes) and generating features only for these classes. In this paper, we propose a hyperbolic dual feature augmentation method for open environments, which augments features for both seen and unseen classes in the hyperbolic space. To obtain a more precise approximation of the real data distribution for efficient training, (1) we adopt a neural ordinary differential equation module, enhanced by meta-learning, to estimate the feature distributions of both seen and unseen classes; (2) we then introduce a regularizer to preserve the latent hierarchical structures of data in the hyperbolic space; (3) we also derive an upper bound for the hyperbolic dual augmentation loss, allowing us to train a hyperbolic model using infinite augmentations for seen and unseen classes. Extensive experiments on five open-environment tasks (class-incremental learning, few-shot open-set recognition, few-shot learning, zero-shot learning, and general image classification) demonstrate that our method effectively enhances the performance of hyperbolic algorithms in open environments.
特征增强通过在特征空间中生成新颖样本,提供了一种有效的方法来利用双曲几何提升学习算法的泛化能力。大多数基于双曲几何的特征增强局限于封闭环境假设类别数量固定(即已知类别),仅对这些类别进行特征生成。本文提出了一种面向开放环境的双曲双重特征增强方法,它能在双曲空间中为已知和未知类同时生成特征。 为了更精确地逼近实际数据分布以实现高效训练,我们采取了以下措施: 1. 采用神经常微分方程模块并结合元学习技术来估计已知及未知类别的特征分布; 2. 引入正则化项来保持双曲空间中数据的潜在层次结构; 3. 推导出一种上界用于双曲双重增强损失,从而使得可以利用无限数量的增广样本(对于已知和未知类别)训练双曲模型。 在五个开放环境任务上进行了广泛的实验:类增量学习、少量样本开集识别、少量样本学习、零样本学习以及通用图像分类。结果表明,我们的方法有效提升了双曲算法在开放环境下的性能。
https://arxiv.org/abs/2506.08906
This paper introduces an enhanced Genetic Algorithm technique that optimizes neural networks for binary image classification tasks, such as cat vs. non-cat classification. The proposed method employs only two individuals for crossover, represented by two parameter sets: Leader and Follower. The Leader focuses on exploitation, representing the primary optimal solution, while the Follower promotes exploration by preserving diversity and avoiding premature convergence. Leader and Follower are modeled as two phases or roles. The key contributions of this work are threefold: (1) a self-adaptive layer dimension mechanism that eliminates the need for manual tuning of layer architectures; (2) the generation of two parameter sets, Leader and Follower, with 10 layer architecture configurations (5 for each set), ranked by Pareto dominance and cost after optimization; and (3) better results than gradient-based methods. Experimental results show that the proposed method achieves 99.04% training accuracy and 80% testing accuracy (cost = 0.06) on a three-layer network with architecture [12288, 17, 4, 1], outperforming a gradient-based approach that achieves 98% training accuracy and 80% testing accuracy (cost = 0.092) on a four-layer network with architecture [12288, 20, 7, 5, 1].
本文介绍了一种增强的遗传算法技术,用于优化神经网络以执行二值图像分类任务(例如猫与非猫分类)。所提出的方法仅使用两个个体进行交叉操作,分别由两组参数集表示:领导者和跟随者。领导者专注于开发,代表主要的最优解;而跟随者则通过保持多样性并避免过早收敛来促进探索。 该方法的关键贡献有三个方面: 1. 一种自适应层维度机制,消除了手动调整层架构的需求。 2. 生成两组参数集(领导者和跟随者的参数集),每组包含五个层级架构配置,并根据帕累托支配原则和成本进行优化后的排序。 3. 相比基于梯度的方法,取得了更好的结果。 实验结果显示,所提出的方法在三层网络结构 [12288, 17, 4, 1] 上实现了99.04%的训练准确率和80%的测试准确率(成本 = 0.06),而基于梯度的方法在一个四层架构 [12288, 20, 7, 5, 1] 的网络上仅达到98%的训练准确率和80%的测试准确率(成本 = 0.092),且性能低于所提出的方法。
https://arxiv.org/abs/2504.17346
The Radon cumulative distribution transform (R-CDT) is an easy-to-compute feature extractor that facilitates image classification tasks, especially in the small data regime. It is closely related to the sliced Wasserstein distance and provably guarantees the linear separability of image classes that emerge from translations or scalings. In many real-world applications, like the recognition of watermarks in filigranology, however, the data is subject to general affine transformations originating from the measurement process. To overcome this issue, we recently introduced the so-called max-normalized R-CDT, which only requires elementary operations and guarantees separability under arbitrary affine transformations. The aim of this paper is to continue our study of the max-normalized R-CDT, especially with respect to its robustness against non-affine image deformations. Our sensitivity analysis shows that its separability properties are stable provided the Wasserstein-infinity distance between the samples can be controlled. Since the Wasserstein-infinity distance only allows small local image deformations, we moreover introduce a mean-normalized version of the R-CDT. In this case, robustness relates to the Wasserstein-2 distance and also covers image deformations caused by impulsive noise, for instance. Our theoretical results are supported by numerical experiments showing the effectiveness of our novel feature extractors as well as their robustness against local non-affine deformations and impulsive noise.
Radon累积分布变换(R-CDT)是一种易于计算的特征提取器,特别有助于在小数据集情况下的图像分类任务。它与切片Wasserstein距离密切相关,并且能够保证由于平移或缩放导致的图像类别间的线性可分性。然而,在许多实际应用中,例如文件细密图学中的水印识别,测量过程会导致一般仿射变换的数据变化。为解决这一问题,我们最近引入了所谓的最大归一化R-CDT(max-normalized R-CDT),它只需要基本操作,并能保证在任意仿射变换下的可分性。本文旨在继续研究最大归一化R-CDT的特性,特别是其对非仿射图像变形的鲁棒性。 敏感性分析表明,在样本之间的Wasserstein无穷距离可以被控制的情况下,它的可分性属性是稳定的。由于Wasserstein无穷距离仅允许小范围内的局部图像变形,我们进一步引入了R-CDT的均值归一化版本(mean-normalized version)。在这种情况下,鲁棒性与Wasserstein-2距离相关,并且包括由脉冲噪声引起的图像变形。我们的理论结果得到了数值实验的支持,这些实验展示了新特征提取器的有效性和对局部非仿射变形及脉冲噪声的稳健性。
https://arxiv.org/abs/2506.08761
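To make the role of translations in the R-CDT abstract above concrete, the sketch below computes a one-dimensional cumulative distribution transform (the quantile function of a normalized signal, taken against a uniform reference). It is only an illustration of the building block: the Radon projection step and the max- and mean-normalizations from the paper are omitted, and the function name, the quantile grid, and the small `eps` offset are assumptions.

```python
import numpy as np


def cdt_1d(signal, x, n_quantiles=64, eps=1e-12):
    """Quantile-function representation of a non-negative 1D signal.

    eps keeps the cumulative sum strictly increasing so that the
    interpolation below is well defined (an implementation convenience).
    """
    density = np.asarray(signal, dtype=float) + eps
    density = density / density.sum()                 # normalize to unit mass
    cdf = np.cumsum(density)                          # cumulative distribution on the grid
    t = (np.arange(n_quantiles) + 0.5) / n_quantiles  # quantile levels strictly inside (0, 1)
    return np.interp(t, cdf, x)                       # invert the CDF by interpolation


# Usage: a translated bump yields the same transform plus a constant offset.
x = np.linspace(-10.0, 10.0, 2001)
bump = np.exp(-0.5 * (x / 0.5) ** 2)
shifted_bump = np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2)
offset = cdt_1d(shifted_bump, x) - cdt_1d(bump, x)
```

In transform space the translation becomes a constant shift, which is the kind of structure that makes classes generated by translations easy to separate linearly.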
Within the family of convolutional neural networks, InceptionNeXt has shown excellent competitiveness in image classification and a number of downstream tasks. Built on parallel one-dimensional strip convolutions, however, it has limited ability to capture spatial dependencies along different dimensions and fails to fully explore spatial modeling in local neighborhoods. In addition, the inherent locality constraints of convolution operations are detrimental to effective global context modeling. To overcome these limitations, we propose a novel backbone architecture termed InceptionMamba in this study. More specifically, the traditional one-dimensional strip convolutions are replaced by orthogonal band convolutions in our InceptionMamba to achieve cohesive spatial modeling. Furthermore, global contextual modeling can be achieved via a bottleneck Mamba module, facilitating enhanced cross-channel information fusion and an enlarged receptive field. Extensive evaluations on classification and various downstream tasks demonstrate that the proposed InceptionMamba achieves state-of-the-art performance with superior parameter and computational efficiency. The source code will be available at this https URL.
在卷积神经网络家族中,InceptionNeXt 在图像分类和一系列下游任务中表现出色。然而,由于其基于并行的一维条状卷积构建,它在捕捉不同维度上的空间依赖性方面能力有限,并且未能充分利用局部邻域的空间建模。此外,卷积操作本身的局部性限制了有效全局上下文建模的能力。 为了克服这些局限性,我们在本研究中提出了一个新型的骨干架构,称之为InceptionMamba。具体来说,在我们的InceptionMamba中,传统的一维条状卷积被正交带状卷积所取代,以实现连贯的空间建模。此外,通过瓶颈Mamba模块可以实现全局上下文建模,从而促进跨通道信息融合的增强和感受野的扩大。 在分类任务及各类下游任务上的广泛评估表明,提出的InceptionMamba架构达到了最先进的性能,并且具有优越的参数效率和计算效率。源代码将在此 https URL 提供。
https://arxiv.org/abs/2506.08735
Accurate classification of second-trimester fetal ultrasound images remains challenging due to low image quality, high intra-class variability, and significant class imbalance. In this work, we introduce a simple yet powerful, biologically inspired deep learning ensemble framework that, unlike prior studies focused on only a handful of anatomical targets, simultaneously distinguishes 16 fetal structures. Drawing on the hierarchical, modular organization of biological vision systems, our model stacks two complementary branches (a "shallow" path for coarse, low-resolution cues and a "detailed" path for fine, high-resolution features), concatenating their outputs for final prediction. To our knowledge, no existing method has addressed such a large number of classes with a comparably lightweight architecture. We trained and evaluated on 5,298 routinely acquired clinical images (annotated by three experts and reconciled via Dawid-Skene), reflecting real-world noise and variability rather than a "cleaned" dataset. Despite this complexity, our ensemble (EfficientNet-B0 + EfficientNet-B6 with LDAM-Focal loss) identifies 90% of organs with accuracy > 0.75 and 75% of organs with accuracy > 0.85, a performance competitive with more elaborate models applied to far fewer categories. These results demonstrate that biologically inspired modular stacking can yield robust, scalable fetal anatomy recognition in challenging clinical settings.
第二孕期胎儿超声图像的精确分类仍然具有挑战性,原因在于图像质量低、同一类别的内部变异性高以及类别不平衡严重。在这项研究中,我们提出了一种简单却强大的生物启发式深度学习集成框架,与以往仅关注少数解剖目标的研究不同,我们的模型能够同时区分16种胎儿结构。借鉴生物学视觉系统的层级和模块化组织原理,我们的模型由两条互补的路径组成(一条为“浅层”路径,用于低分辨率粗略线索;另一条为“详细”路径,用于高分辨率精细特征),并将这两条路径的输出进行拼接以得出最终预测结果。据我们所知,没有现有的方法能够使用同样轻量级的架构来处理如此多类别。 我们在5,298张常规获取的临床图像上进行了训练和评估(由三名专家标注并经Dawid-Skene算法协调),这些数据反映了现实世界中的噪声与变化情况,而非经过“清理”的数据集。尽管存在复杂性,我们的集成模型(EfficientNet-B0 + EfficientNet-B6 结合 LDAM-Focal 损失函数)能够识别出90%的器官准确性高于0.75,且对于75%的器官准确性高于0.85——这些性能与应用于较少类别的更复杂模型相当。这一结果表明,生物启发式的模块堆叠方法能够在临床环境中实现稳健而可扩展的胎儿解剖结构识别。
https://arxiv.org/abs/2506.08623
We introduce AdaAct, a novel optimization algorithm that adjusts learning rates according to activation variance. Our method enhances the stability of neuron outputs by incorporating neuron-wise adaptivity during the training process, which subsequently leads to better generalization -- a complementary approach to conventional activation regularization methods. Experimental results demonstrate AdaAct's competitive performance across standard image classification benchmarks. We evaluate AdaAct on CIFAR and ImageNet, comparing it with other state-of-the-art methods. Importantly, AdaAct effectively bridges the gap between the convergence speed of Adam and the strong generalization capabilities of SGD, all while maintaining competitive execution times. Code is available at this https URL.
我们介绍了AdaAct,这是一种新颖的优化算法,可根据激活方差调整学习率。我们的方法通过在训练过程中引入逐神经元自适应性来增强神经元输出的稳定性,从而提高了泛化能力,这种方法是对传统激活正则化方法的一种补充。实验结果表明,在标准图像分类基准测试中,AdaAct表现出竞争性的性能。我们在CIFAR和ImageNet数据集上评估了AdaAct,并将其与其它最先进的方法进行了比较。重要的是,AdaAct有效地弥合了Adam的收敛速度与SGD的强大泛化能力之间的差距,同时保持了竞争力的执行时间。代码可在此 https URL 获取。
https://arxiv.org/abs/2506.08353
Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more effectively extract and fuse spatial context with fine spectral information in hyperspectral image (HSI) classification, this paper proposes a novel network architecture called STNet. The core advantage of STNet stems from the dual innovative design of its Spatial-Spectral Transformer module: first, the fundamental explicit decoupling of spatial and spectral attention ensures targeted capture of key information in HSI; second, two functionally distinct gating mechanisms perform intelligent regulation at both the fusion level of attention flows (adaptive attention fusion gating) and the internal level of feature transformation (GFFN). This characteristic demonstrates superior feature extraction and fusion capabilities compared to traditional convolutional neural networks, while reducing overfitting risks in small-sample and high-noise scenarios. STNet enhances model representation capability without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
深度神经网络在高光谱图像分类中面临几个挑战,包括数据维度高、地面物体分布稀疏以及光谱冗余。这些问题往往导致过度拟合和有限的泛化能力。为了更有效地提取并融合高光谱图像(HSI)中的空间上下文信息与精细的光谱信息,本文提出了一种名为STNet的新颖网络架构。 STNet的核心优势在于其空间-光谱变换模块的双创新设计:首先,基础的空间和光谱注意力机制显式解耦确保了对HSI中关键信息的目标捕捉;其次,两种功能不同的门控机制在注意流融合水平(自适应注意力融合门控)和特征转换内部层面(GFFN)进行智能调节。这些特性相比传统的卷积神经网络展示了更优越的特征提取与融合能力,并且减少了小样本及高噪声场景中的过度拟合风险。STNet增强了模型表示能力,而无需增加网络深度或宽度。 所提出的方法在IN、UP和KSC数据集上表现出色,超越了主流的HSI分类方法。
https://arxiv.org/abs/2506.08324
End-to-end autonomous driving has emerged as a dominant paradigm, yet its highly entangled black-box models pose significant challenges in terms of interpretability and safety assurance. To improve model transparency and training flexibility, this paper proposes a hierarchical and decoupled post-training framework tailored for pretrained neural networks. By reconstructing intermediate feature maps from ground-truth labels, surrogate supervisory signals are introduced at transitional layers to enable independent training of specific components, thereby avoiding the complexity and coupling of conventional end-to-end backpropagation and providing interpretable insights into networks' internal mechanisms. To the best of our knowledge, this is the first method to formalize feature-level reverse computation as well-posed optimization problems, which we rigorously reformulate as systems of linear equations or least squares problems. This establishes a novel and efficient training paradigm that extends gradient backpropagation to feature backpropagation. Extensive experiments on multiple standard image classification benchmarks demonstrate that the proposed method achieves superior generalization performance and computational efficiency compared to traditional training approaches, validating its effectiveness and potential.
端到端的自动驾驶技术已经成为主导范式,但其高度纠缠的黑箱模型在解释性和安全性保证方面提出了重大挑战。为了提高模型透明度和训练灵活性,本文提出了一种针对预训练神经网络设计的分层解耦后训练框架。通过从真实标签中重构中间特征图,该方法在转换层引入替代监督信号,从而能够独立地对特定组件进行训练,避免了传统端到端反向传播的复杂性和耦合,并为理解模型内部机制提供了可解释性洞察。 据我们所知,这是首次将特征级别的逆向计算正式化为良好的优化问题的方法,我们将这些问题严格重述为线性方程组或最小二乘问题。这建立了一种新颖且高效的训练范式,将梯度反向传播扩展到了特征反向传播。在多个标准图像分类基准上的广泛实验表明,所提出的方法与传统训练方法相比,在泛化性能和计算效率方面均表现出优越性,验证了其有效性和潜力。
https://arxiv.org/abs/2506.07188
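The idea in the abstract above of recasting per-layer training as a least squares problem can be sketched for a single linear layer. This is an illustrative assumption about the setting, not the paper's procedure: how target feature maps are reconstructed from labels is not described in the abstract, so `target_features` is simply taken as given, and `fit_linear_layer` is a hypothetical helper name.

```python
import numpy as np


def fit_linear_layer(inputs, target_features):
    """Solve min_W ||inputs @ W - target_features||_F^2 in closed form.

    inputs:          (n_samples, d_in) features entering the layer
    target_features: (n_samples, d_out) surrogate supervisory signal
    """
    weights, *_ = np.linalg.lstsq(inputs, target_features, rcond=None)
    return weights


# Usage: when the targets are exactly linear in the inputs, the layer is recovered.
rng = np.random.default_rng(0)
layer_inputs = rng.standard_normal((50, 8))
true_weights = rng.standard_normal((8, 3))
recovered = fit_linear_layer(layer_inputs, layer_inputs @ true_weights)
```

Because each layer is fitted against its own target, no gradient has to flow through the rest of the network, which is what allows components to be trained independently.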
Traditional decision-based black-box adversarial attacks on image classifiers aim to generate adversarial examples by slightly modifying input images while keeping the number of queries low, where each query involves sending an input to the model and observing its output. Most existing methods assume that all queries have equal cost. However, in practice, queries may incur asymmetric costs; for example, in content moderation systems, certain output classes may trigger additional review, enforcement, or penalties, making them more costly than others. While prior work has considered such asymmetric cost settings, effective algorithms for this scenario remain underdeveloped. In this paper, we propose a general framework for decision-based attacks under asymmetric query costs, which we refer to as asymmetric black-box attacks. We modify two core components of existing attacks: the search strategy and the gradient estimation process. Specifically, we propose Asymmetric Search (AS), a more conservative variant of binary search that reduces reliance on high-cost queries, and Asymmetric Gradient Estimation (AGREST), which shifts the sampling distribution to favor low-cost queries. We design efficient algorithms that minimize total attack cost by balancing different query types, in contrast to earlier methods such as stealthy attacks that focus only on limiting expensive (high-cost) queries. Our method can be integrated into a range of existing black-box attacks with minimal changes. We perform both theoretical analysis and empirical evaluation on standard image classification benchmarks. Across various cost regimes, our method consistently achieves lower total query cost and smaller perturbations than existing approaches, with improvements of up to 40% in some settings.
传统的基于决策的黑盒对抗攻击方法旨在通过轻微修改输入图像来生成对抗样本,同时保持查询次数较少,每次查询涉及向模型发送一个输入并观察其输出。现有大多数方法假设所有查询的成本相同。然而,在实际应用中,如内容管理系统中的某些输出类别可能需要额外审查、执行或处罚,导致这些类别的查询成本高于其他类别。尽管以前的研究已经考虑过这种非对称的成本设定,但针对这种情况的有效算法尚未充分发展。 本文提出了一种在非对称查询成本下的基于决策的攻击通用框架,我们将其称为“非对称黑盒攻击”。我们修改了现有攻击方法中的两个核心组件:搜索策略和梯度估计过程。具体来说,我们提出了不对称搜索(AS),这是一种更保守的二分查找变体,减少了对高成本查询的依赖;以及不对称梯度估计(AGREST),它通过改变采样分布来优先考虑低成本查询。 我们的方法设计了高效算法,在不同类型的查询之间实现平衡以最小化总攻击成本。这与以前的方法有所不同,例如之前的重点只在于限制昂贵(高成本)查询的隐秘性攻击。本文提出的方法可以轻松集成到各种现有的黑盒攻击中而无需进行大的改动。 我们在标准图像分类基准上进行了理论分析和实证评估,在不同的成本设定下,我们的方法始终表现出更低的整体查询成本和更小的扰动,相较于现有方法改进幅度最高可达40%。
https://arxiv.org/abs/2506.06933
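The notion of a "more conservative variant of binary search" from the abstract above can be illustrated with a biased bisection. This is a hypothetical sketch rather than the paper's Asymmetric Search: it assumes a one-dimensional search between a point whose query is expensive (`lo`) and a point whose query is cheap (`hi`), and the `split` parameter, cost values, and function name are all illustrative.

```python
def asymmetric_search(is_high_cost, lo=0.0, hi=1.0, split=0.5,
                      tol=1e-3, high_cost=10.0, low_cost=1.0):
    """Locate the boundary between high-cost and low-cost outputs on [lo, hi].

    Invariant: a query at `lo` is high-cost and a query at `hi` is low-cost.
    split = 0.5 is ordinary bisection; split > 0.5 places each query closer
    to the low-cost end, trading more queries for fewer expensive ones.
    """
    total_cost = 0.0
    while hi - lo > tol:
        mid = lo + split * (hi - lo)
        if is_high_cost(mid):
            total_cost += high_cost   # expensive query: boundary lies above mid
            lo = mid
        else:
            total_cost += low_cost    # cheap query: boundary lies at or below mid
            hi = mid
    return 0.5 * (lo + hi), total_cost


# Usage with a toy oracle whose boundary sits at 0.3.
def toy_oracle(t):
    return t < 0.3


plain_estimate, plain_cost = asymmetric_search(toy_oracle, split=0.5)
biased_estimate, biased_cost = asymmetric_search(toy_oracle, split=0.7)
```

Both settings converge to the same boundary; which one is cheaper overall depends on the ratio between the two query costs and on where the boundary lies.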
Sparsifying neural networks often suffers from seemingly inevitable performance degradation, and it remains challenging to restore the original performance despite much recent progress. Motivated by recent studies in robust optimization, we aim to tackle this problem by finding subnetworks that are both sparse and flat at the same time. Specifically, we formulate pruning as a sparsity-constrained optimization problem where flatness is encouraged as an objective. We solve it explicitly via an augmented Lagrange dual approach and extend it further by proposing a generalized projection operation, resulting in novel pruning methods called SAFE and its extension, SAFE$^+$. Extensive evaluations on standard image classification and language modeling tasks reveal that SAFE consistently yields sparse networks with improved generalization performance, which compares competitively to well-established baselines. In addition, SAFE demonstrates resilience to noisy data, making it well-suited for real-world conditions.
稀疏化神经网络通常会遭受看似不可避免的性能下降,尽管最近有很多进展,但要恢复原始性能仍然具有挑战性。受近期鲁棒优化研究的启发,我们旨在通过寻找同时具备稀疏性和平坦性的子网络来解决这个问题。具体来说,我们将剪枝定义为一个带有稀疏约束的优化问题,并鼓励平坦性作为目标函数的一部分。我们利用增强拉格朗日对偶方法明确地解决了该问题,并进一步提出了广义投影操作,从而得出了称为SAFE及其扩展版本SAFE$^+$的新剪枝方法。在标准图像分类和语言建模任务上的广泛评估表明,SAFE始终能生成稀疏网络并提高泛化性能,与已确立的基线相比具有竞争力。此外,SAFE展示了对噪声数据的强大鲁棒性,使其适合于现实世界条件下的应用。
https://arxiv.org/abs/2506.06866
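A sparsity constraint of the kind described in the abstract above is commonly enforced by projecting the weights onto the set of tensors with at most k nonzero entries. The sketch below shows that standard Euclidean projection (keep the k largest magnitudes, zero the rest). It is not SAFE's generalized projection, whose definition the abstract does not give, and `project_topk` is a hypothetical name.

```python
import numpy as np


def project_topk(weights, k):
    """Euclidean projection onto tensors with at most k nonzero entries."""
    weights = np.asarray(weights, dtype=float)
    flat = weights.ravel()
    if k >= flat.size:
        return weights.copy()                     # constraint already satisfied
    projected = np.zeros_like(flat)
    if k > 0:
        # Indices of the k entries with the largest absolute value.
        keep = np.argpartition(np.abs(flat), -k)[-k:]
        projected[keep] = flat[keep]
    return projected.reshape(weights.shape)


# Usage: keep the two largest-magnitude weights of a 2x2 matrix.
pruned = project_topk(np.array([[0.1, -3.0], [2.0, 0.05]]), k=2)
```

In a dual or alternating scheme, a step like this would typically be interleaved with updates of the dense weights that optimize the flatness-aware objective.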
Recent advancements in quantum machine learning have shown promise in enhancing classical neural network architectures, particularly in domains involving complex, high-dimensional data. Building upon prior work in temporal sequence modeling, this paper introduces Vision-QRWKV, a hybrid quantum-classical extension of the Receptance Weighted Key Value (RWKV) architecture, applied for the first time to image classification tasks. By integrating a variational quantum circuit (VQC) into the channel mixing component of RWKV, our model aims to improve nonlinear feature transformation and enhance the expressive capacity of visual representations. We evaluate both classical and quantum RWKV models on a diverse collection of 14 medical and standard image classification benchmarks, including MedMNIST datasets, MNIST, and FashionMNIST. Our results demonstrate that the quantum-enhanced model outperforms its classical counterpart on a majority of datasets, particularly those with subtle or noisy class distinctions (e.g., ChestMNIST, RetinaMNIST, BloodMNIST). This study represents the first systematic application of quantum-enhanced RWKV in the visual domain, offering insights into the architectural trade-offs and future potential of quantum models for lightweight and efficient vision tasks.
近期在量子机器学习领域的进展显示出其能够增强经典神经网络架构,特别是在处理复杂高维数据的领域中。本文基于之前的时间序列建模工作,提出了Vision-QRWKV模型,这是Receptance Weighted Key Value (RWKV) 架构的一种混合量子-经典扩展形式,并首次应用于图像分类任务。通过将变分量子电路(VQC)集成到RWKV架构中的通道混频组件中,我们的模型旨在改进非线性特征变换并增强视觉表示的表达能力。 我们在包括MedMNIST数据集、MNIST和FashionMNIST在内的14个不同的医学和标准图像分类基准上评估了经典和量子RWKV模型。研究结果表明,在大多数数据集中,特别是那些类别区分细微或有噪音的数据集中(例如ChestMNIST、RetinaMNIST和BloodMNIST),增强后的量子模型的表现优于其经典的对应版本。 这项研究代表了在视觉领域首次系统性地应用增强型量子RWKV架构的研究工作。它提供了关于该领域架构权衡以及未来轻量级且高效视觉任务中量子模型潜在价值的见解。
https://arxiv.org/abs/2506.06633
Diagnosing deep neural networks (DNNs) through the eigenspectrum of weight matrices has been an active area of research in recent years. At a high level, eigenspectrum analysis of DNNs involves measuring the heavytailness of the empirical spectral densities (ESD) of weight matrices. It provides insight into how well a model is trained and can guide decisions on assigning better layer-wise training hyperparameters. In this paper, we address a challenge associated with such eigenspectrum methods: the impact of the aspect ratio of weight matrices on estimated heavytailness metrics. We demonstrate that matrices of varying sizes (and aspect ratios) introduce a non-negligible bias in estimating heavytailness metrics, leading to inaccurate model diagnosis and layer-wise hyperparameter assignment. To overcome this challenge, we propose FARMS (Fixed-Aspect-Ratio Matrix Subsampling), a method that normalizes the weight matrices by subsampling submatrices with a fixed aspect ratio. Instead of measuring the heavytailness of the original ESD, we measure the average ESD of these subsampled submatrices. We show that measuring the heavytailness of these submatrices with the fixed aspect ratio can effectively mitigate the aspect ratio bias. We validate our approach across various optimization techniques and application domains that involve eigenspectrum analysis of weights, including image classification in computer vision (CV) models, scientific machine learning (SciML) model training, and large language model (LLM) pruning. Our results show that despite its simplicity, FARMS uniformly improves the accuracy of eigenspectrum analysis while enabling more effective layer-wise hyperparameter assignment in these application domains. In one of the LLM pruning experiments, FARMS reduces the perplexity of the LLaMA-7B model by 17.3% when compared with the state-of-the-art method.
近年来,通过权重矩阵的特征谱分析来诊断深度神经网络(DNN)已成为研究的一个活跃领域。从高层次上看,DNN 的特征谱分析涉及测量权重矩阵经验谱密度 (ESD) 的尾部厚度。这为模型训练的质量提供了见解,并能够指导分层训练超参数的选择。在这篇论文中,我们解决了一个与该特征谱方法相关的问题:权重矩阵的纵横比对估算出的尾部厚度度量的影响。我们展示了不同大小(和纵横比)的矩阵会对估算出的尾部厚度度量引入不可忽视的偏差,导致模型诊断不准确以及分层超参数设定不当。 为了解决这个问题,我们提出了一种名为 FARMS (固定纵横比矩阵子采样)的方法。该方法通过从权重矩阵中随机抽取具有固定纵横比的小块来规范化这些矩阵,并测量这些小块的平均经验谱密度(而不是原始 ESD 的尾部厚度)。我们展示了对具有固定纵横比的小块进行尾部厚度度量可以有效缓解纵横比偏差。 我们在涉及重量特征谱分析的各种优化技术及应用领域验证了这种方法的有效性,包括计算机视觉 (CV) 模型中的图像分类、科学机器学习(SciML)模型训练以及大型语言模型 (LLM) 剪枝。结果显示,尽管其原理简单,但 FARMS 在这些应用场景中一致地提高了特征谱分析的准确性,并且允许更有效的分层超参数分配。 在一项 LLM 剪枝实验中,FARMS 相较于最新方法使 LLaMA-7B 模型的困惑度降低了 17.3%。
https://arxiv.org/abs/2506.06280
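The subsampling step described in the FARMS abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: submatrices are drawn by picking random row and column subsets, the aspect ratio is taken as columns divided by rows, and the pooled squared singular values stand in for the averaged ESD. The paper's actual sampling scheme and heavy-tail estimator are not specified in the abstract, and `farms_eigenvalues` is a hypothetical name.

```python
import numpy as np


def farms_eigenvalues(weight, aspect_ratio=1.0, n_samples=8, seed=0):
    """Pool the spectra of random submatrices that share one aspect ratio.

    aspect_ratio is columns / rows of each submatrix (an assumed convention).
    Returns the concatenated eigenvalues of sub.T @ sub over all samples.
    """
    weight = np.asarray(weight, dtype=float)
    n_rows, n_cols = weight.shape
    # Largest submatrix with the requested aspect ratio that fits in `weight`.
    sub_rows = max(1, min(n_rows, int(n_cols / aspect_ratio)))
    sub_cols = max(1, min(n_cols, int(round(sub_rows * aspect_ratio))))
    rng = np.random.default_rng(seed)
    pooled = []
    for _ in range(n_samples):
        rows = np.sort(rng.choice(n_rows, size=sub_rows, replace=False))
        cols = np.sort(rng.choice(n_cols, size=sub_cols, replace=False))
        sub = weight[np.ix_(rows, cols)]
        singular_values = np.linalg.svd(sub, compute_uv=False)
        pooled.append(singular_values ** 2)       # nonzero eigenvalues of sub.T @ sub
    return np.concatenate(pooled)


# Usage: a wide 64 x 256 layer analysed through square 64 x 64 blocks.
layer = np.random.default_rng(1).standard_normal((64, 256))
spectrum = farms_eigenvalues(layer, aspect_ratio=1.0, n_samples=4)
```

A heavy-tail metric would then be fitted to `spectrum` instead of the spectrum of the full matrix, so that layers of different shapes are compared at a common aspect ratio.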
ResNet has been widely used in image classification tasks due to its ability to model the residual dependence of constant mappings for linear computation. However, ResNet transfers features in a single direction and lacks an effective method for correlating contextual information, which limits its effectiveness on fetal ultrasound images, where low contrast, high similarity, and high noise are common. Therefore, we propose FPDANet, a network based on bilateral multi-scale information fusion, to address the above challenges. Specifically, we design the positional attention mechanism (DAN) module, which utilizes the similarity of features to establish the dependency of different spatial positional features and enhance the feature representation. In addition, we design a bilateral multi-scale (FPAN) information fusion module to capture contextual and global feature dependencies at different feature scales, thereby further improving the model representation. FPDANet achieves 91.05% and 100% on the Top-1 and Top-5 metrics, respectively, and the experimental results demonstrate the effectiveness and robustness of FPDANet.
ResNet由于其能够对恒定映射的残差依赖进行建模,从而实现线性计算的能力,在图像分类任务中得到了广泛应用。然而,ResNet方法采用的是单向特征传输,并且缺乏有效的方法来关联上下文信息,这在胎儿超声图像的分类任务中效果不佳,因为这些图像存在对比度低、相似度高和噪声大的问题。因此,我们提出了基于双边多尺度信息融合网络的FPDANet来解决上述挑战。具体而言,我们设计了位置注意力机制(DAN)模块,利用特征之间的相似性建立不同空间位置特征间的依赖关系,并增强特征表示能力。此外,我们还设计了一种双边多尺度(FPAN)的信息融合模块,用于捕捉上下文和全局特性在不同特征尺度上的依赖关系,从而进一步提高模型的表现力。实验结果显示,FPDANet在Top-1和Top-5指标上分别获得了91.05%和100%的分类准确率,证明了该网络的有效性和鲁棒性。
https://arxiv.org/abs/2506.06054
Orthopoxvirus infections must be accurately classified from medical images to enable easy, early diagnosis and epidemic prevention. Traditional diagnostic techniques can be time-consuming and require expert interpretation, and the datasets available for the different types of Orthopoxvirus are few and biased, which highlights the need for automated and scalable solutions. To improve classification performance and lower computational costs, this paper puts forward a hybrid strategy that combines Machine Learning models with pretrained Deep Learning models to extract deep feature representations without the need for augmented data. The findings show that this feature extraction method, when paired with other state-of-the-art methods, produces excellent classification outcomes while preserving training and inference efficiency. The proposed approach demonstrates strong generalization and robustness across multiple evaluation settings, offering a scalable and interpretable solution for real-world clinical deployment.
正痘病毒(Orthopoxvirus)感染需要从医学图片中准确分类,以便实现早期和简便的诊断及疫情预防。由于传统诊断技术耗时且依赖专家解读,并且不同类型的正痘病毒数据集较少且有偏见,因此强调了自动化的、可扩展解决方案的需求。为了提高分类性能并降低计算成本,本文提出了一种混合策略,该策略结合使用机器学习模型与预训练的深度学习模型来提取深层特征表示,而无需增加增强数据(augmented data)。研究结果表明,当此特征提取方法与其他前沿方法相结合时,在保持高效训练和推理效率的同时,能够产生优异的分类效果。所提出的这种方法在多种评估设置下展示了强大的泛化能力和鲁棒性,并为现实世界中的临床应用提供了可扩展且易于解释的解决方案。
https://arxiv.org/abs/2506.06007
Spiking Neural Networks (SNNs) are noted for their brain-like computation and energy efficiency, but their performance lags behind Artificial Neural Networks (ANNs) in tasks like image classification and object detection due to their limited representational capacity. To address this, we propose a novel spiking neuron, Integer Binary-Range Alignment Leaky Integrate-and-Fire, which exponentially expands the information expression capacity of spiking neurons with only a slight energy increase. This is achieved through Integer Binary Leaky Integrate-and-Fire and a range alignment strategy. The Integer Binary Leaky Integrate-and-Fire allows integer-valued activation during training and maintains spike-driven dynamics during inference through a binary conversion that expands virtual timesteps. The range alignment strategy is designed to solve the spike activation limitation problem, in which neurons fail to activate high integer values. Experiments show that our method outperforms previous SNNs, achieving 74.19% accuracy on ImageNet and 66.2% mAP@50 and 49.1% mAP@50:95 on COCO, surpassing previous bests with the same architecture by +3.45%, +1.6%, and +1.8%, respectively. Notably, our SNNs match or exceed ANNs' performance with the same architecture, and the energy efficiency is improved by 6.3×.
脉冲神经网络(SNN)因其类似大脑的计算方式和高能效而受到关注,但在图像分类、目标检测等任务中,其性能仍落后于人工神经网络(ANN),主要原因是表示能力有限。为解决这一问题,我们提出了一种新颖的脉冲神经元——整数二进制范围对齐漏级积分放电(IBLIF)神经元。通过这种设计,可以在几乎不增加能耗的情况下极大地扩展脉冲神经元的信息表达能力。 我们的方法结合了整数二进制漏级积分放电机制和范围对齐策略。在训练过程中,整数二进制漏级积分放电允许使用整数值进行激活,并且保持脉冲驱动的动力学特性;通过二值转换,在推理阶段扩展虚拟时间步长。此外,范围对齐策略专门用于解决神经元无法激活高整数值的问题。 实验结果表明,我们的方法在性能上优于以前的SNN模型:在ImageNet数据集上的准确率达到74.19%,在COCO数据集上的平均精度(mAP@50和mAP@50:95)分别为66.2%和49.1%,分别超过了采用相同架构的最佳先前记录3.45%、1.6%和1.8%。值得注意的是,我们的SNN模型在与ANN使用相同的架构时能够达到或超过其性能,并且能效提高了6.3倍。
https://arxiv.org/abs/2506.05679
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL. Moreover, cross-module concept guidance enforces tight alignment between the backbone's feature representations and the prototypical concept activation. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision processes. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot settings. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
自解释模型(SEMs)依赖于原型概念学习(PCL),以使其视觉识别过程更具可解释性,但在数据稀缺的情况下,由于训练样本不足而导致次优性能的问题常常困扰着它们。为了解决这一限制,我们提出了一种基于少量样本的原型概念分类框架(FSPCC),该框架系统地缓解了低数据环境下的两大挑战:参数不平衡和表示偏差。具体而言,我们的方法采用混合LoRA专家(MoLE)来实现参数高效适应,并确保骨干网络与PCL之间的可训练参数分配平衡。此外,跨模块的概念引导强制执行骨干网络的特征表示与原型概念激活之间的一致性。为了进一步丰富所学习的表示并缓解数据不足带来的挑战,我们还整合了一种多级特征保持策略,该策略融合了空间和语义线索,并将其应用于各个层级中。 为了增强模型解释性和最小化概念重叠,我们引入了一种几何感知的概念辨别损失函数,通过鼓励概念间的正交性来促进更独立且透明的决策过程。实验结果在六个流行基准数据集(CUB-200-2011、mini-ImageNet、CIFAR-FS、Stanford Cars、FGVC-Aircraft 和 DTD)上显示出我们的方法显著优于现有的SEMs,特别是在五路五样本设置中获得了4.2%-8.7%的相对性能提升。这些发现强调了将概念学习与少量样本适应相结合以实现更高准确性和更清晰模型解释性的有效性,并为开发更加透明的视觉识别系统铺平了道路。
https://arxiv.org/abs/2506.04673
Medical image classification is crucial for diagnosis and treatment, benefiting significantly from advancements in artificial intelligence. The paper reviews recent progress in the field, focusing on three levels of solutions: basic, specific, and applied. It highlights advances in traditional methods using deep learning models like Convolutional Neural Networks and Vision Transformers, as well as state-of-the-art approaches with Vision Language Models. These models tackle the issue of limited labeled data, and enhance and explain predictive results through Explainable Artificial Intelligence.
医学图像分类对于诊断和治疗至关重要,得益于人工智能领域的进步而受益良多。该论文回顾了最近在这一领域取得的进展,并重点关注三种层次的解决方案:基础、特定和应用层面。它强调了传统方法中使用深度学习模型(如卷积神经网络和视觉变换器)的进步,以及利用视觉语言模型等最新技术的方法。这些模型解决了标注数据有限的问题,并通过可解释的人工智能增强了并解释了预测结果。
https://arxiv.org/abs/2506.04129
In few-shot learning (FSL), labeled samples are scarce, so label errors can significantly reduce classification accuracy. Since label errors are inevitable in realistic learning tasks, improving the robustness of the model in the presence of label errors is critical. This paper proposes a new robust neural field-based image approach (RoNFA) for few-shot image classification with noisy labels. RoNFA consists of two neural fields, one for feature representation and one for category representation, which correspond to the feature space and the category set, respectively. Each neuron in the field for category representation (FCR) has a receptive field (RF) on the field for feature representation (FFR), centered at the representative neuron for its category generated by soft clustering. In the prediction stage, the range of these receptive fields adapts according to the neuronal activation in FCR to ensure prediction accuracy. These learning strategies provide the proposed model with excellent few-shot learning capability and strong robustness against label noise. The experimental results on real-world FSL datasets with three different types of label noise demonstrate that the proposed method significantly outperforms state-of-the-art FSL methods. Its accuracy in the presence of noisy labels even surpasses the results obtained by state-of-the-art FSL methods trained on clean support sets, indicating its strong robustness against noisy labels.
在少量样本学习(FSL)中,标注样本稀缺。因此,标签错误会显著降低分类准确度。由于实际学习任务中的标签错误不可避免,提高模型在存在标签错误情况下的鲁棒性至关重要。本文提出了一种新的基于神经场的稳健图像方法(RoNFA),用于处理带有噪声标签的少量样本图像分类问题。RoNFA包含两个神经场,分别用于特征和类别的表示。这两个神经场对应于特征空间和类别集合。 在用于类别表示的神经场(FCR)中,每个神经元在其用于特征表示的神经场(FFR)上有感受野(RF),这些感受野以通过软聚类生成的该类别的代表性神经元为中心。在预测阶段,这些感受野的范围会根据FCR中的神经元激活情况而调整,确保预测准确性。 这些学习策略为所提出的模型提供了出色的少量样本学习能力和对标签噪声的强大鲁棒性。在具有三种不同类型标签噪声的真实世界FSL数据集上的实验结果表明,该方法显著优于最先进的FSL方法。即使在存在噪声标签的情况下获得的准确率也超过了用干净支持集训练的最先进的FSL方法的结果,这显示了其强大的抗噪标签能力。
https://arxiv.org/abs/2506.03461
Previous studies have compared the brain and deep neural networks trained on image classification. Intriguingly, while some suggest that their representations are highly similar, others argue the opposite. Here, we propose a new approach to characterize the similarity of the decision strategies of two observers (models or brains) using decision variable correlation (DVC). DVC quantifies the correlation between decoded decisions on individual samples in a classification task and thus can capture task-relevant information rather than general representational alignment. We evaluate this method using monkey V4/IT recordings and models trained on image classification tasks. We find that model-model similarity is comparable to monkey-monkey similarity, whereas model-monkey similarity is consistently lower and, surprisingly, decreases with increasing ImageNet-1k performance. While adversarial training enhances robustness, it does not improve model-monkey similarity in task-relevant dimensions; however, it markedly increases model-model similarity. Similarly, pre-training on larger datasets does not improve model-monkey similarity. These results suggest a fundamental divergence between the task-relevant representations in monkey V4/IT and those learned by models trained on image classification tasks.
先前的研究已经比较了在图像分类任务上训练的脑和深度神经网络。有趣的是,虽然有些人认为它们的表现形式非常相似,但也有人持相反意见。在这里,我们提出了一种新的方法来表征两个观察者(模型或大脑)决策策略之间的相似性,使用的是决策变量相关性(DVC)。DVC 量化了在分类任务中个体样本上解码的决策之间的相关性,因此可以捕捉到与任务相关的特定信息而非一般的表示对齐。我们通过猴子 V4/IT 记录和在图像分类任务上训练的模型来评估这种方法。 我们的研究发现,模型与模型之间的相似度可比拟于猴子与猴子之间的相似度,而模型与猴子之间的相似度始终较低,并且令人惊讶的是,随着 ImageNet-1k 性能的提升,这种相似性反而下降。尽管对抗训练可以提高模型的鲁棒性,但它并没有改善模型和猴子在任务相关维度上的相似性;然而,它显著提高了模型之间的一致性。同样地,更大的数据集预训练也没有改进模型与猴子之间的相似度。 这些结果表明,在执行图像分类任务时,经过训练的模型中任务相关的表示形式与猴子 V4/IT 中的表现存在根本性的差异。
https://arxiv.org/abs/2506.02164
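The DVC abstract above describes a correlation between the decoded decisions of two observers on individual samples. One way to read that is sketched below, under stated assumptions: each observer supplies a scalar decision variable per sample (for example, a projection onto a linear decoder axis), the per-class mean is subtracted so that the two observers do not look similar merely because both separate the classes, and the Pearson correlation of the residuals is reported. The function name `decision_variable_correlation` and the class-mean subtraction step are assumptions for illustration; the paper's exact estimator may differ.

```python
import numpy as np


def decision_variable_correlation(dv_a, dv_b, labels) -> float:
    """Pearson correlation of class-mean-removed decision variables.

    Hypothetical helper: `dv_a` and `dv_b` are per-sample scalar decision
    variables from two observers on the same samples; `labels` gives the
    true class of each sample.
    """
    dv_a = np.asarray(dv_a, dtype=float)
    dv_b = np.asarray(dv_b, dtype=float)
    labels = np.asarray(labels)

    residual_a = np.empty_like(dv_a)
    residual_b = np.empty_like(dv_b)
    for cls in np.unique(labels):
        mask = labels == cls
        # Remove each observer's mean response to this class, leaving
        # only the sample-by-sample fluctuations around that mean.
        residual_a[mask] = dv_a[mask] - dv_a[mask].mean()
        residual_b[mask] = dv_b[mask] - dv_b[mask].mean()

    return float(np.corrcoef(residual_a, residual_b)[0, 1])


# Two observers that both separate the classes but disagree on which
# samples within each class are easier or harder.
labels = [0, 0, 1, 1]
observer_a = [-1.0, 1.0, 9.0, 11.0]
observer_b = [1.0, -1.0, 11.0, 9.0]
dvc = decision_variable_correlation(observer_a, observer_b, labels)
```

In this toy case the raw decision variables are strongly positively correlated because both observers respond more to class 1, while the class-conditioned correlation is negative, which is the distinction between general alignment and a shared per-sample decision strategy that the abstract points to.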