Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We apply interpretability indicators -- originally proposed alongside SVDA -- to monitor attention dynamics during training and to assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks -- CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 -- demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.
https://arxiv.org/abs/2602.10994
Contrastive learning has demonstrated great success in representation learning, especially for image classification tasks. However, studies targeting regression tasks remain scarce, particularly for applications on hyperspectral data. In this paper, we propose a spectral-spatial contrastive learning framework for regression tasks on hyperspectral data, with a model-agnostic design that can enhance backbones such as 3D convolutional and transformer-based networks. Moreover, we provide a collection of transformations relevant for augmenting hyperspectral data. Experiments on synthetic and real datasets show that the proposed framework and transformations significantly improve the performance of all studied backbone models.
https://arxiv.org/abs/2602.10745
Deploying vision foundation models typically relies on efficient adaptation strategies, as conventional full fine-tuning suffers from prohibitive costs and low efficiency. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages cannot be directly transferred to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that introduces only about 1% additional parameters to the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and we address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (a remote sensing scenario) demonstrate that, for the first time, CoLin outperforms both full fine-tuning and classical delta-tuning approaches with merely 1% of the parameters, providing a novel and efficient solution for the deployment of vision foundation models. We release the code on this https URL.
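As a rough illustration of the parameter arithmetic behind a low-rank complex adapter (the function name and the shape of the update are assumptions drawn only from the abstract; CoLin's actual architecture, merge rule, and tailored loss are not reproduced here), a sketch in NumPy:

```python
import numpy as np

def colin_style_delta(d_out, d_in, rank, rng):
    """Illustrative low-rank complex weight update delta = B @ A.

    B is (d_out, rank) and A is (rank, d_in), both complex, so the
    update stores 2 * rank * (d_out + d_in) real numbers -- far fewer
    than the d_out * d_in entries of the full weight it perturbs.
    How such a complex update is merged into a real-valued backbone
    is a paper-specific detail this sketch does not model.
    """
    B = rng.normal(size=(d_out, rank)) + 1j * rng.normal(size=(d_out, rank))
    A = rng.normal(size=(rank, d_in)) + 1j * rng.normal(size=(rank, d_in))
    return B @ A
```

For a 64x64 weight with rank 4, the adapter adds 2 * 4 * (64 + 64) = 1024 real parameters versus 4096 in the full matrix, and the resulting update has rank at most 4 by construction.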
https://arxiv.org/abs/2602.10513
Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
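The scoring rule described above, novelty as the increase in diversity, can be sketched directly from the published definition of the Vendi Score (the exponential of the Shannon entropy of the similarity-kernel eigenvalues). The cosine kernel and function names below are illustrative assumptions:

```python
import numpy as np

def vendi_score(feats):
    """Vendi Score of a feature set under a cosine-similarity kernel.

    VS = exp(Shannon entropy of the eigenvalues of K/n): n identical
    samples give VS = 1, n mutually orthogonal samples give VS = n.
    """
    Z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    K = Z @ Z.T
    lam = np.linalg.eigvalsh(K / len(Z))
    lam = lam[lam > 1e-12]  # drop numerical zeros before the entropy
    return float(np.exp(-np.sum(lam * np.log(lam))))

def vendi_novelty(train_feats, x):
    """Novelty of x = how much it increases the set's Vendi Score."""
    return vendi_score(np.vstack([train_feats, x])) - vendi_score(train_feats)
```

A point orthogonal to the in-distribution features raises the score by a full unit, while a duplicate of an existing feature lowers it, which is the intuition behind using the increase as an OOD signal.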
https://arxiv.org/abs/2602.10062
Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
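A minimal sketch of the two objectives described above: element-wise binary cross-entropy against the teacher's binary message, plus a coding-rate term encouraging channel utilization. The logdet coding-rate formulation is a standard choice assumed here; the paper's exact regularizer and weighting are not specified in the abstract:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discrete_agreement_loss(student_logits, teacher_bits, eps=1e-9):
    """Element-wise BCE between the student's predicted bit
    probabilities and the teacher's binary message."""
    p = sigmoid(student_logits)
    bce = -(teacher_bits * np.log(p + eps)
            + (1.0 - teacher_bits) * np.log(1.0 - p + eps))
    return float(bce.mean())

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 logdet(I + d/(n eps^2) Z^T Z): larger when the batch
    of n d-dimensional embeddings spreads over more of the channel."""
    n, d = Z.shape
    M = np.eye(d) + (d / (n * eps ** 2)) * (Z.T @ Z)
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * logdet
```

Confident, correct bit predictions drive the BCE toward zero, while embeddings that collapse onto a single direction score a lower coding rate than ones that spread across dimensions.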
https://arxiv.org/abs/2602.09764
Contrastive learning (CL) is a predominant technique in image classification, but it shows limited performance on imbalanced datasets. Recently, several supervised CL methods have been proposed to promote an ideal regular-simplex geometric configuration in the representation space, characterized by intra-class feature collapse and uniform inter-class mean spacing, especially for imbalanced datasets. In particular, existing prototype-based methods include class prototypes as additional samples so that all classes are represented. However, the existing CL methods suffer from two limitations. First, they do not consider the alignment between the class means/prototypes and the classifiers, which can lead to poor generalization. Second, existing prototype-based methods treat prototypes as only one additional sample per class, making their influence depend on the number of class instances in a batch and causing unbalanced contributions across classes. To address these limitations, we propose Equilibrium Contrastive Learning (ECL), a supervised CL framework designed to promote geometric equilibrium, in which class features, means, and classifiers are harmoniously balanced under data imbalance. The proposed ECL framework uses two main components. First, ECL promotes representation geometric equilibrium (i.e., a regular simplex geometry characterized by collapsed class samples and uniformly distributed class means), while balancing the contributions of class-average features and class prototypes. Second, ECL establishes a classifier-class-center geometric equilibrium by aligning classifier weights and class prototypes. We ran experiments on three long-tailed datasets, CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT, and two imbalanced medical datasets, ISIC 2019 and our constructed LCCT dataset. Results show that ECL outperforms existing state-of-the-art supervised CL methods designed for imbalanced classification.
https://arxiv.org/abs/2602.09506
Domain adaptation (DA) is a quickly expanding area of machine learning that involves adjusting a model trained in one domain to perform well in another. While notable progress has been made, the fundamental concept of numerous DA methodologies has persisted: aligning data from various domains into a shared feature space. In this space, knowledge acquired from labeled source data can improve model training on target data that lacks sufficient labels. In this study, we demonstrate the use of 10 deep learning models to simulate common DA techniques and explore their application on four medical image datasets. We consider various situations, including multi-modality, noisy data, federated learning (FL), interpretability analysis, and classifier calibration. The experimental results indicate that using DA with ResNet34 on a brain tumor (BT) dataset yields a 4.7\% improvement in model performance. Similarly, DA can reduce the impact of Gaussian noise, providing a $\sim 3\%$ accuracy increase with ResNet34 on the BT dataset. Furthermore, simply introducing DA into an FL framework shows limited potential (e.g., a $\sim 0.3\%$ performance increase) for skin cancer classification. In addition, DA can improve the interpretability of the models, as shown with the Grad-CAM++ technique, which offers clinical value. Calibration analysis also demonstrates that DA yields an expected calibration error (ECE) roughly 2\% lower than a CNN alone on a multi-modality dataset.
https://arxiv.org/abs/2602.09355
Accurate classification of breast cancer histopathology images is pivotal for early oncological diagnosis and therapeutic planning. However, conventional deep learning architectures often encounter performance degradation under limited annotations and suffer from a "black-box" nature, hindering their clinical integration. To mitigate these limitations, we propose GAFRNet, a robust and interpretable Graph Attention and Fuzzy-Rule Network specifically engineered for histopathology image classification with scarce supervision. GAFRNet constructs a similarity-driven graph representation to model inter-sample relationships and employs a multi-head graph attention mechanism to capture complex relational features across heterogeneous tissue structures. Furthermore, a differentiable fuzzy-rule module encodes intrinsic topological descriptors, including node degree, clustering coefficient, and label consistency, into explicit, human-understandable diagnostic logic. This design establishes transparent "IF-THEN" mappings that mimic the heuristic deduction process of medical experts, providing clear reasoning behind each prediction without relying on post-hoc attribution methods. Extensive evaluations on three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018) demonstrate that GAFRNet consistently outperforms various state-of-the-art methods across multiple magnifications and classification tasks. These results validate the superior generalization and practical utility of GAFRNet as a reliable decision-support tool for weakly supervised medical image analysis.
https://arxiv.org/abs/2602.09318
Despite strong performance in data-rich regimes, deep learning often underperforms in the data-scarce settings common in practice. While foundation models (FMs) trained on massive datasets demonstrate strong generalization by extracting general-purpose features, they can still suffer from scarce labeled data during downstream fine-tuning. To address this, we propose GeLDA, a semantics-aware generative latent data augmentation framework that leverages conditional diffusion models to synthesize samples in an FM-induced latent space. Because this space is low-dimensional and concentrates task-relevant information compared to the input space, GeLDA enables efficient, high-quality data generation. GeLDA conditions generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains, facilitating data augmentation in low-resource domains. We validate GeLDA in two large-scale recognition tasks: (a) in zero-shot language-specific speech emotion recognition, GeLDA improves the Whisper-large baseline's unweighted average recall by 6.13%; and (b) in long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result.
尽管在数据丰富的环境中,深度学习表现强劲,但在实践中常见的数据稀缺场景中却往往表现不佳。虽然基于大规模数据集训练的基础模型(FMs)通过提取通用特征展示了强大的泛化能力,但它们在下游微调时仍会受到标签数据不足的影响。为解决这一问题,我们提出了GeLDA——一个语义感知的生成隐式数据增强框架,该框架利用条件扩散模型在基础模型诱导的潜在空间中合成样本。由于这个空间是低维且集中了任务相关信息(相比输入空间),GeLDA能够实现高效、高质量的数据生成。通过辅助特征向量对生成过程进行控制,这些向量捕捉类别或子域之间的语义关系,使得在资源匮乏领域内数据增强成为可能。我们在两个大规模识别任务中验证了GeLDA的效果:(a) 在零样本特定语言的语音情感识别中,GeLDA将Whisper-large基线模型的无权平均召回率提高了6.13%;以及(b) 在长尾图像分类中,在ImageNet-LT数据集上实现了74.7%的尾部类别准确率,刷新了最新的最佳结果。
https://arxiv.org/abs/2602.02841
Machine learning (ML) methods have been successful in many areas, such as image classification and natural language processing. However, how to apply ML to areas with mathematical constraints, such as solving partial differential equations (PDEs), remains to be determined. Among various approaches to applying ML techniques to solving PDEs, the data-driven discretization method presents a promising way of accelerating and improving existing PDE solvers on structured grids: it predicts the coefficients of quasi-linear stencils for computing values or derivatives of a function at given positions. It can improve the accuracy and stability of low-resolution simulations compared with traditional finite difference or finite volume schemes. Meanwhile, it can also benefit from traditional numerical schemes, for example achieving conservation by adopting finite-volume-type formulations. In this thesis, we implement classic solvers for the shallow water equations and the Euler equations under a different framework. Experiments show that our classic solver performs much better than the Pyclaw solver. We then propose four different deep neural networks for the ML-based solver. The results indicate that two of these approaches can output satisfactory solutions.
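For context, the quasi-linear stencils mentioned above reduce to classical finite differences when the coefficients are fixed; a data-driven discretization would predict the coefficients per grid point with a neural network instead. A sketch applying the standard centred-difference coefficients on a periodic grid (all names illustrative):

```python
import numpy as np

def apply_stencil(u, coeffs, h):
    """Estimate du/dx on a periodic 1D grid with a linear stencil.

    With coeffs = [-0.5, 0.0, 0.5] this is the classical second-order
    centred difference (u[i+1] - u[i-1]) / (2h); a learned discretization
    would supply different coefficients at each point.
    """
    c = np.asarray(coeffs, dtype=float)
    k = len(c) // 2  # half-width of the stencil
    out = np.zeros_like(u, dtype=float)
    for i, ci in enumerate(c):
        # np.roll(u, k - i)[j] == u[j - (k - i)], i.e. the neighbour at offset i - k
        out += ci * np.roll(u, k - i)
    return out / h
```

On 128 periodic points, differentiating sin(x) this way matches cos(x) to within about 4e-4, the expected O(h^2) truncation error of the centred scheme.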
https://arxiv.org/abs/2602.08670
Accurate identification of cat breeds from images is a challenging task due to subtle differences in fur patterns, facial structure, and color. In this paper, we present a deep learning-based approach for classifying cat breeds using a subset of the Oxford-IIIT Pet Dataset, which contains high-resolution images of various domestic breeds. We employed the tiny variant of the Global Context Vision Transformer (GCViT-Tiny) architecture for cat breed recognition. To improve model generalization, we used extensive data augmentation, including rotation, horizontal flipping, and brightness adjustment. Experimental results show that the GCViT-Tiny model achieved a test accuracy of 92.00% and a validation accuracy of 94.54%. These findings highlight the effectiveness of transformer-based architectures for fine-grained image classification tasks. Potential applications include veterinary diagnostics, animal shelter management, and mobile-based breed recognition systems. We also provide a Hugging Face demo at this https URL.
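The augmentation policy named above (rotation, horizontal flip, brightness adjustment) can be sketched in a few lines of NumPy; the 90-degree rotation grid and the 0.8-1.2 brightness range are assumptions for illustration, not the paper's actual settings:

```python
import numpy as np

def augment(img, rng):
    """One random augmentation pass over an HxWxC image in [0, 1].

    Applies a random 90-degree rotation, a coin-flip horizontal flip,
    and multiplicative brightness jitter (illustrative ranges).
    """
    img = np.rot90(img, k=int(rng.integers(0, 4)))  # rotate 0/90/180/270 degrees
    if rng.random() < 0.5:
        img = img[:, ::-1]                           # horizontal flip (width axis)
    factor = rng.uniform(0.8, 1.2)                   # brightness jitter
    return np.clip(img * factor, 0.0, 1.0)
```

For square inputs the shape is preserved and the output stays a valid image in [0, 1]; a training loop would call this per sample, per epoch.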
https://arxiv.org/abs/2602.07534
Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated against these goals. Instead, most prior work evaluates OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) they provide limited insight into the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.
https://arxiv.org/abs/2602.07532
Prototypical parts-based models offer a "this looks like that" paradigm for intrinsic interpretability, yet they typically struggle with ImageNet-scale generalization and often require computationally expensive backbone finetuning. Furthermore, existing methods frequently suffer from "prototype drift," where learned prototypes lack tangible grounding in the training distribution and change their activations under small perturbations. We present ProtoQuant, a novel architecture that achieves prototype stability and grounded interpretability through latent vector quantization. By constraining prototypes to a discrete learned codebook within the latent space, we ensure they remain faithful representations of the training data without the need to update the backbone. This design allows ProtoQuant to function as an efficient, interpretable head that scales to large-scale datasets. We evaluate ProtoQuant on ImageNet and several fine-grained benchmarks (CUB-200, Cars-196). Our results demonstrate that ProtoQuant achieves competitive classification accuracy while generalizing to ImageNet, with interpretability metrics comparable to those of other prototypical-parts-based methods.
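The core operation implied by "constraining prototypes to a discrete learned codebook" is standard vector quantization: snap each latent vector to its nearest codebook entry. A minimal sketch of that step (the paper's training procedure and codebook learning are not reproduced):

```python
import numpy as np

def quantize_to_prototypes(z, codebook):
    """Assign each latent vector to its nearest codebook entry (L2).

    z: (n, d) latents; codebook: (K, d) learned prototype vectors.
    Returns the quantized latents and the prototype indices, which
    identify which grounded prototype fired for each input.
    """
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, K)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx
```

Because every returned vector is literally a codebook row, the "prototype" shown to a user is guaranteed to be a fixed, trainable entry rather than a drifting continuous activation.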
https://arxiv.org/abs/2602.06592
Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low-rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration that do not rely on data replay or multiple adapters. We propose Share, a novel approach to parameter-efficient continual fine-tuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer while minimizing catastrophic interference. This approach achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods while maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.
https://arxiv.org/abs/2602.06043
Membership inference attacks (MIAs) aim to determine whether a sample was part of a model's training set, posing serious privacy risks for modern machine-learning systems. Existing MIAs primarily rely on static indicators, such as loss or confidence, and do not fully leverage the dynamic behavior of models when actively probed. We propose LeakBoost, a perceptual-loss-based interrogation framework that actively probes a model's internal representations to expose hidden membership signals. Given a candidate input, LeakBoost synthesizes an interrogation image by optimizing a perceptual (activation-space) objective, amplifying representational differences between members and non-members. This image is then analyzed by an off-the-shelf membership detector, without modifying the detector itself. When combined with existing membership inference methods, LeakBoost achieves substantial improvements at low false-positive rates across multiple image classification datasets and diverse neural network architectures. In particular, it raises AUC from near-chance levels (0.53-0.62) to 0.81-0.88, and increases TPR at 1 percent FPR by over an order of magnitude compared to strong baseline attacks. A detailed sensitivity analysis reveals that deeper layers and short, low-learning-rate optimization produce the strongest leakage, and that improvements concentrate in gradient-based detectors. LeakBoost thus offers a modular and computationally efficient way to assess privacy risks in white-box settings, advancing the study of dynamic membership inference.
https://arxiv.org/abs/2602.05748
Early-exit neural networks enable adaptive inference by allowing predictions at intermediate layers, reducing computational cost. However, early exits often lack interpretability and may focus on different features than deeper layers, limiting trust and explainability. This paper presents Explanation-Guided Training (EGT), a multi-objective framework that improves interpretability and consistency in early-exit networks through attention-based regularization. EGT introduces an attention consistency loss that aligns early-exit attention maps with the final exit. The framework jointly optimizes classification accuracy and attention consistency through a weighted combination of losses. Experiments on a real-world image classification dataset demonstrate that EGT achieves up to 98.97% overall accuracy (matching baseline performance) with a 1.97x inference speedup through early exits, while improving attention consistency by up to 18.5% compared to baseline models. The proposed method provides more interpretable and consistent explanations across all exit points, making early-exit networks more suitable for explainable AI applications in resource-constrained environments.
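A minimal sketch of the two loss terms described above; the L1 normalization of attention maps and the MSE penalty are assumptions about details the abstract leaves open:

```python
import numpy as np

def attention_consistency_loss(early_attn, final_attn, eps=1e-9):
    """MSE between L1-normalized attention maps of an early exit and
    the final exit: zero when the early exit attends to the same
    locations, in the same proportions, as the final exit."""
    a = early_attn / (early_attn.sum(axis=-1, keepdims=True) + eps)
    b = final_attn / (final_attn.sum(axis=-1, keepdims=True) + eps)
    return float(((a - b) ** 2).mean())

def egt_total_loss(ce_losses, attn_losses, alpha=0.5):
    """Weighted multi-objective total: classification cross-entropy at
    every exit plus alpha times the attention-consistency penalties."""
    return float(np.mean(ce_losses) + alpha * np.mean(attn_losses))
```

Identical attention maps contribute nothing to the total, so the consistency term only pushes early exits whose focus diverges from the final exit.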
https://arxiv.org/abs/2601.08891
Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of task-specific textual embeddings corresponding to lightweight-tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in a reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal for guiding active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.
https://arxiv.org/abs/2602.04340
Continual learning (or class incremental learning) is a realistic learning scenario for computer vision systems, where deep neural networks are trained on episodic data, and the data from previous episodes are generally inaccessible to the model. Existing research in this domain has primarily focused on avoiding catastrophic forgetting, which occurs due to the continuously changing class distributions in each episode and the inaccessibility of the data from previous episodes. However, these methods assume that all the training samples in every episode are annotated; this not only incurs a huge annotation cost, but also results in wasted annotation effort, since most of the samples in a given episode will not be accessible to the model in subsequent episodes. Active learning algorithms identify the salient and informative samples from large amounts of unlabeled data and are instrumental in reducing the human annotation effort required to train a deep neural network. In this paper, we propose ACIL, a novel active learning framework for class incremental learning settings. We exploit a criterion based on uncertainty and diversity to identify the exemplar samples that need to be annotated in each episode and will be appended to the data in the next episode. Such a framework can drastically reduce annotation cost and can also avoid catastrophic forgetting. Our extensive empirical analyses on several vision datasets corroborate the promise and potential of our framework against relevant baselines.
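The uncertainty-plus-diversity criterion described above can be illustrated with predictive entropy for uncertainty and k-center greedy for diversity; the pre-filtering fraction and all names are illustrative assumptions, not ACIL's exact algorithm:

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Per-sample entropy of the softmax outputs (higher = more uncertain)."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_exemplars(probs, feats, budget, top_frac=0.5):
    """Pick `budget` exemplars: keep the most uncertain candidates,
    then choose a diverse subset among them with k-center greedy."""
    n = len(probs)
    m = min(n, max(budget, int(n * top_frac)))
    cand = np.argsort(-predictive_entropy(probs))[:m]  # most uncertain first
    chosen = [int(cand[0])]
    # distance from every candidate to its nearest already-chosen sample
    d = np.linalg.norm(feats[cand] - feats[chosen[0]], axis=1)
    while len(chosen) < budget:
        far = int(cand[int(np.argmax(d))])  # farthest remaining candidate
        chosen.append(far)
        d = np.minimum(d, np.linalg.norm(feats[cand] - feats[far], axis=1))
    return chosen
```

On a toy pool where two samples are uncertain and far apart in feature space, the selector returns exactly those two, combining both signals rather than either alone.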
https://arxiv.org/abs/2602.04252
Identifying out-of-distribution (OOD) data at inference time is crucial for many machine learning applications, especially for automation. We present COMBOOD, a novel unsupervised semi-parametric framework for OOD detection in image recognition. Our framework combines signals from two distance metrics, nearest-neighbor and Mahalanobis, to derive a confidence score for an inference point being out-of-distribution. The former provides a non-parametric approach to OOD detection. The latter provides a parametric, simple, yet effective method for detecting OOD data points, especially in the far-OOD scenario, where the inference point is far from the training data set in the embedding space; however, its performance is not satisfactory in the near-OOD scenarios that arise in practical situations. Our COMBOOD framework combines the two signals in a semi-parametric setting to provide a confidence score that is accurate in both the near-OOD and far-OOD scenarios. We show experimental results with the COMBOOD framework for different types of feature extraction strategies. We demonstrate experimentally that COMBOOD outperforms state-of-the-art OOD detection methods in accuracy on the OpenOOD benchmark datasets (both version 1 and the most recent version 1.5, for both far-OOD and near-OOD) as well as on the documents dataset. On a majority of the benchmark datasets, the improvements in accuracy resulting from the COMBOOD framework are statistically significant. COMBOOD scales linearly with the size of the embedding space, making it ideal for many real-life applications.
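A toy version of the two signals being combined, k-NN distance (non-parametric) and Mahalanobis distance (parametric); the naive additive blend below is an assumption for illustration, as the paper's semi-parametric combination is more refined:

```python
import numpy as np

def toy_combined_score(x, train_feats, k=5):
    """Illustrative OOD score: k-NN distance plus Mahalanobis distance.

    Higher means more likely out-of-distribution. The k-NN term handles
    near-OOD cases where local density matters; the Mahalanobis term
    handles far-OOD points well separated from the training Gaussian fit.
    """
    dists = np.linalg.norm(train_feats - x, axis=1)
    knn = np.sort(dists)[:k].mean()                       # non-parametric signal
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    cov += 1e-6 * np.eye(train_feats.shape[1])            # regularize for inversion
    diff = x - mu
    maha = float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))  # parametric signal
    return knn + maha
```

Both terms are linear in the number of stored embeddings, consistent with the linear scaling claimed above; a point far from a Gaussian training cloud scores much higher than a point near its center.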
https://arxiv.org/abs/2602.07042
Masked Autoencoders (MAEs) achieve impressive performance in image classification tasks, yet the internal representations they learn remain poorly understood. This work started as an attempt to understand the strong downstream classification performance of MAE. In the process, we discover that the representations learned through pretraining and fine-tuning are quite robust, demonstrating good classification performance in the presence of degradations such as blur and occlusions. Through layer-wise analysis of token embeddings, we show that a pretrained MAE progressively constructs its latent space in a class-aware manner across network depth: embeddings from different classes lie in subspaces that become increasingly separable. We further observe that MAE exhibits early and persistent global attention across encoder layers, in contrast to standard Vision Transformers (ViTs). To quantify feature robustness, we introduce two sensitivity indicators: directional alignment between clean and perturbed embeddings, and head-wise retention of active features under degradations. These studies help establish the robust classification performance of MAEs.
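The two sensitivity indicators have natural direct implementations: cosine similarity between clean and perturbed embeddings, and the fraction of active features that survive a degradation. The zero activation threshold below is an assumed cutoff, not necessarily the paper's:

```python
import numpy as np

def directional_alignment(clean, perturbed):
    """Mean cosine similarity between clean and perturbed token
    embeddings (rows): 1.0 means the degradation left directions intact."""
    c = clean / np.linalg.norm(clean, axis=-1, keepdims=True)
    p = perturbed / np.linalg.norm(perturbed, axis=-1, keepdims=True)
    return float((c * p).sum(-1).mean())

def feature_retention(clean_act, pert_act, thresh=0.0):
    """Fraction of a head's active features (above `thresh`) that stay
    active under the degradation."""
    active = clean_act > thresh
    if active.sum() == 0:
        return 1.0
    return float((pert_act[active] > thresh).mean())
```

Identical embeddings score alignment 1.0 and retention 1.0; a degradation that silences half of a head's active features halves the retention score.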
https://arxiv.org/abs/2602.03531