Although Mamba models greatly improve Hyperspectral Image (HSI) classification, they face critical challenges in defining efficient and adaptive token sequences for improved performance. This paper therefore presents the CSSMamba (Clustering-guided Spatial-Spectral Mamba) framework to address these challenges, with the following contributions. First, to achieve efficient and adaptive token sequences for improved Mamba performance, we integrate a clustering mechanism into a spatial Mamba architecture, leading to a cluster-guided spatial Mamba module (CSpaMamba) that reduces the Mamba sequence length and improves its feature learning capability. Second, to improve the learning of both spatial and spectral information, we integrate the CSpaMamba module with a spectral Mamba module (SpeMamba), leading to a complete clustering-guided spatial-spectral Mamba framework. Third, to further improve feature learning capability, we introduce an Attention-Driven Token Selection mechanism to optimize Mamba token sequencing. Last, to seamlessly integrate clustering into the Mamba model in a coherent manner, we design a Learnable Clustering Module that learns cluster memberships adaptively. Experiments on the Pavia University, Indian Pines, and Liao-Ning 01 datasets demonstrate that CSSMamba achieves higher accuracy and better boundary preservation than state-of-the-art CNN-, Transformer-, and Mamba-based methods.
https://arxiv.org/abs/2601.16098
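The clustering-guided token sequencing described in the CSSMamba abstract above can be illustrated with a minimal sketch. This is a hypothetical stand-in, not the authors' implementation: pixel-level features are grouped by plain k-means, and each cluster's centroid serves as one token, shortening the sequence fed to a spatial Mamba block (the paper's Learnable Clustering Module learns memberships end-to-end instead).

```python
import numpy as np

def kmeans(features, k, iters=10, seed=0):
    """Plain k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each feature vector to its nearest centroid.
        d = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = features[labels == c].mean(axis=0)
    return centroids, labels

def cluster_tokens(patch_features, k):
    """Reduce an (N, D) pixel-token sequence to (k, D) cluster tokens:
    each centroid summarizes one cluster of pixels."""
    centroids, labels = kmeans(patch_features, k)
    return centroids, labels

feats = np.random.default_rng(1).normal(size=(64, 8))  # 64 pixel tokens
tokens, labels = cluster_tokens(feats, k=4)
print(tokens.shape)  # (4, 8): a much shorter sequence for the Mamba block
```

The point of the sketch is only the sequence-length reduction (64 tokens to 4); the adaptive, learnable membership assignment is what the paper contributes beyond this.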
This paper introduces the Total Variation Out-of-Distribution (TV-OOD) detection method, a novel approach to securing machine learning model deployments against potential distribution shifts in practical applications. Existing methods have produced satisfactory results, but TV-OOD improves upon them by leveraging the Total Variation Network Estimator to calculate each input's contribution to the overall total variation. By defining this contribution as the total variation score, TV-OOD discriminates between in-distribution and out-of-distribution data. The method's efficacy was tested across a range of models and datasets, consistently yielding results on image classification tasks that were comparable or superior to those of leading out-of-distribution detection techniques across all evaluation metrics.
https://arxiv.org/abs/2601.15867
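The TV-OOD abstract above does not spell out how the total variation score is computed, but like any score-based detector it must be combined with a calibrated decision rule. A minimal sketch of that generic step, with synthetic stand-in scores (the convention assumed here: higher score means more in-distribution):

```python
import numpy as np

def calibrate_threshold(id_scores, tpr=0.95):
    """Pick the threshold that keeps `tpr` of in-distribution
    validation scores above it (the standard 95%-TPR operating point)."""
    return np.quantile(id_scores, 1.0 - tpr)

def is_ood(scores, threshold):
    """Inputs scoring below the calibrated threshold are flagged OOD."""
    return scores < threshold

rng = np.random.default_rng(0)
id_val = rng.normal(5.0, 1.0, 1000)  # stand-in ID validation scores
ood = rng.normal(1.0, 1.0, 1000)     # stand-in OOD scores
thr = calibrate_threshold(id_val)
print(is_ood(ood, thr).mean())       # fraction of OOD inputs caught
```

Swapping the synthetic scores for per-input total variation scores would turn this scaffold into the evaluation protocol the abstract's metrics (e.g. FPR at 95% TPR) are computed under.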
In hyperspectral image classification (HSIC), most deep learning models rely on opaque spectral-spatial feature mixing, limiting their interpretability and hindering understanding of internal decision mechanisms. We present a physical-spectrum-aware white-box mHC, named ES-mHC, a hyper-connection framework that explicitly models interactions among different electromagnetic spectrum groupings (the residual streams in mHC) using structured, directional matrices. By separating feature representation from interaction structure, ES-mHC promotes electromagnetic spectrum grouping specialization, reduces redundancy, and exposes internal information flow that can be directly visualized and spatially analyzed. Using hyperspectral image classification as a representative testbed, we demonstrate that the learned hyper-connection matrices exhibit coherent spatial patterns and asymmetric interaction behaviors, providing mechanistic insight into the model's internal dynamics. Furthermore, we find that increasing the expansion rate accelerates the emergence of structured interaction patterns. These results suggest that ES-mHC transforms HSIC from a purely black-box prediction task into a structurally transparent, partially white-box learning process.
https://arxiv.org/abs/2601.15757
Training deep computer vision models requires manual oversight or hyperparameter tuning of the learning rate (LR) schedule. While existing adaptive optimizers schedule the LR automatically, they suffer from computational and memory overhead, incompatibility with regularization, and suboptimal LR choices. In this work, we introduce the ZENITH (Zero-overhead Evolution using Norm-Informed Training History) optimizer, which adapts the LR using the temporal evolution of the gradient norm. Image classification experiments spanning 6 CNN architectures and 6 benchmarks demonstrate that ZENITH achieves higher test accuracy in lower wall-clock time than baselines. It also yields superior mAP in object detection, keypoint detection, and instance segmentation on MS COCO using the R-CNN family of models. Furthermore, its compatibility with regularization enables even better generalization.
https://arxiv.org/abs/2601.15212
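The ZENITH abstract above does not give the exact update rule, so the following is only a plausible sketch of an LR schedule driven by gradient-norm history: the base LR is scaled by the ratio of a long-horizon to a short-horizon EMA of the gradient norm, so the LR grows while the norm shrinks steadily and falls back when it spikes. All names and constants here are illustrative assumptions.

```python
import numpy as np

def zenith_like_lr(norm_hist, base_lr, beta_s=0.5, beta_l=0.9,
                   lo=0.5, hi=1.5):
    """Hypothetical norm-informed LR rule (NOT the paper's actual one).
    Recomputes both EMAs from scratch for clarity; a real optimizer
    would carry them as running state at zero extra memory cost."""
    s = l = norm_hist[0]
    for g in norm_hist[1:]:
        s = beta_s * s + (1 - beta_s) * g   # fast EMA: recent norms
        l = beta_l * l + (1 - beta_l) * g   # slow EMA: older norms
    return base_lr * np.clip(l / (s + 1e-12), lo, hi)

# Toy run: gradient descent on f(x) = 0.5 * x**2, so grad = x.
x, hist = 5.0, []
for _ in range(50):
    g = x
    hist.append(abs(g))
    x -= zenith_like_lr(hist, base_lr=0.1) * g
print(abs(x))  # close to the optimum at 0
```

On this toy quadratic the norms decrease monotonically, the slow EMA stays above the fast one, and the schedule accelerates convergence relative to the fixed base LR, which is the qualitative behavior the abstract claims.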
Despite tremendous improvements in tasks such as image classification, object detection, and segmentation, the recognition of visual relationships, commonly modeled as the extraction of a graph from an image, remains a challenging task. We believe that this mainly stems from the fact that there is no canonical way to approach the visual graph recognition task. Most existing solutions are specific to a problem and cannot be transferred between different contexts out of the box, even though the conceptual problem remains the same. With broad applicability and simplicity in mind, in this paper we develop a method, \textbf{Gra}ph Recognition via \textbf{S}ubgraph \textbf{P}rediction (\textbf{GraSP}), for recognizing graphs in images. We show across several synthetic benchmarks and one real-world application that our method works with a set of diverse types of graphs and their drawings, and can be transferred between tasks without task-specific modifications, paving the way to a more unified framework for visual graph recognition.
https://arxiv.org/abs/2601.15133
Mixture-of-Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization, avoiding expert collapse. To analyze generalization, we compute Hessian-based sharpness metrics at convergence, including the largest eigenvalue and trace of the loss Hessian, evaluated on both training and test data. We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance. Complementary loss surface perturbation analyses reveal qualitative differences in non-local behavior under finite parameter perturbations between dense and MoE models, which help contextualize curvature-based measurements without directly explaining validation accuracy. We further evaluate empirical inference efficiency and show that naively implemented conditional routing does not yield inference speedups on modern hardware at this scale, highlighting the gap between theoretical and realized efficiency in sparse MoE models.
https://arxiv.org/abs/2601.15021
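The Hessian-based sharpness metrics in the MoE abstract above (largest eigenvalue of the loss Hessian) are typically estimated without ever forming the Hessian, via power iteration on Hessian-vector products. A minimal sketch on a toy quadratic loss whose answer is known; for a real model the `hvp` callable would come from an autodiff framework rather than an explicit matrix:

```python
import numpy as np

def largest_hessian_eig(hvp, dim, iters=100, seed=0):
    """Power iteration using only Hessian-vector products, the same
    primitive autodiff frameworks expose for large models."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)  # iterate converges to top eigenvector
    return v @ hvp(v)                # Rayleigh quotient = top eigenvalue

# Toy loss L(w) = 0.5 * w^T H w with known Hessian H = diag(3, 1).
H = np.diag([3.0, 1.0])
lam = largest_hessian_eig(lambda v: H @ v, dim=2)
print(lam)  # ≈ 3.0, the true largest eigenvalue
```

The Hessian trace reported alongside it is usually estimated with the same primitive via Hutchinson's stochastic trace estimator.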
Interpretability is significant in computational pathology, leading to the development of multimodal information integration from histopathological images and corresponding text reports. However, existing multimodal methods have limited interpretability due to the lack of high-quality datasets that support explicit reasoning and inference, and due to their simple reasoning strategies. To address the above problems, we introduce a novel multimodal pathology large language model with strong reasoning ability. To improve the generation of accurate and contextually relevant textual descriptions, we design a semantic reward strategy integrated with group relative policy optimization. We construct a high-quality pathology visual question answering (VQA) dataset, specifically designed to support complex reasoning tasks. Extensive experiments conducted on this dataset demonstrate that our method outperforms state-of-the-art methods, even when trained with only 20% of the training data. Our method also achieves comparable performance on the downstream zero-shot image classification task compared with CLIP.
https://arxiv.org/abs/2601.14757
High-altitude, multi-spectral aerial imagery is scarce and expensive to acquire, yet it is necessary for algorithmic advances and the application of machine learning models to high-impact problems such as wildfire detection. We introduce a human-annotated dataset from the NASA Autonomous Modular Sensor (AMS) using 12-channel, medium- to high-altitude (3-50 km) aerial wildfire images similar to those used in current US wildfire missions. Our dataset combines spectral data from 12 different channels, including infrared (IR), short-wave IR (SWIR), and thermal. We take imagery from 20 wildfire missions and randomly sample small patches to generate over 4000 images with high variability, including occlusions by smoke/clouds, easily confused false positives, and nighttime imagery. We demonstrate results from a deep-learning model to automate the human-intensive process of fire perimeter determination. We train two deep neural networks, one for image classification and the other for pixel-level segmentation. The networks are combined into a unique real-time segmentation model to efficiently localize active wildfire on an incoming image feed. Our model achieves 96% classification accuracy, 74% Intersection-over-Union (IoU), and 84% recall, surpassing past methods, including models trained on satellite data and classical color-rule algorithms. By leveraging a multi-spectral dataset, our model is able to detect active wildfire at nighttime and behind clouds, while distinguishing between false positives. We find that data from the SWIR, IR, and thermal bands is the most important for distinguishing fire perimeters. Our code and dataset can be found here: this https URL and this https URL
https://arxiv.org/abs/2601.14475
Language-aligned vision foundation models perform strongly across diverse downstream tasks. Yet their learned representations remain opaque, making their decision-making hard to interpret. Recent works decompose these representations into human-interpretable concepts, but provide poor spatial grounding and are limited to image classification tasks. In this work, we propose Insight, a language-aligned concept foundation model that provides fine-grained concepts, which are human-interpretable and spatially grounded in the input image. We leverage a hierarchical sparse autoencoder and a foundation model with strong semantic representations to automatically extract concepts at various granularities. Examining local co-occurrence dependencies of concepts allows us to define concept relationships. Through these relations we further improve concept naming and obtain richer explanations. On benchmark data, we show that Insight provides performance on classification and segmentation that is competitive with opaque foundation models while providing fine-grained, high-quality concept-based explanations. Code is available at this https URL.
https://arxiv.org/abs/2601.13798
Recent advancements in Large Vision-Language Models (LVLMs) have pushed them closer to becoming general-purpose assistants. Despite their strong performance, LVLMs still struggle with vision-centric tasks such as image classification, underperforming compared to their base vision encoders, which are often CLIP-based models. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a novel, model-agnostic framework which introduces vision-integration layers and a context-aware ensemble strategy to identify when to prioritize image representations or rely on the reasoning capabilities of the language model. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations, leading to consistent improvements in generalization across classification and vision-language benchmarks. Extensive experiments demonstrate that CARPE not only improves performance on image classification benchmarks but also enhances results across various vision-language benchmarks. Finally, CARPE is designed to be effectively integrated with most open-source LVLMs that consist of a vision encoder and a language model, ensuring its adaptability across diverse architectures.
https://arxiv.org/abs/2601.13622
Diffusion models have emerged as state-of-the-art generative methods for image synthesis, yet their potential as general-purpose feature encoders remains underexplored. Trained for denoising and generation without labels, they can be interpreted as self-supervised learners that capture both low- and high-level structure. We show that a frozen diffusion backbone enables strong fine-grained recognition by probing intermediate denoising features across layers and timesteps and training a linear classifier for each pair. We evaluate this in a real-world plankton-monitoring setting with practical impact, using controlled and comparable training setups against established supervised and self-supervised baselines. Frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods in both balanced and naturally long-tailed settings. Out-of-distribution evaluations on temporally and geographically shifted plankton datasets further show that frozen diffusion features maintain strong accuracy and Macro F1 under substantial distribution shift.
https://arxiv.org/abs/2601.13416
As a pivotal technique for improving the defense of deep models, adversarial robustness transfer via distillation has demonstrated remarkable success in conventional image classification tasks. However, this paradigm encounters critical challenges when applied to vision-language models (VLMs) (e.g., CLIP): constructing an adversarially robust teacher for large-scale multi-modal models demands prohibitively high computational resources. We bridge this gap by revealing an interesting phenomenon: vanilla CLIP (without adversarial training) exhibits intrinsic defensive capabilities against adversarial examples generated by another CLIP with a different architecture. We formally define this as proxy adversarial robustness, and naturally propose a Heterogeneous Proxy Transfer (HPT) framework that establishes cross-architecture robustness distillation channels between CLIP variants, effortlessly enabling VLM robustness transfer from proxy to target models. Yet such a proxy transfer paradigm easily induces severe overfitting, leading to a sharp degradation in zero-shot natural generalization. To resolve this, we design Generalization-Pivot Decoupling (GPD) by leveraging differences in learning rate scheduling. This decouples the proxy transfer process into a generalization-anchored warm-up that maintains generalization and a generalization-pulled HPT stage that promotes adversarial robustness, achieving an equilibrium between natural generalization and adversarial robustness. Extensive experiments on 15 zero-shot datasets demonstrate the effectiveness of our HPT-GPD method. The code is available at this http URL.
https://arxiv.org/abs/2601.12865
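The proxy adversarial robustness phenomenon underlying HPT in the abstract above rests on a simple experimental setup: craft adversarial examples on one model, then evaluate them on an architecturally different one. A minimal sketch of that transfer setup, with single-step FGSM as the stand-in attack and two random linear heads as stand-ins for the two CLIP variants (not the paper's actual distillation pipeline):

```python
import numpy as np

def fgsm(x, y, W, eps):
    """FGSM on a linear multi-class model with cross-entropy loss.
    For logits = W @ x, the input gradient of CE is
    W.T @ (softmax(logits) - onehot(y))."""
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[y] -= 1.0                       # softmax(logits) - onehot(y)
    return x + eps * np.sign(W.T @ p)

rng = np.random.default_rng(0)
W_proxy = rng.normal(size=(3, 5))    # stand-in "proxy" model head
W_target = rng.normal(size=(3, 5))   # different-architecture target head
x, y = rng.normal(size=5), 0

x_adv = fgsm(x, y, W_proxy, eps=0.25)   # attack crafted on the proxy ...
print((W_target @ x_adv).argmax())       # ... then evaluated on the target
```

Measuring how often such proxy-crafted attacks fail against the target is exactly the proxy adversarial robustness the paper quantifies before building the distillation channel on top of it.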
Explainable Artificial Intelligence (XAI) techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM), have become indispensable for visualizing the reasoning process of deep neural networks in medical image analysis. Despite their popularity, the faithfulness and reliability of these heatmap-based explanations remain under scrutiny. This study critically investigates whether Grad-CAM truly represents the internal decision-making of deep models trained for lung cancer image classification. Using the publicly available IQ-OTH/NCCD dataset, we evaluate five representative architectures: ResNet-50, ResNet-101, DenseNet-161, EfficientNet-B0, and ViT-Base-Patch16-224, to explore model-dependent variations in Grad-CAM interpretability. We introduce a quantitative evaluation framework that combines localization accuracy, perturbation-based faithfulness, and explanation consistency to assess Grad-CAM reliability across architectures. Experimental findings reveal that while Grad-CAM effectively highlights salient tumor regions in most convolutional networks, its interpretive fidelity significantly degrades for Vision Transformer models due to non-local attention behavior. Furthermore, cross-model comparisons indicate substantial variability in saliency localization, implying that Grad-CAM explanations may not always correspond to the true diagnostic evidence used by the networks. This work exposes critical limitations of current saliency-based XAI approaches in medical imaging and emphasizes the need for model-aware interpretability methods that are both computationally sound and clinically meaningful. Our findings aim to inspire a more cautious and rigorous adoption of visual explanation tools in medical AI, urging the community to rethink what it truly means to "trust" a model's explanation.
https://arxiv.org/abs/2601.12826
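The perturbation-based faithfulness component of the evaluation framework in the Grad-CAM abstract above can be sketched with a deletion-style metric (one common choice; the abstract does not specify the exact metric used): mask the inputs the heatmap ranks highest and measure how much the class confidence drops. A faithful explanation should cause a sharp drop; an unfaithful one should not. All models and saliency maps below are toy stand-ins.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def deletion_faithfulness(model, x, saliency, y, k):
    """Mask the k inputs the heatmap ranks highest; return the drop
    in class-y confidence. Larger drop = more faithful heatmap."""
    x_masked = x.copy()
    x_masked[np.argsort(saliency)[-k:]] = 0.0
    return softmax(model(x))[y] - softmax(model(x_masked))[y]

# Toy linear "model": class 0 depends entirely on feature 0.
w = np.array([10.0, 0.0, 0.0, 0.0, 0.0])
model = lambda x: np.array([w @ x, 0.0])      # logits for classes {0, 1}
x = np.ones(5)
faithful = np.array([1.0, 0.1, 0.1, 0.1, 0.1])    # highlights feature 0
unfaithful = np.array([0.1, 1.0, 0.1, 0.1, 0.1])  # highlights feature 1
d1 = deletion_faithfulness(model, x, faithful, y=0, k=1)
d2 = deletion_faithfulness(model, x, unfaithful, y=0, k=1)
print(d1, d2)  # large drop for the faithful map, ~zero for the other
```

Averaging such drops over a test set, per architecture, is what lets the study compare Grad-CAM fidelity across CNNs and the Vision Transformer quantitatively.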
Few-shot learning in remote sensing remains challenging due to three factors: the scarcity of labeled data, substantial domain shifts, and the multi-scale nature of geospatial objects. To address these issues, we introduce Adaptive Multi-Scale Correlation Meta-Network (AMC-MetaNet), a lightweight yet powerful framework with three key innovations: (i) correlation-guided feature pyramids for capturing scale-invariant patterns, (ii) an adaptive channel correlation module (ACCM) for learning dynamic cross-scale relationships, and (iii) correlation-guided meta-learning that leverages correlation patterns instead of conventional prototype averaging. Unlike prior approaches that rely on heavy pre-trained models or transformers, AMC-MetaNet is trained from scratch with only $\sim600K$ parameters, offering $20\times$ fewer parameters than ResNet-18 while maintaining high efficiency ($<50$ms per image inference). AMC-MetaNet achieves up to 86.65\% accuracy in 5-way 5-shot classification on various remote sensing datasets, including EuroSAT, NWPU-RESISC45, UC Merced Land Use, and AID. Our results establish AMC-MetaNet as a computationally efficient, scale-aware framework for real-world few-shot remote sensing.
https://arxiv.org/abs/2601.12308
Deep learning has achieved remarkable success in image recognition, yet the inherent opacity of deep models poses challenges for deployment in critical domains. Concept-based interpretations aim to address this by explaining model reasoning through human-understandable concepts. However, existing post-hoc methods and ante-hoc concept bottleneck models (CBMs) suffer from limitations such as unreliable concept relevance, non-visual or labor-intensive concept definitions, and model- or data-agnostic assumptions. This paper introduces the Post-hoc Concept Bottleneck Model via Representation Decomposition (PCBM-ReD), a novel pipeline that retrofits interpretability onto pretrained opaque models. PCBM-ReD automatically extracts visual concepts from a pre-trained encoder, employs multimodal large language models (MLLMs) to label and filter concepts based on visual identifiability and task relevance, and selects an independent subset via reconstruction-guided optimization. Leveraging CLIP's visual-text alignment, it decomposes image representations into linear combinations of concept embeddings to fit the CBM abstraction. Extensive experiments across 11 image classification tasks show that PCBM-ReD achieves state-of-the-art accuracy, narrows the performance gap with end-to-end models, and exhibits better interpretability.
https://arxiv.org/abs/2601.12303
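The core decomposition step in the PCBM-ReD abstract above, writing an image representation as a linear combination of concept embeddings, can be sketched with ordinary least squares (one plausible realization; the paper's actual fitting procedure may differ). The recovered weights are then read as per-concept contributions:

```python
import numpy as np

def decompose(image_emb, concept_embs):
    """Solve image_emb ≈ concept_embs.T @ a by least squares;
    a[i] is read as the contribution of concept i.
    Returns the weights and the reconstruction error."""
    a, *_ = np.linalg.lstsq(concept_embs.T, image_emb, rcond=None)
    recon = concept_embs.T @ a
    return a, np.linalg.norm(image_emb - recon)

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 32))   # 6 stand-in concept embeddings (CLIP-like)
v = 2.0 * C[0] + 0.5 * C[3]    # an image embedding lying in their span
a, err = decompose(v, C)
print(np.round(a, 2), err)     # weights concentrate on concepts 0 and 3
```

When the image embedding is not exactly in the concept span, the residual `err` measures how much of the representation the chosen concept subset fails to explain, which is what the reconstruction-guided subset selection in the abstract optimizes.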
Convolutional neural networks (CNNs) have achieved state-of-the-art performance in image recognition tasks but often involve complex architectures that may overfit on small datasets. In this study, we evaluate a compact CNN across five publicly available, real-world image datasets from Bangladesh, including urban encroachment, vehicle detection, road damage, and agricultural crops. The network demonstrates high classification accuracy, efficient convergence, and low computational overhead. Quantitative metrics and saliency analyses indicate that the model effectively captures discriminative features and generalizes robustly across diverse scenarios, highlighting the suitability of streamlined CNN architectures for small-class image classification tasks.
https://arxiv.org/abs/2601.11911
Deep learning has significantly advanced image analysis across diverse domains but often depends on large, annotated datasets for success. Transfer learning addresses this challenge by utilizing pre-trained models to tackle new tasks with limited labeled data. However, discrepancies between source and target domains can hinder effective transfer learning. We introduce BioTune, a novel adaptive fine-tuning technique utilizing evolutionary optimization. BioTune enhances transfer learning by optimally choosing which layers to freeze and adjusting learning rates for unfrozen layers. Through extensive evaluation on nine image classification datasets, spanning natural and specialized domains such as medical imaging, BioTune demonstrates superior accuracy and efficiency over state-of-the-art fine-tuning methods, including AutoRGN and LoRA, highlighting its adaptability to various data characteristics and distribution changes. Additionally, BioTune consistently achieves top performance across four different CNN architectures, underscoring its flexibility. Ablation studies provide valuable insights into the impact of BioTune's key components on overall performance. The source code is available at this https URL.
https://arxiv.org/abs/2601.11235
Most pseudo-label selection strategies in semi-supervised learning rely on fixed confidence thresholds, implicitly assuming that prediction confidence reliably indicates correctness. In practice, deep networks are often overconfident: high-confidence predictions can still be wrong, while informative low-confidence samples near decision boundaries are discarded. This paper introduces a Confidence-Variance (CoVar) theory framework that provides a principled joint reliability criterion for pseudo-label selection. Starting from the entropy minimization principle, we derive a reliability measure that combines maximum confidence (MC) with residual-class variance (RCV), which characterizes how probability mass is distributed over non-maximum classes. The derivation shows that reliable pseudo-labels should have both high MC and low RCV, and that the influence of RCV increases as confidence grows, thereby correcting overconfident but unstable predictions. From this perspective, we cast pseudo-label selection as a spectral relaxation problem that maximizes separability in a confidence-variance feature space, and design a threshold-free selection mechanism to distinguish high- from low-reliability predictions. We integrate CoVar as a plug-in module into representative semi-supervised semantic segmentation and image classification methods. Across PASCAL VOC 2012, Cityscapes, CIFAR-10, and Mini-ImageNet with varying label ratios and backbones, it consistently improves over strong baselines, indicating that combining confidence with residual-class variance provides a more reliable basis for pseudo-label selection than fixed confidence thresholds. (Code: this https URL)
https://arxiv.org/abs/2601.11670
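The two quantities the CoVar abstract above combines, maximum confidence (MC) and residual-class variance (RCV, the variance of the probability mass over the non-maximum classes), are easy to compute from a softmax output. A minimal sketch (the paper's spectral, threshold-free selection rule on top of them is not reproduced here); the two example predictions share the same confidence but differ sharply in RCV:

```python
import numpy as np

def mc_rcv(probs):
    """Maximum confidence and residual-class variance of one softmax
    output. Low RCV means the non-max mass is spread evenly, which
    the CoVar criterion treats as the more reliable case."""
    k = probs.argmax()
    residual = np.delete(probs, k)   # probabilities of non-max classes
    return probs[k], residual.var()

stable = np.array([0.90, 0.025, 0.025, 0.025, 0.025])   # even residual
spiky  = np.array([0.90, 0.098, 0.0007, 0.0007, 0.0006])  # runner-up spike
print(mc_rcv(stable), mc_rcv(spiky))  # same MC, very different RCV
```

A fixed 0.9 confidence threshold would accept both pseudo-labels; under the CoVar criterion only the first (high MC, low RCV) counts as reliable, which is exactly the overconfident-but-unstable case the abstract says plain confidence misses.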
Current progress in out-of-distribution (OOD) detection is limited by the lack of large, high-quality datasets with clearly defined OOD categories across varying difficulty levels (near- to far-OOD) that support both fine- and coarse-grained computer vision tasks. To address this limitation, we introduce ICONIC-444 (Image Classification and OOD Detection with Numerous Intricate Complexities), a specialized large-scale industrial image dataset containing over 3.1 million RGB images spanning 444 classes tailored for OOD detection research. Captured with a prototype industrial sorting machine, ICONIC-444 closely mimics real-world tasks. It complements existing datasets by offering structured, diverse data suited for rigorous OOD evaluation across a spectrum of task complexities. We define four reference tasks within ICONIC-444 to benchmark and advance OOD detection research and provide baseline results for 22 state-of-the-art post-hoc OOD detection methods.
https://arxiv.org/abs/2601.10802
In this paper, we propose difficulty-guided sampling (DGS) to bridge the target gap between the distillation objective and the downstream task, thereby improving the performance of dataset distillation. Deep neural networks achieve remarkable performance but require time- and storage-consuming training. Dataset distillation was proposed to generate compact, high-quality distilled datasets, enabling effective model training while maintaining downstream performance. Existing approaches typically focus on features extracted from the original dataset, overlooking task-specific information, which leads to a target gap between the distillation objective and the downstream task. We propose incorporating characteristics that benefit downstream training into dataset distillation to bridge this gap. Focusing on the downstream task of image classification, we introduce the concept of difficulty and propose DGS as a plug-in post-stage sampling module. Following a specific target difficulty distribution, the final distilled dataset is sampled from image pools generated by existing methods. We also propose difficulty-aware guidance (DAG) to explore the effect of difficulty in the generation process. Extensive experiments across multiple settings demonstrate the effectiveness of the proposed methods and highlight the broader potential of difficulty for diverse downstream tasks.
https://arxiv.org/abs/2601.10090
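The post-stage sampling step in the DGS abstract above can be sketched as stratified sampling from the image pool so the distilled set matches a target difficulty histogram. The binning scheme and difficulty scores below are illustrative assumptions; the paper defines difficulty concretely for image classification.

```python
import numpy as np

def dgs_sample(difficulty, bin_edges, target_counts, seed=0):
    """Post-stage sampling: draw from each difficulty bin of the image
    pool so the distilled set follows the target difficulty histogram."""
    rng = np.random.default_rng(seed)
    bins = np.digitize(difficulty, bin_edges)   # bin index per pool image
    chosen = []
    for b, n in enumerate(target_counts):
        idx = np.flatnonzero(bins == b)         # pool images in bin b
        chosen.extend(rng.choice(idx, size=n, replace=False))
    return np.array(chosen)

rng = np.random.default_rng(1)
pool_difficulty = rng.uniform(0, 1, 500)   # stand-in difficulty scores
edges = [1 / 3, 2 / 3]                      # easy / medium / hard bins
sel = dgs_sample(pool_difficulty, edges, target_counts=[10, 20, 10])
print(len(sel))  # 40 distilled images with the requested difficulty mix
```

Because the module only resamples an already-generated pool, it plugs in after any existing distillation method, which is what makes DGS a post-stage, method-agnostic add-on.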