We propose to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well for both generalized category discovery (GCD) and image clustering. Despite the overlap in motivation between these tasks, existing methods are restricted to a single task: GCD methods rely on the labeled portion of the data, and deep image clustering methods have no built-in way to leverage labels efficiently. We connect the two regimes with an approach that Utilizes Neighbor Information for Classification (UNIC) in both the unsupervised (clustering) and the semi-supervised (GCD) setting. State-of-the-art clustering methods already rely heavily on nearest neighbors. We improve on their results substantially in two ways: first, with a sampling and cleaning strategy that identifies accurate positive and negative neighbors, and second, by fine-tuning the backbone with clustering losses computed over both types of sampled neighbors. We then adapt this pipeline to GCD by using the labeled images as ground-truth neighbors. Our method yields state-of-the-art results for both clustering (+3% on ImageNet-100 and ImageNet-200) and GCD (+0.8% on ImageNet-100, +5% on CUB, +2% on SCars, +4% on Aircraft).
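As a rough illustration of the neighbor-based idea above (a sketch of the general recipe, not the authors' UNIC implementation), the snippet below mines positive and negative neighbors from backbone features by cosine similarity and builds a simple clustering loss over soft cluster assignments; the neighbor counts and the loss form are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mine_neighbors(features, k_pos=5, k_neg=50):
    """Pick each sample's k_pos most similar samples as positive neighbors and its
    k_neg least similar samples as negatives, by cosine similarity of backbone
    features (illustrative counts; the paper's cleaning strategy is more involved)."""
    z = F.normalize(features, dim=1)
    sims = z @ z.T
    self_mask = torch.eye(len(features), dtype=torch.bool)
    pos_idx = sims.masked_fill(self_mask, float("-inf")).topk(k_pos, dim=1).indices
    neg_idx = sims.masked_fill(self_mask, float("inf")).topk(k_neg, dim=1, largest=False).indices
    return pos_idx, neg_idx

def neighbor_clustering_loss(logits, pos_idx, neg_idx, eps=1e-8):
    """Encourage matching soft cluster assignments for positive neighbors and
    mismatched assignments for negative neighbors."""
    p = F.softmax(logits, dim=1)                        # soft cluster assignments
    pos_agree = (p.unsqueeze(1) * p[pos_idx]).sum(-1)   # (N, k_pos) dot products
    neg_agree = (p.unsqueeze(1) * p[neg_idx]).sum(-1)   # (N, k_neg) dot products
    return (-torch.log(pos_agree + eps).mean()
            - torch.log(1.0 - neg_agree + eps).mean())

# Toy usage: 128 samples with 256-d backbone features and 10 clusters.
feats = torch.randn(128, 256)
logits = torch.randn(128, 10, requires_grad=True)
pos_idx, neg_idx = mine_neighbors(feats)
loss = neighbor_clustering_loss(logits, pos_idx, neg_idx)
loss.backward()
```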
https://arxiv.org/abs/2503.14500
Numerous maritime applications rely on the ability to recognize acoustic targets using passive sonar. While there is a growing reliance on pre-trained models for classification tasks, these models often require extensive computational resources and may not perform optimally when transferred to new domains due to dataset variations. To address these challenges, this work adapts the neural edge histogram descriptors (NEHD) method, originally developed for image classification, to classify passive sonar signals. We conduct a comprehensive evaluation of statistical and structural texture features, demonstrating that their combination achieves competitive performance with large pre-trained models. The proposed NEHD-based approach offers a lightweight and efficient solution for underwater target recognition, significantly reducing computational costs while maintaining accuracy.
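For intuition, here is a hand-crafted edge-orientation histogram of a spectrogram, a classical stand-in for the structural texture features that NEHD learns; the bin count and magnitude threshold are arbitrary choices for illustration.

```python
import numpy as np

def edge_histogram(spectrogram, n_bins=8, mag_thresh=1e-3):
    """Classical edge-orientation histogram of a (time x frequency) spectrogram.
    This is only a hand-crafted analogue of the learned NEHD features; the bin
    count and threshold are illustrative choices."""
    gy, gx = np.gradient(spectrogram.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                     # edge orientation in [-pi, pi]
    mask = mag > mag_thresh                      # ignore near-flat regions
    hist, _ = np.histogram(ang[mask], bins=n_bins, range=(-np.pi, np.pi),
                           weights=mag[mask])
    total = hist.sum()
    return hist / total if total > 0 else hist   # normalized structural texture feature

# Toy usage: a random "spectrogram" stands in for a passive sonar recording.
spec = np.random.rand(128, 64)
feature = edge_histogram(spec)
print(feature.shape)  # (8,) lightweight descriptor fed to a small classifier
```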
https://arxiv.org/abs/2503.13763
With the rise of neural networks, especially in high-stakes applications, these networks need two properties to ensure their safety: (i) robustness and (ii) interpretability. Recent advances in classifiers with 3D volumetric object representations have demonstrated greatly enhanced robustness on out-of-distribution data. However, these 3D-aware classifiers have not been studied from the perspective of interpretability. We introduce CAVE (Concept Aware Volumes for Explanations), a new direction that unifies interpretability and robustness in image classification. We design an inherently interpretable and robust classifier by extending existing 3D-aware classifiers with concepts extracted from their volumetric representations for classification. Across an array of quantitative interpretability metrics, we compare against concept-based approaches from the explainable AI literature and show that CAVE discovers well-grounded concepts that are used consistently across images, while achieving superior robustness.
https://arxiv.org/abs/2503.13429
When a human undertakes a test, their responses likely follow a pattern: if they answered an easy question $(2 \times 3)$ incorrectly, they would likely answer a more difficult one $(2 \times 3 \times 4)$ incorrectly; and if they answered a difficult question correctly, they would likely answer the easy one correctly. Anything else hints at memorization. Do current visual recognition models exhibit a similarly structured learning capacity? In this work, we consider the task of image classification and study whether those models' responses follow that pattern. Since real images aren't labeled with difficulty, we first create a dataset of 100 categories, 10 attributes, and 3 difficulty levels using recent generative models: for each category (e.g., dog) and attribute (e.g., occlusion), we generate images of increasing difficulty (e.g., a dog without occlusion, a dog only partly visible). We find that most of the models do in fact behave consistently with the aforementioned pattern around 80-90% of the time. Using this property, we then explore a new way to evaluate those models. Instead of testing the model on every possible test image, we create an adaptive test akin to the GRE, in which the model's performance on the current round of images determines the test images in the next round. This allows the model to skip over questions too easy or too hard for it, and helps us estimate its overall performance in fewer steps.
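A toy version of such an adaptive evaluation loop appears below; the model is simulated by a per-difficulty accuracy, and the promotion/demotion thresholds are invented for illustration, since the paper's exact protocol is not spelled out here.

```python
import random

def adaptive_test(model_accuracy_by_level, n_rounds=5, items_per_round=10):
    """Toy adaptive test in the spirit described above: the score on the current
    round decides whether the next round gets harder or easier.  Here the model
    is simulated by a per-difficulty accuracy; the real evaluation would query
    an actual classifier on generated images."""
    level, history = 1, []                      # difficulty levels 0 (easy) .. 2 (hard)
    for _ in range(n_rounds):
        correct = sum(random.random() < model_accuracy_by_level[level]
                      for _ in range(items_per_round))
        history.append((level, correct / items_per_round))
        if correct >= 0.7 * items_per_round and level < 2:
            level += 1                          # doing well: skip easier questions
        elif correct <= 0.3 * items_per_round and level > 0:
            level -= 1                          # struggling: back off to easier ones
    return history

print(adaptive_test({0: 0.95, 1: 0.80, 2: 0.55}))
```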
https://arxiv.org/abs/2503.13058
This study explores the impact of scaling semantic categories on the image classification performance of vision transformers (ViTs). Experiments use the CLIP server provided by Jina AI. The research hypothesizes that as the number of ground-truth and artificially introduced semantically equivalent categories increases, the labeling accuracy of ViTs improves until a theoretical maximum or limit is reached. A wide variety of image datasets were chosen to test this hypothesis. These datasets were processed through a custom Python function designed to evaluate the model's accuracy, with adjustments made to account for format differences between datasets. By exponentially introducing new redundant categories, the experiment tracked accuracy trends until they plateaued, decreased, or fluctuated inconsistently. The findings show that while semantic scaling initially increases model performance, the benefits diminish or reverse after a critical threshold is surpassed, providing insight into the limitations and possible optimization of category labeling strategies for ViTs.
https://arxiv.org/abs/2503.12617
Malicious users attempt to functionally replicate commercial models at low cost by training a clone model on query responses. Preventing such model-stealing attacks in a timely manner, while providing strong protection and maintaining utility, is challenging. In this paper, we propose a novel non-parametric detector called Account-aware Distribution Discrepancy (ADD) that recognizes queries from malicious users by leveraging account-wise local dependency. We model each class as a multivariate normal distribution (MVN) in the feature space and measure the malicious score as the sum of weighted class-wise distribution discrepancies. The ADD detector is combined with random-based prediction poisoning to yield a plug-and-play defense module named D-ADD for image classification models. Extensive experiments show that D-ADD achieves strong defense against different types of attacks with little interference in serving benign users, in both soft- and hard-label settings.
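The sketch below shows one plausible instantiation of such an account-level score (not necessarily the paper's exact formula): per-class MVNs are fitted to benign features, and an account's queries are scored by how far their class-conditional Mahalanobis statistics deviate from what the MVNs predict, weighted by class usage.

```python
import numpy as np

def fit_class_mvns(features, labels):
    """Fit a multivariate normal (mean, covariance) per class in feature space."""
    mvns = {}
    for c in np.unique(labels):
        x = features[labels == c]
        mvns[c] = (x.mean(0), np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1]))
    return mvns

def add_score(query_feats, query_preds, mvns):
    """Account-level malicious score as a weighted sum of class-wise distribution
    discrepancies.  Discrepancy here is the gap between the account's average
    squared Mahalanobis distance and its expectation (= feature dimension) under
    the class MVN; one plausible instantiation, not the paper's exact formula."""
    d = query_feats.shape[1]
    score = 0.0
    for c, (mu, cov) in mvns.items():
        q = query_feats[query_preds == c]
        if len(q) == 0:
            continue
        inv = np.linalg.inv(cov)
        diff = q - mu
        maha2 = np.einsum("ij,jk,ik->i", diff, inv, diff)
        weight = len(q) / len(query_feats)               # class usage by this account
        score += weight * abs(maha2.mean() - d)
    return score

# Toy usage with random benign features and a suspicious (shifted) account.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 16))
labels = rng.integers(0, 10, size=1000)
mvns = fit_class_mvns(feats, labels)
benign = add_score(rng.normal(size=(50, 16)), rng.integers(0, 10, 50), mvns)
attack = add_score(rng.normal(loc=3.0, size=(50, 16)), rng.integers(0, 10, 50), mvns)
print(benign, attack)   # the shifted queries should score noticeably higher
```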
https://arxiv.org/abs/2503.12497
Previous works studied how deep neural networks (DNNs) perceive image content in terms of their biases towards different image cues, such as texture and shape. Previous methods for measuring shape and texture bias are typically style-transfer-based and limited to DNNs for image classification. In this work, we provide a new evaluation procedure consisting of 1) a cue-decomposition method comprising two AI-free data pre-processing methods that extract shape and texture cues, respectively, and 2) a novel cue-decomposition shape bias evaluation metric that leverages the cue-decomposition data. For application purposes we introduce a corresponding cue-decomposition robustness metric that allows for estimating the robustness of a DNN w.r.t. image corruptions. In our numerical experiments, our findings for biases in image classification DNNs align with those of previous evaluation metrics. However, our cue-decomposition robustness metric shows superior results in estimating the robustness of DNNs. Furthermore, our results for DNNs on the semantic segmentation datasets Cityscapes and ADE20k shed light, for the first time, on the biases of semantic segmentation DNNs.
https://arxiv.org/abs/2503.12453
Whole Slide Image (WSI) classification poses unique challenges due to the vast image size and numerous non-informative regions, which introduce noise and cause data imbalance during feature aggregation. To address these issues, we propose MExD, an Expert-Infused Diffusion Model that combines the strengths of a Mixture-of-Experts (MoE) mechanism with a diffusion model for enhanced classification. MExD balances patch feature distribution through a novel MoE-based aggregator that selectively emphasizes relevant information, effectively filtering noise, addressing data imbalance, and extracting essential features. These features are then integrated via a diffusion-based generative process to directly yield the class distribution for the WSI. Moving beyond conventional discriminative approaches, MExD represents the first generative strategy in WSI classification, capturing fine-grained details for robust and precise results. MExD is validated on three widely used benchmarks (Camelyon16, TCGA-NSCLC, and BRACS), consistently achieving state-of-the-art performance in both binary and multi-class tasks.
https://arxiv.org/abs/2503.12401
Contrastive Language-Image Pretraining (CLIP) has shown impressive zero-shot performance on image classification. However, state-of-the-art methods often rely on fine-tuning techniques like prompt learning and adapter-based tuning to optimize CLIP's performance. The necessity for fine-tuning significantly limits CLIP's adaptability to novel datasets and domains, demanding substantial time and computational resources for each new dataset. To overcome this limitation, we introduce simple yet effective training-free approaches, Single-stage LMM Augmented CLIP (SLAC) and Two-stage LMM Augmented CLIP (TLAC), that leverage powerful Large Multimodal Models (LMMs), such as Gemini, for image classification. The proposed methods leverage the capabilities of pre-trained LMMs, allowing for seamless adaptation to diverse datasets and domains without the need for additional training. Our approaches involve prompting the LMM to identify objects within an image. Subsequently, the CLIP text encoder determines the image class by identifying the dataset class with the highest semantic similarity to the LMM-predicted object. We evaluated our models on 11 base-to-novel datasets and they achieved superior accuracy on 9 of these, including benchmarks like ImageNet, SUN397 and Caltech101, while maintaining a strictly training-free paradigm. Our overall accuracy of 83.44% surpasses the previous state-of-the-art few-shot methods by a margin of 6.75%. Our method achieved 83.6% average accuracy across 13 datasets, a 9.7% improvement over the previous 73.9% state-of-the-art for training-free approaches. Our method also improves domain generalization, with gains of 3.6% on ImageNetV2, 16.96% on ImageNet-S, and 12.59% on ImageNet-R over prior few-shot methods.
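A minimal sketch of this two-step recipe is given below: a stubbed-out LMM call stands in for the actual prompting of a model such as Gemini (the function and its output here are purely hypothetical), and the Hugging Face CLIP text encoder matches the returned object name to the dataset's class names by cosine similarity.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

def describe_with_lmm(image_path):
    """Placeholder for the LMM call (e.g., prompting Gemini to name the main
    object in the image); hypothetical stub returning a free-form object name."""
    return "a golden retriever"

def classify(image_path, class_names, clip_name="openai/clip-vit-base-patch32"):
    model = CLIPModel.from_pretrained(clip_name)
    tokenizer = CLIPTokenizer.from_pretrained(clip_name)

    obj = describe_with_lmm(image_path)              # step 1: LMM names the object
    texts = [obj] + [f"a photo of a {c}" for c in class_names]
    tokens = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**tokens)      # step 2: CLIP text embeddings
    emb = emb / emb.norm(dim=-1, keepdim=True)
    sims = emb[0] @ emb[1:].T                        # object vs. class-name similarity
    return class_names[int(sims.argmax())]

print(classify("dog.jpg", ["dog", "cat", "car", "airplane"]))
```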
https://arxiv.org/abs/2503.12206
In the emerging field of goal-oriented communications, the focus has shifted from reconstructing data to directly performing specific learning tasks, such as classification, segmentation, or pattern recognition, on the received coded data. In the commonly studied scenario of classification from compressed images, a key objective is to enable learning directly on entropy-coded data, thereby bypassing the computationally intensive step of data reconstruction. Conventional entropy-coding methods, such as Huffman and Arithmetic coding, are effective for compression but disrupt the data structure, making them less suitable for direct learning without decoding. This paper investigates the use of low-density parity-check (LDPC) codes -- originally designed for channel coding -- as an alternative entropy-coding approach. It is hypothesized that the structured nature of LDPC codes can be leveraged more effectively by deep learning models for tasks like classification. At the receiver side, gated recurrent unit (GRU) models are trained to perform image classification directly on LDPC-coded data. Experiments on datasets like MNIST, Fashion-MNIST, and CIFAR show that LDPC codes outperform Huffman and Arithmetic coding in classification tasks, while requiring significantly smaller learning models. Furthermore, the paper analyzes why LDPC codes preserve data structure more effectively than traditional entropy-coding techniques and explores the impact of key code parameters on classification performance. These results suggest that LDPC-based entropy coding offers an optimal balance between learning efficiency and model complexity, eliminating the need for prior decoding.
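As a rough sketch of the receiver side, the snippet below shows the shape of a GRU classifier that consumes an LDPC-coded bit stream directly, with no decoding step; the chunking of the codeword into a sequence and all sizes are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class CodedBitClassifier(nn.Module):
    """GRU classifier operating directly on a coded bit stream (no decoding).
    The bit stream is reshaped into fixed-size chunks treated as a sequence;
    chunk size and hidden size are illustrative choices."""
    def __init__(self, chunk=64, hidden=128, n_classes=10):
        super().__init__()
        self.chunk = chunk
        self.gru = nn.GRU(input_size=chunk, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, bits):                       # bits: (batch, code_length)
        b, n = bits.shape
        seq = bits[:, : n - n % self.chunk].reshape(b, -1, self.chunk)
        _, h = self.gru(seq.float())               # h: (1, batch, hidden)
        return self.head(h.squeeze(0))

# Toy usage: stand-in for LDPC-coded images (e.g., ~6272 coded bits at rate 1/2).
coded = torch.randint(0, 2, (8, 6272))
logits = CodedBitClassifier()(coded)
print(logits.shape)                                # (8, 10)
```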
https://arxiv.org/abs/2503.11954
Accurate and reliable image classification is crucial in radiology, where diagnostic decisions significantly impact patient outcomes. Conventional deep learning models tend to produce overconfident predictions despite underlying uncertainties, potentially leading to misdiagnoses. Attention mechanisms have emerged as powerful tools in deep learning, enabling models to focus on relevant parts of the input data. Combined with feature fusion, they can be effective in addressing uncertainty challenges. Cross-attention has become increasingly important in medical image analysis for capturing dependencies across features and modalities. This paper proposes a novel dual cross-attention fusion model for medical image analysis that addresses key challenges in feature integration and interpretability. Our approach introduces a bidirectional cross-attention mechanism with refined channel and spatial attention that dynamically fuses feature maps from EfficientNetB4 and ResNet34, leveraging multi-network contextual dependencies. The features refined through channel and spatial attention highlight discriminative patterns crucial for accurate classification. The proposed model achieved AUC values of 99.75%, 100%, 99.93% and 98.69%, and AUPR values of 99.81%, 100%, 99.97% and 96.36%, on COVID-19, tuberculosis, and pneumonia chest X-ray images and retinal OCT images, respectively. Entropy values and several highly uncertain samples provide interpretable visualizations of the model's behavior, enhancing transparency. By combining multi-scale feature extraction, bidirectional attention and uncertainty estimation, the proposed model strongly impacts medical image analysis.
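The core fusion step can be sketched as follows; this is a stripped-down version assuming the two backbones' usual output channel counts (1792 for EfficientNetB4, 512 for ResNet34) and omitting the paper's refined channel and spatial attention, so the dimensions and pooling choice are illustrative.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttentionFusion(nn.Module):
    """Minimal sketch of bidirectional cross-attention between two backbones'
    feature maps; dims for the EfficientNetB4/ResNet34 outputs are indicative
    only, and the refined channel/spatial attention is omitted."""
    def __init__(self, dim_a=1792, dim_b=512, dim=256, heads=4, n_classes=4):
        super().__init__()
        self.proj_a = nn.Conv2d(dim_a, dim, kernel_size=1)
        self.proj_b = nn.Conv2d(dim_b, dim, kernel_size=1)
        self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, feat_a, feat_b):             # (B, C, H, W) maps from each backbone
        a = self.proj_a(feat_a).flatten(2).transpose(1, 2)   # (B, Ha*Wa, dim)
        b = self.proj_b(feat_b).flatten(2).transpose(1, 2)   # (B, Hb*Wb, dim)
        a2b, _ = self.attn_ab(query=a, key=b, value=b)       # A attends to B
        b2a, _ = self.attn_ba(query=b, key=a, value=a)       # B attends to A
        fused = torch.cat([a2b.mean(1), b2a.mean(1)], dim=-1)
        return self.head(fused)

# Toy usage with feature maps of the typical backbone output sizes.
model = BidirectionalCrossAttentionFusion()
out = model(torch.randn(2, 1792, 7, 7), torch.randn(2, 512, 7, 7))
print(out.shape)                                    # (2, 4) class logits
```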
https://arxiv.org/abs/2503.11851
Language-guided attention frameworks have significantly enhanced both interpretability and performance in image classification; however, the reliance on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps frequently overlooks the intrinsic multivaluedness and ill-posed characteristics of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align textual and visual modalities more effectively while incorporating uncertainty estimates, as compared to their deterministic counterparts. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.
https://arxiv.org/abs/2503.11360
This paper considers open-set recognition (OSR) of plankton images. Plankton comprise a diverse range of microscopic aquatic organisms that play an important role in marine ecosystems as primary producers and as a base of food webs. Given their sensitivity to environmental changes, fluctuations in plankton populations offer valuable information about ocean health and climate change, motivating their monitoring. Modern automatic plankton imaging devices enable the collection of large-scale plankton image datasets, facilitating species-level analysis. Plankton species recognition can be seen as an image classification task and is typically solved using deep learning-based image recognition models. However, data collection in real aquatic environments results in imaging devices capturing a variety of non-plankton particles and plankton species not present in the training set. This creates a challenging fine-grained OSR problem, characterized by subtle differences between taxonomically close plankton species. We address this challenge by conducting extensive experiments on three OSR approaches using both phyto- and zooplankton images, and also analyze the effect of rejection thresholds on OSR. The results demonstrate that high OSR accuracy can be obtained, supporting the use of these methods in operational plankton research. We have made the data publicly available to the research community.
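The role of a rejection threshold is easy to see in the simplest OSR baseline, maximum-softmax-probability thresholding, sketched below; the paper evaluates three OSR approaches not spelled out here, and the 0.7 threshold is an arbitrary example.

```python
import numpy as np

def osr_predict(logits, threshold=0.7):
    """Maximum-softmax-probability open-set recognition: predict the closed-set
    class when the model is confident, otherwise reject as unknown (-1).
    The threshold trades off known-class accuracy against unknown rejection,
    which is the effect analyzed in the paper; 0.7 is an arbitrary example."""
    z = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    return np.where(conf >= threshold, pred, -1)

# Toy usage: three confident predictions and one ambiguous sample to be rejected.
logits = np.array([[8.0, 1.0, 0.5], [0.2, 6.5, 0.1], [0.4, 0.3, 0.5], [1.0, 0.9, 7.0]])
print(osr_predict(logits))   # e.g. [ 0  1 -1  2 ]
```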
https://arxiv.org/abs/2503.11318
This paper introduces a holistic vision-language foundation model tailored for remote sensing, named Falcon. Falcon offers a unified, prompt-based paradigm that effectively executes comprehensive and complex remote sensing tasks, demonstrating powerful understanding and reasoning abilities at the image, region, and pixel levels. Specifically, given simple natural language instructions and remote sensing images, Falcon can produce impressive results in text form across 14 distinct tasks, e.g., image classification, object detection, segmentation, and image captioning. To facilitate Falcon's training and empower its representation capacity to encode rich spatial and semantic information, we developed Falcon_SFT, a large-scale, multi-task, instruction-tuning dataset in the field of remote sensing. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million remote sensing images of multiple spatial resolutions and views with diverse instructions. It features hierarchical annotations and undergoes manual sampling verification to ensure high data quality and reliability. Extensive comparative experiments verify that Falcon achieves remarkable performance across 67 datasets and 14 tasks, despite having only 0.7B parameters. We release the complete dataset, code, and model weights at this https URL, hoping to help further develop the open-source community.
https://arxiv.org/abs/2503.11070
Current practices for reporting the level of differential privacy (DP) guarantees for machine learning (ML) algorithms provide an incomplete and potentially misleading picture of the guarantees and make it difficult to compare privacy levels across different settings. We argue for using Gaussian differential privacy (GDP) as the primary means of communicating DP guarantees in ML, with the full privacy profile as a secondary option in case GDP is too inaccurate. Unlike other widely used alternatives, GDP has only one parameter, which ensures easy comparability of guarantees, and it can accurately capture the full privacy profile of many important ML applications. To support our claims, we investigate the privacy profiles of state-of-the-art DP large-scale image classification and of the TopDown algorithm for the U.S. Decennial Census, observing that GDP fits the profiles remarkably well in both cases. Although GDP is ideal for reporting the final guarantees, other formalisms (e.g., privacy loss random variables) are needed for accurate privacy accounting. We show that such intermediate representations can be efficiently converted to GDP with minimal loss in tightness.
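For reference, $\mu$-GDP corresponds to the privacy profile $\delta(\varepsilon) = \Phi(-\varepsilon/\mu + \mu/2) - e^{\varepsilon}\,\Phi(-\varepsilon/\mu - \mu/2)$ (Dong, Roth and Su). The sketch below evaluates this profile and, as a simple illustration of reporting a single GDP parameter, solves for the $\mu$ whose profile passes through a given $(\varepsilon, \delta)$ point; the example numbers are arbitrary and not from the paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def gdp_delta(eps, mu):
    """Privacy profile of mu-GDP:
    delta(eps) = Phi(-eps/mu + mu/2) - exp(eps) * Phi(-eps/mu - mu/2)."""
    return norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2)

def mu_from_eps_delta(eps, delta):
    """Smallest mu whose GDP profile passes through the given (eps, delta) point,
    found by root-finding; one simple way to summarize a reported (eps, delta)
    guarantee with the single GDP parameter advocated above."""
    return brentq(lambda mu: gdp_delta(eps, mu) - delta, 1e-6, 100.0)

# Example: an illustrative guarantee of (eps=8, delta=1e-6), the kind of value
# often reported for large-scale DP image classification.
print(mu_from_eps_delta(8.0, 1e-6))
```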
https://arxiv.org/abs/2503.10945
We propose a general framework called VisionLogic to extract interpretable logic rules from deep vision models, with a focus on image classification tasks. Given any deep vision model that uses a fully connected layer as the output head, VisionLogic transforms neurons in the last layer into predicates and grounds them into vision concepts using causal validation. In this way, VisionLogic can provide local explanations for single images and global explanations for specific classes in the form of logic rules. Compared to existing interpretable visualization tools such as saliency maps, VisionLogic addresses several key challenges, including the lack of causal explanations, overconfidence in visualizations, and ambiguity in interpretation. VisionLogic also facilitates the study of visual concepts encoded by predicates, particularly how they behave under perturbation -- an area that remains underexplored in the field of hidden semantics. Apart from providing better visual explanations and insights into the visual concepts learned by the model, we show that VisionLogic retains most of the neural network's discriminative power in an interpretable and transparent manner. We envision it as a bridge between complex model behavior and human-understandable explanations, providing trustworthy and actionable insights for real-world applications.
https://arxiv.org/abs/2503.10547
Zero-shot learning (ZSL) holds tremendous potential for histopathology image analysis by enabling models to generalize to unseen classes without extensive labeled data. Recent advancements in vision-language models (VLMs) have expanded the capabilities of ZSL, allowing models to perform tasks without task-specific fine-tuning. However, applying VLMs to histopathology presents considerable challenges due to the complexity of histopathological imagery and the nuanced nature of diagnostic tasks. In this paper, we propose a novel framework called Multi-Resolution Prompt-guided Hybrid Embedding (MR-PHE) to address these challenges in zero-shot histopathology image classification. MR-PHE leverages multi-resolution patch extraction to mimic the diagnostic workflow of pathologists, capturing both fine-grained cellular details and broader tissue structures critical for accurate diagnosis. We introduce a hybrid embedding strategy that integrates global image embeddings with weighted patch embeddings, effectively combining local and global contextual information. Additionally, we develop a comprehensive prompt generation and selection framework, enriching class descriptions with domain-specific synonyms and clinically relevant features to enhance semantic understanding. A similarity-based patch weighting mechanism assigns attention-like weights to patches based on their relevance to class embeddings, emphasizing diagnostically important regions during classification. Our approach utilizes the pretrained VLM CONCH for ZSL without requiring domain-specific fine-tuning, offering scalability and reducing dependence on large annotated datasets. Experimental results demonstrate that MR-PHE not only significantly improves zero-shot classification performance on histopathology datasets but also often surpasses fully supervised models.
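One way to realize the similarity-weighted hybrid embedding described above is sketched below; embedding extraction, prompt generation, and the exact weighting and fusion rules in the paper may differ, so alpha, the temperature, and the cosine-similarity weighting are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_embedding(global_emb, patch_embs, class_embs, alpha=0.5, temp=0.07):
    """Similarity-weighted patch aggregation fused with the global embedding.
    Patch weights come from each patch's best similarity to any class embedding,
    so diagnostically relevant regions dominate; alpha and temp are illustrative."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    g, p, c = unit(global_emb), unit(patch_embs), unit(class_embs)
    patch_scores = (p @ c.T).max(axis=1)                 # relevance of each patch
    w = softmax(patch_scores / temp)                     # attention-like weights
    weighted_patches = (w[:, None] * p).sum(axis=0)      # aggregated local view
    return unit(alpha * g + (1 - alpha) * unit(weighted_patches))

def zero_shot_classify(hybrid, class_embs):
    c = class_embs / np.linalg.norm(class_embs, axis=-1, keepdims=True)
    return int((c @ hybrid).argmax())

# Toy usage: one 512-d global embedding, 16 multi-resolution patch embeddings,
# and 3 prompt-derived class embeddings (all random stand-ins here).
rng = np.random.default_rng(0)
g, patches, classes = rng.normal(size=512), rng.normal(size=(16, 512)), rng.normal(size=(3, 512))
h = hybrid_embedding(g, patches, classes)
print(zero_shot_classify(h, classes))
```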
https://arxiv.org/abs/2503.10731
Federated learning (FL) enables the collaborative training of deep neural networks across decentralized data archives (i.e., clients) without sharing the local data of the clients. Most of the existing FL methods assume that the data distributed across all clients is associated with the same data modality. However, remote sensing (RS) images present in different clients can be associated with diverse data modalities. The joint use of the multi-modal RS data can significantly enhance classification performance. To effectively exploit decentralized and unshared multi-modal RS data, our paper introduces a novel multi-modal FL framework for RS image classification problems. The proposed framework comprises three modules: 1) multi-modal fusion (MF); 2) feature whitening (FW); and 3) mutual information maximization (MIM). The MF module employs iterative model averaging to facilitate learning without accessing multi-modal training data on clients. The FW module aims to address the limitations of training data heterogeneity by aligning data distributions across clients. The MIM module aims to model mutual information by maximizing the similarity between images from different modalities. For the experimental analyses, we focus our attention on multi-label classification and pixel-based classification tasks in RS. The results obtained using two benchmark archives show the effectiveness of the proposed framework when compared to state-of-the-art algorithms in the literature. The code of the proposed framework will be available at this https URL.
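The multi-modal fusion (MF) module's iterative model averaging follows the familiar federated-averaging pattern; a minimal sketch is given below with a toy linear model standing in for the RS classifier, covering only the averaging step and not the FW or MIM modules.

```python
import copy
import torch
import torch.nn as nn

def local_update(model, loader, epochs=1, lr=1e-3):
    """One client's local training round on its own (unshared) data."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            nn.functional.cross_entropy(model(x), y).backward()
            opt.step()
    return model.state_dict(), sum(len(y) for _, y in loader)

def federated_average(states_and_sizes):
    """Iterative model averaging (the MF module's core step, sketched as FedAvg):
    the server aggregates client weights without ever seeing client data."""
    total = sum(n for _, n in states_and_sizes)
    avg = {k: torch.zeros_like(v, dtype=torch.float32)
           for k, v in states_and_sizes[0][0].items()}
    for state, n in states_and_sizes:
        for k, v in state.items():
            avg[k] += v.float() * (n / total)
    return avg

# Toy usage: two clients share an architecture but hold different local data.
global_model = nn.Linear(8, 3)
loaders = [[(torch.randn(16, 8), torch.randint(0, 3, (16,)))] for _ in range(2)]
updates = [local_update(global_model, dl) for dl in loaders]
global_model.load_state_dict(federated_average(updates))
```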
https://arxiv.org/abs/2503.10262
Classifying images with an interpretable decision-making process is a long-standing problem in computer vision. In recent years, Prototypical Part Networks (ProtoPNets) have gained traction as an approach for self-explainable neural networks, due to their ability to mimic human visual reasoning by providing explanations based on prototypical object parts. However, the quality of the explanations generated by these methods leaves room for improvement, as the prototypes usually focus on repetitive and redundant concepts. Leveraging recent advances in prototype learning, we present a framework for part-based interpretable image classification that learns a set of semantically distinctive object parts for each class and provides diverse and comprehensive explanations. The core of our method is to learn the part-prototypes in a non-parametric fashion, by clustering deep features extracted from foundation vision models that encode robust semantic information. To quantitatively evaluate the quality of explanations provided by ProtoPNets, we introduce a Distinctiveness Score and a Comprehensiveness Score. Through evaluation on the CUB-200-2011, Stanford Cars and Stanford Dogs datasets, we show that our framework compares favourably against existing ProtoPNets while achieving better interpretability. Code is available at: this https URL.
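The non-parametric core, clustering foundation-model patch features into per-class part prototypes and classifying by prototype similarity, can be sketched as follows; the choice of k-means, the number of parts per class, and the cosine scoring rule are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_part_prototypes(patch_feats_per_class, parts_per_class=5, seed=0):
    """Non-parametric part prototypes: cluster each class's patch-level features
    (e.g., from a frozen foundation vision model) and keep the cluster centres
    as that class's part prototypes."""
    return {
        c: KMeans(n_clusters=parts_per_class, n_init=10, random_state=seed)
           .fit(feats).cluster_centers_
        for c, feats in patch_feats_per_class.items()
    }

def classify_by_prototypes(image_patches, prototypes):
    """Score each class by how well its part prototypes are matched by the image
    patches (mean of per-prototype best cosine similarity), then pick the best class."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    p = unit(image_patches)
    scores = {c: unit(protos) @ p.T for c, protos in prototypes.items()}
    return max(scores, key=lambda c: scores[c].max(axis=1).mean())

# Toy usage: 2 classes, each with 200 random "patch features" of dimension 64.
rng = np.random.default_rng(0)
per_class = {c: rng.normal(loc=c, size=(200, 64)) for c in (0, 1)}
protos = learn_part_prototypes(per_class)
print(classify_by_prototypes(rng.normal(loc=1, size=(30, 64)), protos))  # likely 1
```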
https://arxiv.org/abs/2503.10247
Efficient training of artificial neural networks remains a key challenge in deep learning. Backpropagation (BP), the standard learning algorithm, relies on gradient descent and typically requires numerous iterations for convergence. In this study, we introduce Expectation Reflection (ER), a novel learning approach that updates weights multiplicatively based on the ratio of observed to predicted outputs. Unlike traditional methods, ER maintains consistency without requiring ad hoc loss functions or learning rate hyperparameters. We extend ER to multilayer networks and demonstrate its effectiveness in performing image classification tasks. Notably, ER achieves optimal weight updates in a single iteration. Additionally, we reinterpret ER as a modified form of gradient descent incorporating the inverse mapping of target propagation. These findings suggest that ER provides an efficient and scalable alternative for training neural networks.
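A toy reading of the multiplicative update for a linear model with tanh output is sketched below; it is one plausible interpretation of scaling by the ratio of observed to predicted outputs, rather than the authors' exact procedure.

```python
import numpy as np

def expectation_reflection_fit(X, y, n_iter=10):
    """Toy reading of the multiplicative rule described above for a linear model
    with tanh output and labels y in {-1, +1}: each sample's pre-activation is
    rescaled by the ratio of observed to predicted output, and the weights are
    then re-fit by least squares.  A sketch, not the authors' exact algorithm."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = X @ w
        hc = np.where(np.abs(h) < 1e-6, 1e-6, h)   # avoid 0/0; h/tanh(h) -> 1 near 0
        target = y * hc / np.tanh(hc)              # observed/predicted rescaling
        w = np.linalg.lstsq(X, target, rcond=None)[0]
    return w

# Toy usage: linearly separable 2-class data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.r_[rng.normal(-1.5, 1.0, (200, 5)), rng.normal(1.5, 1.0, (200, 5))]
y = np.r_[-np.ones(200), np.ones(200)]
w = expectation_reflection_fit(X, y)
print((np.sign(X @ w) == y).mean())                # should be close to 1.0
```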
https://arxiv.org/abs/2503.10144