Graph Neural Networks (GNNs) are powerful tools for graph classification. One important operation for GNNs is downsampling, or pooling, which learns effective embeddings from the node representations. In this paper, we propose a new hierarchical pooling operation, the Edge-Node Attention-based Differentiable Pooling (ENADPool), for GNNs to learn effective graph representations. Unlike classical hierarchical pooling, which relies on ambiguous node assignments and simply averages the features over the nodes of each cluster, the proposed ENADPool not only employs a hard clustering strategy to assign each node to a unique cluster, but also compresses the node features as well as their edge connectivity strengths into the resulting hierarchical structure through an attention mechanism after each pooling step. As a result, ENADPool simultaneously identifies the importance of the nodes within each cluster and of the edges between the corresponding clusters, which significantly addresses the shortcoming of uniform edge-node structural information aggregation in classical hierarchical pooling. Moreover, to mitigate the over-smoothing problem of existing GNNs, we propose a Multi-distance GNN (MD-GNN) model associated with the proposed ENADPool operation, allowing the nodes to actively and directly receive feature information from neighbors at different random walk steps. Experiments demonstrate the effectiveness of the MD-GNN associated with the proposed ENADPool.
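One pooling step can be sketched as follows under our reading of the abstract; the assignment logits and the toy attention scores (here simply feature norms) are stand-ins for the model's learned assignment and attention modules, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def enad_pool(X, A, S_logits):
    """One hard-assignment, attention-weighted pooling step (sketch).

    X: (n, d) node features; A: (n, n) adjacency;
    S_logits: (n, k) cluster scores from an assignment network.
    """
    n, k = S_logits.shape
    # Hard clustering: each node goes to exactly one cluster.
    S = np.zeros((n, k))
    S[np.arange(n), S_logits.argmax(axis=1)] = 1.0
    # Node attention within each cluster (toy scores: feature norms).
    scores = np.linalg.norm(X, axis=1)
    alpha = np.zeros(n)
    for c in range(k):
        idx = np.where(S[:, c] == 1)[0]
        if len(idx):
            alpha[idx] = softmax(scores[idx])
    # Compress attention-weighted node features into cluster features.
    X_pool = S.T @ (alpha[:, None] * X)          # (k, d)
    # Weight edge connectivity strengths before coarsening.
    W = A * np.outer(alpha, alpha)
    A_pool = S.T @ W @ S                         # (k, k)
    return X_pool, A_pool
```

Stacking several such steps yields the hierarchical structure the abstract describes.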
https://arxiv.org/abs/2405.10218
Hyperspectral target detection (HTD) aims to identify specific materials based on spectral information in hyperspectral imagery and can detect point targets, some of which occupy less than one pixel. However, existing HTD methods are built on per-pixel binary classification, which limits the feature representation capability for point targets. In this paper, we rethink hyperspectral point target detection from the object detection perspective, focusing on object-level prediction capability rather than pixel classification capability. Inspired by the token-based processing flow of the Detection Transformer (DETR), we propose the first specialized network for hyperspectral multi-class point object detection, SpecDETR. Dispensing with the backbone of current object detection frameworks, SpecDETR treats the spectral features of each pixel in a hyperspectral image as a token and utilizes a multi-layer Transformer encoder with local and global coordination attention modules to extract deep joint spatial-spectral features. SpecDETR regards point object detection as a one-to-many set prediction problem, thereby achieving a concise and efficient DETR decoder that surpasses the current state-of-the-art DETR decoders in terms of parameters and point object detection accuracy. We develop a simulated hyperSpectral Point Object Detection benchmark termed SPOD and, for the first time, evaluate and compare the performance of current object detection networks and HTD methods on hyperspectral multi-class point object detection. SpecDETR demonstrates superior performance compared to current object detection networks and HTD methods on the SPOD dataset. Additionally, we validate on a public HTD dataset that, by using data simulation instead of manual annotation, SpecDETR can detect real-world single-spectral point objects directly.
https://arxiv.org/abs/2405.10148
Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during inference. However, our initial investigation indicates that existing OvOD detectors exhibit significant variability when dealing with vocabularies across various semantic granularities, posing a concern for real-world deployment. To this end, we introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies. It runs offline in three steps: i) it retrieves relevant super-/sub-categories from a hierarchy for each target class; ii) it integrates these categories into hierarchy-aware sentences; iii) it fuses these sentence embeddings to generate the nexus classifier vector. Our evaluation on various detection benchmarks demonstrates that SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies, while retaining improvements using hierarchies generated by large language models. Moreover, when applied to open-vocabulary classification on ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector, without incurring additional computational overhead during inference. The code is open source.
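The three offline steps can be sketched as below; the hash-based `embed` function is a stand-in for the detector's real text encoder (e.g. CLIP), and the sentence templates and hierarchy format are our assumptions:

```python
import hashlib
import numpy as np

def embed(text, dim=64):
    """Deterministic hash-based stand-in for a text encoder such as CLIP."""
    h = hashlib.sha256(text.encode()).digest()
    rng = np.random.default_rng(int.from_bytes(h[:8], "little"))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def shine_nexus(target, hierarchy, dim=64):
    """Build a nexus classifier vector for `target` from a class hierarchy.

    hierarchy: dict class -> {"super": [...], "sub": [...]} (toy format).
    """
    rel = hierarchy.get(target, {"super": [], "sub": []})
    # i) retrieve super-/sub-categories; ii) hierarchy-aware sentences.
    sentences = [f"a photo of a {target}"]
    sentences += [f"a photo of a {target}, which is a kind of {s}"
                  for s in rel["super"]]
    sentences += [f"a photo of a {sub}, which is a kind of {target}"
                  for sub in rel["sub"]]
    # iii) fuse the sentence embeddings into one normalized nexus vector.
    v = np.mean([embed(s, dim) for s in sentences], axis=0)
    return v / np.linalg.norm(v)
```

Since everything runs offline per class, the resulting vectors drop into any OvOD classifier head at no inference cost, matching the training-free claim.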
https://arxiv.org/abs/2405.10053
Large language models (LLMs) are versatile and can address many tasks, but for computational efficiency, it is often desirable to distill their capabilities into smaller student models. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is "seeded" with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to standard 32-shot prompting and six baseline approaches.
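The "seeding" idea can be sketched as prompt construction; the template wording and function name are our assumptions, and the real pipeline would send each prompt to an LLM:

```python
def synthesizrr_prompts(label, instruction, passages):
    """One generation prompt per retrieved passage (sketch).

    Because every prompt is grounded in a different retrieved passage,
    the generated examples vary in content and style rather than
    repeating the LLM's parametric favorites.
    """
    prompts = []
    for p in passages:
        prompts.append(
            f"Passage: {p}\n"
            f"Task: {instruction}\n"
            f"Write one example of class '{label}' grounded in the passage."
        )
    return prompts
```

A retriever (BM25 or dense) supplies the passages; the refinement stage then filters or edits the generations before distillation.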
https://arxiv.org/abs/2405.10040
We revisit the classical problem of multiclass classification with bandit feedback (Kakade, Shalev-Shwartz and Tewari, 2008), where each input is classified into one of $K$ possible labels and feedback is restricted to whether the predicted label is correct or not. Our primary inquiry concerns the dependency on the number of labels $K$, and whether $T$-step regret bounds in this setting can be improved beyond the $\smash{\sqrt{KT}}$ dependence exhibited by existing algorithms. Our main contribution is in showing that the minimax regret of bandit multiclass is in fact more nuanced, and is of the form $\smash{\widetilde{\Theta}\left(\min \left\{|\mathcal{H}| + \sqrt{T}, \sqrt{KT \log |\mathcal{H}|} \right\} \right)}$, where $\mathcal{H}$ is the underlying (finite) hypothesis class. In particular, we present a new bandit classification algorithm that guarantees regret $\smash{\widetilde{O}(|\mathcal{H}|+\sqrt{T})}$, improving over classical algorithms for moderately-sized hypothesis classes, and give a matching lower bound establishing tightness of the upper bounds (up to log-factors) in all parameter regimes.
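A quick numeric check of the two regimes in the stated rate (ignoring constants and logarithmic factors):

```python
import math

def minimax_regret_rate(H, K, T):
    """Rate from the abstract, up to log factors:
    min{ |H| + sqrt(T), sqrt(K * T * log|H|) }."""
    return min(H + math.sqrt(T), math.sqrt(K * T * math.log(H)))

# Moderately-sized hypothesis class: the new |H| + sqrt(T) term wins.
small_H = minimax_regret_rate(H=100, K=1000, T=10_000)
# Huge hypothesis class: the classical sqrt(KT log|H|) term wins.
large_H = minimax_regret_rate(H=10**6, K=10, T=10_000)
```

With $|\mathcal{H}| = 100$, $K = 1000$, $T = 10^4$ the first term gives $200$ against roughly $6800$ for the second, illustrating the improvement over the classical $\sqrt{KT}$-type dependence.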
https://arxiv.org/abs/2405.10027
This article describes the Data-Efficient Low-Complexity Acoustic Scene Classification Task in the DCASE 2024 Challenge and the corresponding baseline system. The task setup is a continuation of previous editions (2022 and 2023), which focused on recording device mismatches and low-complexity constraints. This year's edition introduces an additional real-world problem: participants must develop data-efficient systems for five scenarios, which progressively limit the available training data. The provided baseline system is based on an efficient, factorized CNN architecture constructed from inverted residual blocks and uses Freq-MixStyle to tackle the device mismatch problem. The baseline system's accuracy ranges from 42.40% on the smallest to 56.99% on the largest training set.
https://arxiv.org/abs/2405.10018
The accelerated progress of artificial intelligence (AI) has popularized deep learning models across domains, yet their inherent opacity poses challenges, notably in critical fields like healthcare, medicine, and the geosciences. Explainable AI (XAI) has emerged to shed light on these "black box" models, helping decipher their decision-making process. Nevertheless, different XAI methods yield highly different explanations. This inter-method variability increases uncertainty and lowers trust in deep networks' predictions. In this study, for the first time, we propose a novel framework designed to enhance the explainability of deep networks by maximizing both the accuracy and the comprehensibility of the explanations. Our framework integrates various explanations from established XAI methods and employs a non-linear "explanation optimizer" to construct a unique and optimal explanation. Through experiments on multi-class and binary classification tasks in 2D object and 3D neuroscience imaging, we validate the efficacy of our approach. Our explanation optimizer achieved superior faithfulness scores, averaging 155% and 63% higher than the best-performing XAI method in the 3D and 2D applications, respectively. Additionally, our approach yielded lower complexity, increasing comprehensibility. Our results suggest that optimal explanations based on specific criteria are derivable, addressing the issue of inter-method variability in the current XAI literature.
https://arxiv.org/abs/2405.10008
Automated medical image analysis systems often require large amounts of training data with high-quality labels, which are difficult and time-consuming to generate. This paper introduces Radiology Objects in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, adding 35,705 new images that have appeared in PMC since 2018. It further provides manually curated concepts for imaging modalities, with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using the Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training of medical domain models and for the evaluation of deep learning models for multi-task learning.
https://arxiv.org/abs/2405.10004
Large pretrained transformers are increasingly being developed as generalised foundation models which can underpin powerful task-specific artificial intelligence models. Histopathology foundation models show promise across many tasks, but analyses have been limited by arbitrary hyperparameters that were not tuned to the specific task/dataset. We report the most rigorous single-task validation conducted to date of a histopathology foundation model, and the first performed in ovarian cancer subtyping. Attention-based multiple instance learning classifiers were compared using vision transformer and ResNet features generated through varied preprocessing and pretraining procedures. The training set consisted of 1864 whole slide images from 434 ovarian carcinoma cases at Leeds Hospitals. Five-class classification performance was evaluated through five-fold cross-validation, and these cross-validation models were ensembled for evaluation on a hold-out test set and an external set from the Transcanadian study. Reporting followed the TRIPOD+AI checklist. The vision transformer-based histopathology foundation model, UNI, performed best in every evaluation, with five-class balanced accuracies of 88% and 93% in hold-out internal and external testing, compared to the best ResNet model scores of 68% and 81%, respectively. Normalisations and augmentations aided the generalisability of ResNet-based models, but these still did not match the performance of UNI, which gave the best external performance in any ovarian cancer subtyping study to date. Histopathology foundation models offer a clear benefit to subtyping, improving classification performance to a degree where clinical utility is tangible, albeit with an increased computational burden. Such models could provide a second opinion in challenging cases and may improve the accuracy, objectivity, and efficiency of pathological diagnoses overall.
https://arxiv.org/abs/2405.09990
Classifying public tenders is a useful task both for companies that are invited to participate and for inspecting fraudulent activities. To facilitate the task for both participants and public administrations, the European Union introduced a common taxonomy (\textit{Common Procurement Vocabulary}, CPV), which is mandatory for tenders of a certain importance; however, the contracts in which a CPV label is mandatory are a minority of all Public Administration activities. Classifying over a real-world taxonomy introduces difficulties that cannot be ignored. First of all, some fine-grained classes have an insufficient number of observations (if any) in the training set, while other classes are far more frequent (even thousands of times) than the average. To overcome these difficulties, we present a zero-shot approach based on a pre-trained language model that relies only on label descriptions and respects the label taxonomy. To train our proposed model, we used industrial data, which comes from \url{this http URL}, a service by \href{this https URL}{SpazioDati s.r.l}. that collects public contracts stipulated in Italy in the last 25 years. Results show that the proposed model achieves better performance in classifying low-frequency classes compared to three different baselines, and is also able to predict never-seen classes.
https://arxiv.org/abs/2405.09983
The maturity classification of specialty crops such as strawberries and tomatoes is an essential agricultural downstream activity for selective harvesting and quality control (QC) at production and packaging sites. Recent advancements in Deep Learning (DL) have produced encouraging results on color images for maturity classification applications. However, hyperspectral imaging (HSI) outperforms methods based on color vision. Multivariate analysis methods and Convolutional Neural Networks (CNN) deliver promising results; however, the large amount of input data and the associated preprocessing requirements hinder practical application. Conventionally, the reflectance intensity over a given electromagnetic spectrum is employed to estimate fruit maturity. We present a feature extraction method and empirically demonstrate that the peak reflectance and its wavelength within the 500-670 nm subband (pigment band), and conversely the trough reflectance and its wavelength within the 671-790 nm subband (chlorophyll band), are convenient-to-compute yet distinctive features for maturity classification. The proposed feature selection method is beneficial because preprocessing, such as dimensionality reduction, is avoided before every prediction. The feature set is designed to capture these traits. The best SOTA methods, among 3D-CNN, 1D-CNN, and SVM, achieve at most 90.0% accuracy for strawberries and 92.0% for tomatoes on our dataset. Results show that the proposed method outperforms the SOTA, yielding accuracies above 98.0% in strawberry and 96.0% in tomato classification. A comparative analysis of the time efficiency of these methods shows that the proposed method performs prediction at 13 Frames Per Second (FPS), compared to the maximum of 1.16 FPS attained by the full-spectrum SVM classifier.
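The four proposed features are straightforward to extract from a single pixel spectrum; a sketch with the band edges taken from the abstract (the sampling grid and function name are assumed):

```python
import numpy as np

def maturity_features(wavelengths, reflectance):
    """Return (peak reflectance, peak wavelength) in the 500-670 nm
    pigment band and (trough reflectance, trough wavelength) in the
    671-790 nm chlorophyll band."""
    wl = np.asarray(wavelengths, dtype=float)
    r = np.asarray(reflectance, dtype=float)
    pig = (wl >= 500) & (wl <= 670)   # pigment band
    chl = (wl > 670) & (wl <= 790)    # chlorophyll band
    i = np.argmax(r[pig])             # peak in pigment band
    j = np.argmin(r[chl])             # trough in chlorophyll band
    return r[pig][i], wl[pig][i], r[chl][j], wl[chl][j]
```

Because the features are two argmax/argmin lookups over fixed subbands, no dimensionality reduction is needed at prediction time, which is consistent with the reported speed advantage.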
https://arxiv.org/abs/2405.09955
Multiple-instance learning (MIL) is an attractive approach for digital pathology applications as it reduces the costs related to data collection and labelling. However, it is not clear how sensitive MIL is to clinically realistic domain shifts, i.e., differences in data distribution that could negatively affect performance, or whether existing metrics for detecting domain shifts work well with these algorithms. We trained an attention-based MIL algorithm to classify whether a whole-slide image of a lymph node contains breast tumour metastases. The algorithm was evaluated on data from a hospital in a different country and on various subsets of this data corresponding to different levels of domain shift. Our contributions include showing that MIL for digital pathology is affected by clinically realistic differences in data, evaluating which features from a MIL model are most suitable for detecting changes in performance, and proposing an unsupervised metric named Fréchet Domain Distance (FDD) for quantifying domain shifts. Shift-measure performance was evaluated through the mean Pearson correlation with the change in classification performance, where FDD achieved 0.70 on 10-fold cross-validation models. The baselines, Deep Ensemble, Difference of Confidence, and Representation Shift, achieved mean Pearson correlations of 0.45, -0.29, and 0.56, respectively. FDD could be a valuable tool for care providers and vendors who need to verify whether a MIL system is likely to perform reliably when deployed at a new site, without requiring any additional annotations from pathologists.
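A plausible sketch of FDD, assuming it takes the usual Fréchet (2-Wasserstein) form between Gaussians fitted to features from the two domains, as in FID; the abstract does not spell out the exact formulation:

```python
import numpy as np

def frechet_domain_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets
    (n_samples, n_features), e.g. MIL attention-pooled features."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    ca = np.cov(feats_a, rowvar=False)
    cb = np.cov(feats_b, rowvar=False)
    # tr(sqrtm(Ca Cb)) equals the sum of square roots of the eigenvalues
    # of Ca @ Cb, which are real and non-negative for PSD covariances.
    eigs = np.linalg.eigvals(ca @ cb)
    tr_sqrt = np.sqrt(np.clip(eigs.real, 0.0, None)).sum()
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(ca + cb) - 2.0 * tr_sqrt)
```

The metric is unsupervised: it only needs unlabeled feature batches from the reference site and the new site, no pathologist annotations.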
https://arxiv.org/abs/2405.09934
Large-scale "foundation models" have gained traction as a way to leverage the vast amounts of unlabeled remote sensing data collected every day. However, due to the multiplicity of Earth Observation satellites, these models should learn "sensor agnostic" representations that generalize across sensor characteristics with minimal fine-tuning. This is complicated by data availability: low-resolution imagery, such as Sentinel-2 and Landsat-8 data, is available in large amounts, while very high-resolution aerial or satellite data is less common. To tackle these challenges, we introduce cross-sensor self-supervised training and alignment for remote sensing (X-STARS). We design a self-supervised training loss, the Multi-Sensor Alignment Dense loss (MSAD), to align representations across sensors, even with vastly different resolutions. Our X-STARS can be applied to train models from scratch, or to adapt large models pretrained on, e.g., low-resolution EO data to new high-resolution sensors in a continual pretraining framework. We collect and release MSC-France, a new multi-sensor dataset, on which we train our X-STARS models and then evaluate them on seven downstream classification and segmentation tasks. We demonstrate that X-STARS outperforms the state-of-the-art by a significant margin with less data, across various conditions of data availability and resolution.
https://arxiv.org/abs/2405.09922
Multi-task learning (MTL) is a learning paradigm that enables the simultaneous training of multiple communicating algorithms. Although MTL has been successfully applied to either regression or classification tasks alone, incorporating mixed types of tasks into a unified MTL framework remains challenging, primarily due to variations in the magnitudes of the losses associated with different tasks. This challenge, particularly evident in MTL applications with joint feature selection, often results in biased selections. To overcome this obstacle, we propose a provable loss weighting scheme that analytically determines the optimal weights for balancing regression and classification tasks. This scheme significantly mitigates the otherwise biased feature selection. Building upon this scheme, we introduce MTLComb, an MTL algorithm and software package encompassing optimization procedures, training protocols, and hyperparameter estimation procedures. MTLComb is designed for learning shared predictors among tasks of mixed types. To showcase the efficacy of MTLComb, we conduct tests on both simulated data and biomedical studies pertaining to sepsis and schizophrenia.
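The abstract does not spell out the analytic weighting formula. As a generic illustration of the underlying magnitude-balancing idea only (explicitly not the paper's provable scheme), one can weight each task by the inverse of its observed mean loss:

```python
import numpy as np

def balance_loss_weights(reg_losses, clf_losses):
    """Generic magnitude-balancing heuristic (NOT MTLComb's analytic
    scheme): weight each task loss by the inverse of its observed mean,
    so both task types contribute equally to the joint objective."""
    w_reg = 1.0 / float(np.mean(reg_losses))
    w_clf = 1.0 / float(np.mean(clf_losses))
    # After weighting, both tasks have expected contribution 1.0.
    return w_reg, w_clf
```

Without some such rebalancing, the task with the larger loss magnitude dominates the gradient of the shared feature-selection penalty, producing the biased selections the abstract describes.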
https://arxiv.org/abs/2405.09886
Automated region of interest detection in histopathological image analysis is a challenging and important topic with tremendous potential impact on clinical practice. The deep-learning methods used in computational pathology may help us to reduce costs and increase the speed and accuracy of cancer diagnosis. We started with the UNC Melanocytic Tumor Dataset cohort that contains 160 hematoxylin and eosin whole-slide images of primary melanomas (86) and nevi (74). We randomly assigned 80% (134) as a training set and built an in-house deep-learning method to allow for classification, at the slide level, of nevi and melanomas. The proposed method performed well on the other 20% (26) test dataset; the accuracy of the slide classification task was 92.3% and our model also performed well in terms of predicting the region of interest annotated by the pathologists, showing excellent performance of our model on melanocytic skin tumors. Even though we tested the experiments on the skin tumor dataset, our work could also be extended to other medical image detection problems to benefit the clinical evaluation and diagnosis of different tumors.
https://arxiv.org/abs/2405.09851
Large language models are well-known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstrating examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, including up to almost 2,000 multimodal demonstrating examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure ICL data efficiency of the models, or the rate at which the models learn from more demonstrating examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at this https URL .
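Batching many queries into one API call can be sketched as prompt construction plus response parsing; the numbering convention and the assumed "i: label" response format are our illustrations, not the paper's exact templates:

```python
def batch_prompt(instruction, demos, queries):
    """Build one prompt that shows the demos once and numbers the queries,
    amortizing the long many-shot context across all queries."""
    lines = [instruction, ""]
    for x, y in demos:
        lines += [f"Input: {x}", f"Label: {y}", ""]
    lines.append("Answer each of the following; reply one per line as 'i: label'.")
    for i, q in enumerate(queries, 1):
        lines.append(f"{i}. {q}")
    return "\n".join(lines)

def parse_batch_response(text, n):
    """Map a numbered response back to per-query answers (None if missing)."""
    out = {}
    for line in text.splitlines():
        head, sep, label = line.partition(":")
        if sep and head.strip().rstrip(".").isdigit():
            out[int(head.strip().rstrip("."))] = label.strip()
    return [out.get(i) for i in range(1, n + 1)]
```

With up to 50 queries per call, the demonstration tokens are paid for once rather than per query, which is where the reported cost and latency savings come from.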
https://arxiv.org/abs/2405.09798
Due to spatial redundancy in remote sensing images, sparse tokens containing rich information are usually employed in self-attention (SA) to reduce the overall number of tokens in the computation, avoiding the high computational cost of Vision Transformers. However, such methods usually obtain sparse tokens through hand-crafted or parallel-unfriendly designs, making it challenging to reach a better balance between efficiency and performance. Unlike these methods, this paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information while improving inference speed. Technically, the meta tokens are first initialized from image tokens via cross-attention. Then, we propose Dual Cross-Attention (DCA) to promote information exchange between image tokens and meta tokens, where they serve alternately as query and key (value) tokens in a dual-branch structure, significantly reducing the computational complexity compared to self-attention. By employing DCA in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT in various sizes. Experimental results on classification and dense prediction tasks show that LeMeViT achieves a significant $1.7 \times$ speedup, fewer parameters, and competitive performance compared to the baseline models, achieving a better trade-off between efficiency and performance.
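A single-head numpy sketch of one DCA round (projection matrices and multi-head details omitted): with $M \ll N$ meta tokens, both directions cost $O(NMd)$ rather than the $O(N^2 d)$ of self-attention:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    """Single-head cross-attention: rows of q attend over rows of kv."""
    d = q.shape[-1]
    attn = softmax(q @ kv.T / np.sqrt(d), axis=-1)
    return attn @ kv

def dual_cross_attention(img_tokens, meta_tokens):
    """One DCA round: meta tokens act as queries over image tokens,
    then image tokens act as queries over the updated meta tokens."""
    meta = cross_attention(meta_tokens, img_tokens)   # (M, d) gather
    img = cross_attention(img_tokens, meta)           # (N, d) re-read
    return img, meta
```

The meta tokens serve as a low-rank bottleneck through which all image-token interactions flow, which is what makes the early, dense stages affordable.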
https://arxiv.org/abs/2405.09789
In this paper, we explore the power of Quantum Machine Learning as we extend, implement and evaluate algorithms like Quantum Support Vector Classifier (QSVC), Pegasos-QSVC, Variational Quantum Circuits (VQC), and Quantum Neural Networks (QNN) in Qiskit with diverse feature mapping techniques for genomic sequence classification.
https://arxiv.org/abs/2405.09781
Discourse relation classification is an especially difficult task without explicit context markers \cite{Prasad2008ThePD}. Current approaches to implicit relation prediction rely solely on the two targeted neighboring sentences, ignoring the broader context of their surrounding environment \cite{Atwell2021WhereAW}. In this research, we propose three new methods for incorporating context into the task of sentence relation prediction: (1) Direct Neighbors (DNs), (2) Expanded Window Neighbors (EWNs), and (3) Part-Smart Random Neighbors (PSRNs). Our findings indicate that including context beyond one discourse unit is harmful in the task of discourse relation classification.
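The first two context constructions can be sketched as window selection over a document's sentences; the exact window sizes are our assumptions, and PSRN (which samples random neighbors in a part-aware way) is omitted:

```python
def direct_neighbors(sents, i, j):
    """DN: the target pair plus its immediately adjacent sentences."""
    lo = max(0, min(i, j) - 1)
    hi = min(len(sents), max(i, j) + 2)
    return sents[lo:hi]

def expanded_window_neighbors(sents, i, j, w=2):
    """EWN: widen the context window by w sentences on each side."""
    lo = max(0, min(i, j) - w)
    hi = min(len(sents), max(i, j) + 1 + w)
    return sents[lo:hi]
```

The selected window is what gets concatenated into the classifier input; the finding above is that widening it past the target discourse unit hurts rather than helps.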
https://arxiv.org/abs/2405.09735
The software industry is experiencing a surge in the adoption of Continuous Integration (CI) practices, in both commercial and open-source environments. CI practices facilitate the seamless integration of code changes by employing automated building and testing processes. Frameworks such as Travis CI and GitHub Actions have significantly contributed to simplifying and enhancing the CI process, rendering it more accessible and efficient for development teams. Despite the availability of these CI tools, developers continue to encounter difficulties in accurately flagging commits as either suitable for CI execution or as candidates for skipping, especially in large projects with many dependencies. Inaccurate flagging of commits can lead to resource-intensive test and build processes, as even minor commits may inadvertently trigger the full CI process. The problem of detecting CI-skip commits can be modeled as a binary classification task in which we decide either to build a commit or to skip it. This study proposes a novel solution that leverages Deep Reinforcement Learning techniques to construct an optimal Decision Tree classifier that addresses the imbalanced nature of the data. We evaluate our solution by running within-project and cross-project validation benchmarks on a diverse range of open-source projects hosted on GitHub, which showcased superior results compared with existing state-of-the-art methods.
https://arxiv.org/abs/2405.09657