Concept Bottleneck Models (CBMs) ground image classification on human-understandable concepts to allow for interpretable model decisions. Crucially, the CBM design inherently allows for human interventions, in which expert users are given the ability to modify potentially misaligned concept choices to influence the decision behavior of the model in an interpretable fashion. However, existing approaches often require numerous human interventions per image to achieve strong performance, posing practical challenges in scenarios where obtaining human feedback is expensive. In this paper, we find that this is noticeably driven by an independent treatment of concepts during intervention, wherein a change of one concept does not influence the use of other ones in the model's final decision. To address this issue, we introduce a trainable concept intervention realignment module, which leverages concept relations to realign concept assignments post-intervention. Across standard, real-world benchmarks, we find that concept realignment can significantly improve intervention efficacy, substantially reducing the number of interventions needed to reach a target classification performance or concept prediction accuracy. In addition, it easily integrates into existing concept-based architectures without requiring changes to the models themselves. This reduced cost of human-model collaboration is crucial to enhancing the feasibility of CBMs in resource-constrained environments.
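The core idea, that intervening on one concept should propagate to related concepts rather than leave them untouched, can be illustrated with a toy stand-in for the paper's trainable realignment module. Here a fixed concept-correlation matrix (`corr`, an assumption for illustration) plays the role of the learned concept relations:

```python
import numpy as np

def intervene_and_realign(concepts, idx, value, corr):
    """Set concept `idx` to the expert-provided `value`, then nudge
    correlated concepts toward consistency using a correlation matrix.
    This is a toy stand-in for the paper's trainable realignment module."""
    c = concepts.copy()
    delta = value - c[idx]
    c[idx] = value
    # propagate the intervention to other concepts in proportion to correlation
    for j in range(len(c)):
        if j != idx:
            c[j] = np.clip(c[j] + corr[idx, j] * delta, 0.0, 1.0)
    return c

concepts = np.array([0.2, 0.8, 0.6])   # predicted concept probabilities
corr = np.array([[1.0, 0.9, 0.0],
                 [0.9, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])     # assumed concept relations
realigned = intervene_and_realign(concepts, 0, 1.0, corr)
```

Setting concept 0 to 1.0 also raises the strongly correlated concept 1, while the unrelated concept 2 is left unchanged; without realignment, only concept 0 would move.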
https://arxiv.org/abs/2405.01531
Generalization to unseen data remains poorly understood for deep learning classification and foundation models. How can one assess the ability of networks to adapt to new or extended versions of their input space in the spirit of few-shot learning, out-of-distribution generalization, and domain adaptation? Which layers of a network are likely to generalize best? We provide a new method for evaluating the capacity of networks to represent a sampled domain, regardless of whether the network has been trained on all classes in the domain. Our approach is the following: after fine-tuning state-of-the-art pre-trained models for visual classification on a particular domain, we assess their performance on data from related but distinct variations in that domain. Generalization power is quantified as a function of the latent embeddings of unseen data from intermediate layers for both unsupervised and supervised settings. Working throughout all stages of the network, we find that (i) high classification accuracy does not imply high generalizability; and (ii) deeper layers in a model do not always generalize the best, which has implications for pruning. Since the trends observed across datasets are largely consistent, we conclude that our approach reveals (a function of) the intrinsic capacity of the different layers of a model to generalize.
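A simple proxy for the kind of layer-wise evaluation described above is nearest-class-centroid accuracy on a layer's latent embeddings of unseen data; the paper's actual metric may differ, and the clustered data below is synthetic:

```python
import numpy as np

def layer_generalization_score(embeddings, labels):
    """Nearest-class-centroid accuracy on latent embeddings: a simple
    proxy for how well a layer's representation separates unseen classes."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[np.argmin(dists, axis=1)]
    return float((preds == labels).mean())

rng = np.random.default_rng(0)
# two well-separated clusters stand in for one layer's embeddings of unseen data
emb = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(3, 0.1, (20, 8))])
lab = np.array([0] * 20 + [1] * 20)
score = layer_generalization_score(emb, lab)
```

Computing such a score per layer lets one compare intermediate representations directly, which is how trends like "deeper is not always better" become visible.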
https://arxiv.org/abs/2405.01524
AI Foundation models are gaining traction in various applications, including medical fields like radiology. However, medical foundation models are often tested on limited tasks, leaving their generalisability and biases unexplored. We present RayDINO, a large visual encoder trained by self-supervision on 873k chest X-rays. We compare RayDINO to previous state-of-the-art models across nine radiology tasks, from classification and dense segmentation to text generation, and provide an in-depth analysis of population, age and sex biases of our model. Our findings suggest that self-supervision enables patient-centric AI that proves useful in clinical workflows and interprets X-rays holistically. With RayDINO and small task-specific adapters, we reach state-of-the-art results and improve generalization to unseen populations while mitigating bias, illustrating the true promise of foundation models: versatility and robustness.
https://arxiv.org/abs/2405.01469
This paper investigates the use of Large Language Models (LLMs) for automating the generation of hardware description code, aiming to explore their potential in supporting and enhancing the development of efficient neuromorphic computing architectures. Building on our prior work, we employ OpenAI's ChatGPT-4 and natural language prompts to synthesize an RTL Verilog module of a programmable recurrent spiking neural network, while also generating test benches to assess the system's correctness. The resultant design was validated in three case studies, the exclusive OR, the IRIS flower classification and the MNIST hand-written digit classification, achieving accuracies of up to 96.6%. To verify its synthesizability and implementability, the design was prototyped on a field-programmable gate array and implemented on SkyWater 130 nm technology by using an open-source electronic design automation flow. Additionally, we have submitted it to the Tiny Tapeout 6 chip fabrication program to further evaluate the system's on-chip performance in the future.
https://arxiv.org/abs/2405.01419
Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLMs requires expensive training costs, typically in hundreds of GPU-hours on A100, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which can leverage the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize the parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which is up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12-point increase in GPT-4 evaluation score for the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800 GPUs. We are the first to explore efficient 3D-LLMs, offering new insights to the community. Code and weights are available at this https URL.
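The mixture-of-query-experts idea can be sketched in miniature: a gate scores each expert from the input, each expert attention-pools the tokens with its own query, and the outputs are combined by the gate weights. All shapes, weights, and the single-query-per-expert simplification below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_query_experts(token_feats, expert_queries, gate_w):
    """Gate-weighted aggregation over per-expert attention pooling
    (a toy sketch; the real module's internals are not shown here)."""
    # gate: one logit per expert, computed from the mean token feature
    gate = softmax(token_feats.mean(axis=0) @ gate_w)      # (E,)
    pooled = []
    for q in expert_queries:                               # q: (d,)
        attn = softmax(token_feats @ q)                    # (T,)
        pooled.append(attn @ token_feats)                  # (d,)
    return (gate[:, None] * np.stack(pooled)).sum(axis=0)  # (d,)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))                  # 5 point-cloud tokens, dim 4
queries = [rng.normal(size=4) for _ in range(3)]  # 3 hypothetical experts
gate_w = rng.normal(size=(4, 3))
agg = mixture_of_query_experts(tokens, queries, gate_w)
```

The adaptive gating is what lets the module aggregate features cheaply: only a weighted sum is added on top of standard attention pooling.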
https://arxiv.org/abs/2405.01413
Pedestrian detection has significantly progressed in recent years, thanks to the development of DNNs. However, detection performance at occluded scenes is still far from satisfactory, as occlusion increases the intra-class variance of pedestrians, hindering the model from finding an accurate classification boundary between pedestrians and background clutters. From the perspective of reducing intra-class variance, we propose to complete features for occluded regions so as to align the features of pedestrians across different occlusion patterns. An important premise for feature completion is to locate occluded regions. From our analysis, channel features of different pedestrian proposals only show high correlation values at visible parts and thus feature correlations can be used to model occlusion patterns. In order to narrow down the gap between completed features and real fully visible ones, we propose an adversarial learning method, which completes occluded features with a generator such that they can hardly be distinguished by the discriminator from real fully visible features. We report experimental results on the CityPersons, Caltech and CrowdHuman datasets. On CityPersons, we show significant improvements over five different baseline detectors, especially on the heavy occlusion subset. Furthermore, we show that our proposed method FeatComp++ achieves state-of-the-art results on all the above three datasets without relying on extra cues.
https://arxiv.org/abs/2405.01311
Self-training is a powerful approach to deep learning. The key process is to find a pseudo-label for modeling. However, previous self-training algorithms suffer from the over-confidence issue brought by hard labels, and even confidence-related regularizers cannot comprehensively capture the uncertainty. Therefore, we propose a new self-training framework to combine uncertainty information of both the model and the dataset. Specifically, we propose to use Expectation-Maximization (EM) to smooth the labels and comprehensively estimate the uncertainty information. We further design a basis extraction network to estimate the initial basis from the dataset. The obtained basis with uncertainty can be filtered based on uncertainty information. It can then be transformed into the real hard label to iteratively update the model and basis in the retraining process. Experiments on image classification and semantic segmentation show the advantages of our method among confidence-aware self-training algorithms, with 1-3 percentage-point improvements on different datasets.
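The filter-then-harden step common to uncertainty-aware self-training can be sketched as follows. This toy version uses predictive entropy as the uncertainty measure and a fixed threshold; the paper's EM smoothing and basis-extraction network are not reproduced here:

```python
import numpy as np

def select_pseudo_labels(probs, max_entropy=0.5):
    """Uncertainty-filtered self-training sketch: keep only samples whose
    predictive entropy is below a threshold, then harden the surviving
    soft labels for the retraining step."""
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    keep = entropy < max_entropy
    hard = probs.argmax(axis=1)
    return keep, hard

probs = np.array([[0.95, 0.05],   # confident -> kept
                  [0.55, 0.45]])  # uncertain -> filtered out
keep, hard = select_pseudo_labels(probs)
```

Filtering before hardening is what counters the over-confidence issue: uncertain predictions never become hard training targets.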
https://arxiv.org/abs/2405.01175
Anomaly detection (AD) is a crucial process often required in industrial settings. Anomalies can signal underlying issues within a system, prompting further investigation. Industrial processes aim to streamline operations as much as possible, encompassing the production of the final product, making AD an essential means to reach this goal. Conventional anomaly detection methodologies typically classify observations as either normal or anomalous without providing insight into the reasons behind these classifications. Consequently, in light of the emergence of Industry 5.0, a more desirable approach involves providing interpretable outcomes, enabling users to understand the rationale behind the results. This paper presents the first industrial application of ExIFFI, a recently developed approach focused on the production of fast and efficient explanations for the Extended Isolation Forest (EIF) anomaly detection method. ExIFFI is tested on two publicly available industrial datasets, demonstrating superior effectiveness in explanations and computational efficiency with respect to other state-of-the-art explainable AD models.
https://arxiv.org/abs/2405.01158
Whistleblowing is essential for ensuring transparency and accountability in both public and private sectors. However, (potential) whistleblowers often fear or face retaliation, even when reporting anonymously. The specific content of their disclosures and their distinct writing style may re-identify them as the source. Legal measures, such as the EU WBD, are limited in their scope and effectiveness. Therefore, computational methods to prevent re-identification are important complementary tools for encouraging whistleblowers to come forward. However, current text sanitization tools follow a one-size-fits-all approach and take an overly limited view of anonymity. They aim to mitigate identification risk by replacing typical high-risk words (such as person names and other NE labels) and combinations thereof with placeholders. Such an approach, however, is inadequate for the whistleblowing scenario since it neglects further re-identification potential in textual features, including writing style. Therefore, we propose, implement, and evaluate a novel classification and mitigation strategy for rewriting texts that involves the whistleblower in the assessment of the risk and utility. Our prototypical tool semi-automatically evaluates risk at the word/term level and applies risk-adapted anonymization techniques to produce a grammatically disjointed yet appropriately sanitized text. We then use an LLM that we fine-tuned for paraphrasing to render this text coherent and style-neutral. We evaluate our tool's effectiveness using court cases from the ECHR and excerpts from a real-world whistleblower testimony and measure the protection against authorship attribution (AA) attacks and utility loss statistically using the popular IMDb62 movie reviews dataset. Our method can significantly reduce AA accuracy from 98.81% to 31.22%, while preserving up to 73.1% of the original content's semantics.
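The placeholder-substitution stage of such a pipeline can be sketched minimally; the risky terms and categories below are made up, and in the described system the user helps flag them and a fine-tuned paraphrasing LLM would later restore fluency and neutralize style:

```python
import re

def sanitize(text, risk_terms):
    """Word-level risk mitigation (sketch): replace terms flagged as
    high-risk with category placeholders. This covers only the NE-style
    substitution step, not stylistic anonymization."""
    out = text
    for term, category in risk_terms.items():
        out = re.sub(re.escape(term), f"[{category}]", out)
    return out

text = "Alice Smith reported the incident at Acme Corp."
risk = {"Alice Smith": "PERSON", "Acme Corp": "ORG"}  # hypothetical flags
sanitized = sanitize(text, risk)
```

As the abstract argues, this substitution alone is insufficient for whistleblowing, since writing style survives it, which is why the paraphrasing stage matters.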
https://arxiv.org/abs/2405.01097
The 3D Swin Transformer (3D-ST), known for its hierarchical attention and window-based processing, excels in capturing intricate spatial relationships within images. The Spatial-Spectral Transformer (SST), meanwhile, specializes in modeling long-range dependencies through self-attention mechanisms. Therefore, this paper introduces a novel method: an attentional fusion of these two transformers to significantly enhance the classification performance of Hyperspectral Images (HSIs). What sets this approach apart is its emphasis on the integration of attentional mechanisms from both architectures. This integration not only refines the modeling of spatial and spectral information but also contributes to achieving more precise and accurate classification results. Experiments on benchmark HSI datasets underscore the importance of employing disjoint training, validation, and test samples. The results demonstrate the effectiveness of the fusion approach, showcasing its superiority over traditional methods and individual transformers. Incorporating disjoint samples enhances the robustness and reliability of the proposed methodology, emphasizing its potential for advancing hyperspectral image classification.
https://arxiv.org/abs/2405.01095
Although pre-trained language models have exhibited great flexibility and versatility with prompt-based few-shot learning, they suffer from the extensive parameter size and limited applicability for inference. Recent studies have suggested that PLMs be used as dataset generators and a tiny task-specific model be trained to achieve efficient inference. However, their applicability to various domains is limited because they tend to generate domain-specific datasets. In this work, we propose a novel approach to universal domain generalization that generates a dataset regardless of the target domain. This allows for generalization of the tiny task model to any domain that shares the label space, thus enhancing the real-world applicability of the dataset generation paradigm. Our experiments indicate that the proposed method accomplishes generalizability across various domains while using a parameter set that is orders of magnitude smaller than PLMs.
https://arxiv.org/abs/2405.01022
Neural collapse (NC) is a simple and symmetric phenomenon for deep neural networks (DNNs) at the terminal phase of training, where the last-layer features collapse to their class means and form a simplex equiangular tight frame aligning with the classifier vectors. However, the relationship of the last-layer features to the data and intermediate layers during training remains unexplored. To this end, we characterize the geometry of intermediate layers of ResNet and propose a novel conjecture, progressive feedforward collapse (PFC), claiming that the degree of collapse increases during the forward propagation of DNNs. We derive a transparent model for the well-trained ResNet based on the observation that ResNet with weight decay approximates the geodesic curve in Wasserstein space at the terminal phase. The metrics of PFC indeed monotonically decrease across depth on various datasets. We propose a new surrogate model, the multilayer unconstrained feature model (MUFM), connecting intermediate layers by an optimal transport regularizer. The optimal solution of MUFM is inconsistent with NC but is more concentrated relative to the input data. Overall, this study extends NC to PFC to model the collapse phenomenon of intermediate layers and its dependence on the input data, shedding light on the theoretical understanding of ResNet in classification problems.
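A simplified version of the NC1-style variability metric that such a conjecture tracks across depth is the ratio of within-class to between-class scatter of a layer's features; the toy 2-D features below are fabricated to show a "shallow" layer versus a more collapsed "deep" layer:

```python
import numpy as np

def collapse_metric(features, labels):
    """Within-class over between-class scatter: small values mean
    features have collapsed toward their class means (a simplified
    collapse measure, not the paper's exact metric)."""
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in classes:
        fc = features[labels == c]
        mu = fc.mean(axis=0)
        within += ((fc - mu) ** 2).sum()
        between += len(fc) * ((mu - global_mean) ** 2).sum()
    return within / between

labels = np.array([0, 0, 1, 1])
shallow = np.array([[0.0, 0], [2.0, 0], [3.0, 0], [5.0, 0]])  # spread classes
deep = np.array([[0.9, 0], [1.1, 0], [3.9, 0], [4.1, 0]])     # collapsed classes
```

PFC claims precisely that this kind of metric decreases monotonically as features propagate forward through a well-trained network.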
https://arxiv.org/abs/2405.00985
Graph Neural Networks (GNNs) demonstrate excellent performance on graphs, with their core idea of aggregating neighborhood information and learning from labels. However, the prevailing challenges in most graph datasets are twofold: insufficient high-quality labels and a lack of neighborhoods, resulting in weak GNNs. Existing data augmentation methods designed to address these two issues often tackle only one. They may either require extensive training of generators, rely on overly simplistic strategies, or demand substantial prior knowledge, leading to suboptimal generalization abilities. To simultaneously address both of these challenges, we propose an elegant method called IntraMix. IntraMix innovatively employs Mixup among low-quality labeled data of the same class, generating high-quality labeled data at minimal cost. Additionally, it establishes neighborhoods for the generated data by connecting them with data from the same class with high confidence, thereby enriching the neighborhoods of graphs. IntraMix efficiently tackles both challenges faced by graphs and challenges the prior notion of the limited effectiveness of Mixup in node classification. IntraMix serves as a universal framework that can be readily applied to all GNNs. Extensive experiments demonstrate the effectiveness of IntraMix across various GNNs and datasets.
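The intra-class Mixup step can be sketched in a few lines: mixing two (possibly noisily labeled) samples of the same class yields a new sample that keeps the class label, so label noise tends to average out. This is a minimal sketch of the idea, not the full IntraMix pipeline (the neighborhood-construction step is omitted):

```python
import numpy as np

def intra_class_mixup(features, labels, cls, lam=0.5, rng=None):
    """Mixup restricted to samples of one class: the mixed feature
    keeps the class label `cls`."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = np.flatnonzero(labels == cls)
    i, j = rng.choice(idx, size=2, replace=False)
    mixed = lam * features[i] + (1 - lam) * features[j]
    return mixed, cls

feats = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])
mixed, y = intra_class_mixup(feats, labels, cls=0)
```

Restricting Mixup to one class sidesteps the label-interpolation ambiguity that made Mixup look ineffective for node classification.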
https://arxiv.org/abs/2405.00957
Limited diversity in standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a comprehensive benchmark for evaluating ARL methods on diverse audio classification domains, covering acoustic events, music, and speech. ARCH comprises 12 datasets that allow us to thoroughly assess pre-trained SSL models of different sizes. ARCH streamlines benchmarking of ARL techniques through its unified access to a wide range of domains and its ability to readily incorporate new datasets and models. To address the current lack of open-source, pre-trained models for non-speech audio, we also release new pre-trained models that demonstrate strong performance on non-speech datasets. We argue that the presented wide-ranging evaluation provides valuable insights into state-of-the-art ARL methods, and is useful to pinpoint promising research directions.
https://arxiv.org/abs/2405.00934
Background and Purpose: Identifying the thromboembolism source in ischemic stroke is crucial for treatment and secondary prevention yet is often undetermined. This study describes a self-supervised deep learning approach in digital pathology of emboli for classifying ischemic stroke clot origin from histopathological images. Methods: The dataset included whole slide images (WSI) from the STRIP AI Kaggle challenge, consisting of retrieved clots from ischemic stroke patients following mechanical thrombectomy. Transformer-based deep learning models were developed using transfer learning and self-supervised pretraining for classifying WSI. Customizations included an attention pooling layer, weighted loss function, and threshold optimization. Various model architectures were tested and compared, and model performances were primarily evaluated using weighted logarithmic loss. Results: The model achieved a logloss score of 0.662 in cross-validation and 0.659 on the test set. Different model backbones were compared, with swin_large_patch4_window12_384 showing the highest performance. Thresholding techniques for clot origin classification were employed to balance false positives and negatives. Conclusion: The study demonstrates the efficacy of transformer-based deep learning models in identifying ischemic stroke clot origins from histopathological images and emphasizes the need for refined modeling techniques specifically adapted to thrombi WSI. Further research is needed to improve model performance and interpretability, and to validate its effectiveness. Future enhancements could include integrating larger patient cohorts, advanced preprocessing strategies, and exploring ensemble multimodal methods for enhanced diagnostic accuracy.
https://arxiv.org/abs/2405.00908
Over the last decade, similar to other application domains, social media content has been proven very effective in disaster informatics. However, due to the unstructured nature of the data, several challenges are associated with disaster analysis in social media content. To fully explore the potential of social media content in disaster informatics, access to relevant content and the correct geo-location information is very critical. In this paper, we propose a three-step solution to tackling these challenges. Firstly, the proposed solution aims to classify social media posts into relevant and irrelevant posts, followed by the automatic extraction of location information from the posts' text through Named Entity Recognition (NER) analysis. Finally, to quickly analyze the topics covered in large volumes of social media posts, we perform topic modeling resulting in a list of top keywords that highlight the issues discussed in the tweets. For the Relevant Classification of Twitter Posts (RCTP), we proposed a merit-based fusion framework combining the capabilities of four different models, namely BERT, RoBERTa, DistilBERT, and ALBERT, obtaining the highest F1-score of 0.933 on a benchmark dataset. For the Location Extraction from Twitter Text (LETT), we evaluated four models, namely BERT, RoBERTa, DistilBERT, and ELECTRA, in an NER framework, obtaining the highest F1-score of 0.960. For topic modeling, we used the BERTopic library to discover the hidden topic patterns in the relevant tweets. The experimental results of all the components of the proposed end-to-end solution are very encouraging and hint at the potential of social media content and NLP in disaster management.
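One plausible reading of a merit-based fusion framework is a late fusion that weights each model's class probabilities by its validation score before taking the argmax; the weights and probabilities below are hypothetical, and the paper's exact fusion rule may differ:

```python
import numpy as np

def merit_fusion(model_probs, merits):
    """Merit-based late fusion (sketch): average each model's class
    probabilities weighted by its normalized validation score."""
    merits = np.asarray(merits, dtype=float)
    weights = merits / merits.sum()
    fused = sum(w * p for w, p in zip(weights, model_probs))
    return fused.argmax(axis=1), fused

# three "models" scoring two posts as relevant (0) / irrelevant (1)
probs = [np.array([[0.6, 0.4], [0.2, 0.8]]),
         np.array([[0.7, 0.3], [0.4, 0.6]]),
         np.array([[0.3, 0.7], [0.1, 0.9]])]
preds, fused = merit_fusion(probs, merits=[0.90, 0.92, 0.80])
```

Because the weights sum to one, the fused rows remain valid probability distributions, so the same thresholding machinery used for a single model still applies.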
https://arxiv.org/abs/2405.00903
Vision language models (VLMs) have recently emerged and gained the spotlight for their ability to comprehend the dual modality of image and textual data. VLMs such as LLaVA, ChatGPT-4, and Gemini have recently shown impressive performance on tasks such as natural image captioning, visual question answering (VQA), and spatial reasoning. Additionally, the Segment Anything Model (SAM), a universal segmentation model by Meta AI, shows unprecedented performance at isolating objects from unforeseen images. Since medical experts, biologists, and materials scientists routinely examine microscopy or medical images in conjunction with textual information in the form of captions, literature, or reports, and draw conclusions of great importance and merit, it is indubitably essential to test the performance of VLMs and foundation models such as SAM on these images. In this study, we charge ChatGPT, LLaVA, Gemini, and SAM with classification, segmentation, counting, and VQA tasks on a variety of microscopy images. We observe that ChatGPT and Gemini are impressively able to comprehend the visual features in microscopy images, while SAM is quite capable at isolating artefacts in a general sense. However, the performance is not close to that of a domain expert: the models are readily encumbered by the introduction of impurities, defects, artefact overlaps and diversity present in the images.
https://arxiv.org/abs/2405.00876
Objective: To detect infected wounds in Diabetic Foot Ulcers (DFUs) from photographs, preventing severe complications and amputations. Methods: This paper proposes the Guided Conditional Diffusion Classifier (ConDiff), a novel deep-learning infection detection model that combines guided image synthesis with a denoising diffusion model and distance-based classification. The process involves (1) generating guided conditional synthetic images by injecting Gaussian noise into a guide image, followed by denoising the noise-perturbed image through a reverse diffusion process conditioned on infection status, and (2) classifying infections based on the minimum Euclidean distance between synthesized images and the original guide image in embedding space. Results: ConDiff demonstrated superior performance with an accuracy of 83% and an F1-score of 0.858, outperforming state-of-the-art models by at least 3%. The use of a triplet loss function reduces overfitting in the distance-based classifier. Conclusions: ConDiff not only enhances diagnostic accuracy for DFU infections but also pioneers the use of generative discriminative models for detailed medical image analysis, offering a promising approach for improving patient outcomes.
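Step (2), the distance-based decision rule, can be sketched directly: given an embedding of the guide image and embeddings of the images synthesized under each infection condition, predict the condition whose synthesis lies closest. The embeddings below are hypothetical stand-ins for the outputs of the diffusion and embedding models:

```python
import numpy as np

def condiff_classify(guide_emb, synth_embs):
    """Distance-based classification step of ConDiff (sketch): the
    predicted class is the condition whose synthesized image lies
    closest to the guide image in embedding space."""
    dists = {c: float(np.linalg.norm(guide_emb - e)) for c, e in synth_embs.items()}
    return min(dists, key=dists.get), dists

guide = np.array([0.9, 0.1])                       # hypothetical guide embedding
synth = {"infected": np.array([0.8, 0.2]),         # hypothetical class syntheses
         "uninfected": np.array([0.1, 0.9])}
pred, dists = condiff_classify(guide, synth)
```

The intuition is that the reverse diffusion conditioned on the correct status reconstructs something close to the guide image, while the wrong condition pulls the synthesis away from it.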
https://arxiv.org/abs/2405.00858
Differences in image quality, lighting conditions, and patient demographics pose challenges to automated glaucoma detection from color fundus photography. Brighteye, a method based on Vision Transformer, is proposed for glaucoma detection and glaucomatous feature classification. Brighteye learns long-range relationships among pixels within large fundus images using a self-attention mechanism. Prior to being input into Brighteye, the optic disc is localized using YOLOv8, and the region of interest (ROI) around the disc center is cropped to ensure alignment with clinical practice. Optic disc detection improves the sensitivity at 95% specificity from 79.20% to 85.70% for glaucoma detection and the Hamming distance from 0.2470 to 0.1250 for glaucomatous feature classification. In the developmental stage of the Justified Referral in AI Glaucoma Screening (JustRAIGS) challenge, the overall outcome secured the fifth position out of 226 entries.
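The ROI-cropping preprocessing step can be sketched as a square crop around the detected disc center, clamped to the image bounds; the detector itself (YOLOv8 in the paper) is outside this sketch, and the center/size values are illustrative:

```python
import numpy as np

def crop_roi(image, center, size):
    """Crop a square region of interest around the detected optic-disc
    center, clamping the window so it stays inside the image."""
    h, w = image.shape[:2]
    half = size // 2
    cy = min(max(center[0], half), h - half)
    cx = min(max(center[1], half), w - half)
    return image[cy - half:cy + half, cx - half:cx + half]

img = np.arange(100).reshape(10, 10)   # stand-in for a fundus image
roi = crop_roi(img, center=(2, 2), size=4)
```

Cropping to the disc region before the transformer mirrors how clinicians focus on the optic disc, which is what drove the reported sensitivity gains.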
https://arxiv.org/abs/2405.00857
We propose WIBA, a novel framework and suite of methods that enable the comprehensive understanding of "What Is Being Argued" across contexts. Our approach develops a comprehensive framework that detects: (a) the existence, (b) the topic, and (c) the stance of an argument, correctly accounting for the logical dependence among the three tasks. Our algorithm leverages the fine-tuning and prompt-engineering of Large Language Models. We evaluate our approach and show that it performs well in all the three capabilities. First, we develop and release an Argument Detection model that can classify a piece of text as an argument with an F1 score between 79% and 86% on three different benchmark datasets. Second, we release a language model that can identify the topic being argued in a sentence, be it implicit or explicit, with an average similarity score of 71%, outperforming current naive methods by nearly 40%. Finally, we develop a method for Argument Stance Classification, and evaluate the capability of our approach, showing it achieves a classification F1 score between 71% and 78% across three diverse benchmark datasets. Our evaluation demonstrates that WIBA allows the comprehensive understanding of What Is Being Argued in large corpora across diverse contexts, which is of core interest to many applications in linguistics, communication, and social and computer science. To facilitate accessibility to the advancements outlined in this work, we release WIBA as a free open access platform (wiba.dev).
https://arxiv.org/abs/2405.00828