Convolutional neural networks (CNNs) are essential tools for computer vision tasks, but they lack traditionally desired properties of extracted features that could further improve model performance, e.g., rotational equivariance. Such properties are ubiquitous in biomedical images, which often lack explicit orientation. While current work largely relies on data augmentation or explicit modules to capture orientation information, this comes at the expense of increased training costs or ineffective approximations of the desired equivariance. To overcome these challenges, we propose a novel and efficient implementation of the Symmetric Rotation-Equivariant (SRE) Convolution (SRE-Conv) kernel, designed to learn rotation-invariant features while simultaneously compressing the model size. The SRE-Conv kernel can easily be incorporated into any CNN backbone. We validate the ability of a deep SRE-CNN to capture equivariance to rotation using the public MedMNISTv2 dataset (16 tasks in total). SRE-Conv-CNN demonstrated improved rotated-image classification accuracy on all 16 test datasets, in both 2D and 3D images, while also increasing efficiency with fewer parameters and a reduced memory footprint. The code is available at this https URL.
https://arxiv.org/abs/2501.09753
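To make the kernel idea concrete, here is a minimal PyTorch sketch of a rotation-symmetric convolution in the spirit of SRE-Conv: kernel weights are shared across concentric rings, so each filter stores one weight per radius instead of k*k weights. This is an illustrative reading of the abstract, not the authors' released implementation, and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricConv2d(nn.Module):
    """Toy rotation-symmetric conv: weights depend only on the distance
    from the kernel centre, so each filter is (approximately) invariant
    to input rotation, and the parameter count per (out, in) channel
    pair drops from k*k to the number of distinct radii."""

    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__()
        k = kernel_size
        yy, xx = torch.meshgrid(
            torch.arange(k) - k // 2, torch.arange(k) - k // 2, indexing="ij"
        )
        radius = torch.sqrt(xx.float() ** 2 + yy.float() ** 2)
        rings = torch.unique(radius)                      # one weight per ring
        ring_idx = (radius.unsqueeze(-1) == rings).float()  # (k, k, n_rings)
        self.register_buffer("ring_idx", ring_idx)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, len(rings)) * 0.1)
        self.padding = k // 2

    def forward(self, x):
        # Expand per-ring weights into a full k x k kernel, then convolve.
        kernel = torch.einsum("oir,hwr->oihw", self.weight, self.ring_idx)
        return F.conv2d(x, kernel, padding=self.padding)
```

For a 5x5 kernel there are only six distinct radii, so each channel pair stores 6 weights rather than 25, which is where the parameter and memory savings in such a scheme come from.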
For privacy and security reasons, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deletion is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection, and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Code has been released at this https URL.
https://arxiv.org/abs/2501.09705
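A hedged sketch of the group-sparsity mechanism described above: LoRA adapters attached to linear layers (e.g., the FFN projections), with a group-lasso penalty whose minimization drives entire adapter groups to exactly zero. The rank and penalty weight are illustrative, and the full GS-LoRA/GS-LoRA++ objective (forgetting loss, prototype terms) is not reproduced here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

def group_sparse_penalty(lora_layers):
    """Group-lasso penalty: sum over groups of the L2 norm of each group's
    LoRA parameters. Minimizing it zeroes out whole groups, i.e. deselects
    those adapters, in the spirit of GS-LoRA's automatic group selection."""
    return sum(torch.sqrt((m.A ** 2).sum() + (m.B ** 2).sum() + 1e-12)
               for m in lora_layers)

# Hypothetical usage inside a forgetting step:
#   loss = forgetting_loss + remaining_loss + lam * group_sparse_penalty(loras)
```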
Electroencephalogram (EEG) signals have emerged as a promising modality for biometric identification. While previous studies have explored the use of imagined speech with semantically meaningful words for subject identification, most have relied on additional visual or auditory cues. In this study, we introduce a cueless EEG-based imagined speech paradigm, where subjects imagine the pronunciation of semantically meaningful words without any external cues. This innovative approach addresses the limitations of prior methods by requiring subjects to naturally select and imagine words from a predefined list. The dataset comprises over 4,350 trials from 11 subjects across five sessions. We assess a variety of classification methods, including traditional machine learning techniques such as Support Vector Machines (SVM) and XGBoost, as well as time-series foundation models and deep learning architectures specifically designed for EEG classification, such as EEG Conformer and Shallow ConvNet. A session-based hold-out validation strategy was employed to ensure reliable evaluation and prevent data leakage. Our results demonstrate outstanding classification accuracy, reaching 97.93%. These findings highlight the potential of cueless EEG paradigms for secure and reliable subject identification in real-world applications, such as brain-computer interfaces (BCIs).
https://arxiv.org/abs/2501.09700
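The session-based hold-out strategy can be illustrated with scikit-learn's LeaveOneGroupOut, treating each recording session as a group so that no session contributes to both training and testing. The arrays below are synthetic placeholders, not the paper's EEG features.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: X (trials x features), y (word labels), and a session
# id per trial. Holding out whole sessions prevents session-specific
# artifacts from leaking into the test fold.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
y = rng.integers(0, 5, size=200)
sessions = np.repeat(np.arange(5), 40)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
accs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sessions):
    clf.fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(f"mean session-held-out accuracy: {np.mean(accs):.3f}")
```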
The pivotal shift from traditional paper-based records to sophisticated Electronic Health Records (EHR) enabled the systematic collection and analysis of patient data through descriptive statistics, providing insight into patterns and trends across patient populations. This evolution continued toward predictive analytics, allowing healthcare providers to anticipate patient outcomes and potential complications before they occur. This progression from basic digital record-keeping to sophisticated predictive modelling and digital twins reflects healthcare's broader evolution toward more integrated, patient-centred approaches that combine data-driven insights with personalized care delivery. This chapter explores the evolution and significance of healthcare information systems, beginning with an examination of the implementation of EHR in the UK and the USA. It provides a comprehensive overview of the International Classification of Diseases (ICD) system, tracing its development from ICD-9 to ICD-10. Central to this discussion is the MIMIC-III database, a landmark achievement in healthcare data sharing and arguably the most comprehensive critical care database freely available to researchers worldwide. MIMIC-III has democratized access to high-quality healthcare data, enabling unprecedented opportunities for research and analysis. The chapter examines its structure, clinical outcome analysis capabilities, and practical applications through case studies, with a particular focus on mortality and length of stay metrics, vital signs extraction, and ICD coding. Through detailed entity-relationship diagrams and practical examples, the text illustrates MIMIC's complex data structure and demonstrates how different querying approaches can lead to subtly different results, emphasizing the critical importance of understanding the database's architecture for accurate data extraction.
https://arxiv.org/abs/2501.09640
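As a concrete taste of the kind of extraction the chapter walks through, here is a sketch of length-of-stay and in-hospital mortality computed from the MIMIC-III ADMISSIONS table with pandas. Column names follow the public MIMIC-III schema but should be verified against your local copy; the negative-LOS filter is one example of a querying choice that subtly changes the resulting statistics.

```python
import pandas as pd

# Sketch only: assumes a local CSV export of the MIMIC-III ADMISSIONS table.
adm = pd.read_csv("ADMISSIONS.csv", parse_dates=["ADMITTIME", "DISCHTIME"])

# Length of stay in days, from admission to discharge timestamps.
adm["LOS_DAYS"] = (adm["DISCHTIME"] - adm["ADMITTIME"]).dt.total_seconds() / 86400

# Dropping negative LOS rows (data-entry quirks) is one of the subtle
# querying decisions that shifts downstream numbers.
adm = adm[adm["LOS_DAYS"] >= 0]

print("median LOS (days):", adm["LOS_DAYS"].median())
print("in-hospital mortality rate:", adm["HOSPITAL_EXPIRE_FLAG"].mean())
```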
Face recognition technology has dramatically transformed the landscape of security, surveillance, and authentication systems, offering a user-friendly and non-invasive biometric solution. However, despite its significant advantages, face recognition systems face increasing threats from physical and digital spoofing attacks. Current research typically treats face recognition and attack detection as distinct classification challenges. This approach necessitates the implementation of separate models for each task, leading to considerable computational complexity, particularly on devices with limited resources. Such inefficiencies can stifle scalability and hinder performance. In response to these challenges, this paper introduces an innovative unified model designed for face recognition and detection of physical and digital attacks. By leveraging the advanced Swin Transformer backbone and incorporating HiLo attention in a convolutional neural network framework, we address unified face recognition and spoof attack detection more effectively. Moreover, we introduce augmentation techniques that replicate the traits of physical and digital spoofing cues, significantly enhancing our model's robustness. Through comprehensive experimental evaluation across various datasets, we showcase the effectiveness of our model in unified face recognition and spoof detection. Additionally, we confirm its resilience against unseen physical and digital spoofing attacks, underscoring its potential for real-world applications.
https://arxiv.org/abs/2501.09635
Code-switching, the alternation of languages within a single discourse, presents a significant challenge for Automatic Speech Recognition. Despite the unique nature of the task, performance is commonly measured with established metrics such as Word Error Rate (WER). However, in this paper, we question whether these general metrics accurately assess performance on code-switching. Specifically, using both Connectionist Temporal Classification and Encoder-Decoder models, we show that fine-tuning on non-code-switched data from both the matrix and the embedded language improves classical metrics on code-switching test sets, although performance on the actual code-switched words worsens (as expected). Therefore, we propose the Point-of-Interest Error Rate (PIER), a variant of WER that focuses only on specific words of interest. We instantiate PIER on code-switched utterances and show that it more accurately describes code-switching performance, revealing huge room for improvement in future work. This focused evaluation allows for a more precise assessment of model performance, particularly in challenging aspects such as inter-word and intra-word code-switching.
https://arxiv.org/abs/2501.09512
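A simplified sketch of the PIER idea, an error rate computed only over reference words of interest (here, the code-switched words), is given below. It counts substitutions and deletions of interest words after a standard Levenshtein alignment; the published metric may handle edge cases such as insertions differently.

```python
def align(ref, hyp):
    """Levenshtein alignment; returns ops as (op, ref_idx, hyp_idx)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            ops.append(("ok" if ref[i-1] == hyp[j-1] else "sub", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            ops.append(("del", i - 1, None)); i -= 1
        else:
            ops.append(("ins", None, j - 1)); j -= 1
    return ops

def pier(ref, hyp, interest):
    """Error rate counted only on reference words in `interest`."""
    errors = sum(1 for op, ri, _ in align(ref, hyp)
                 if op in ("sub", "del") and ref[ri] in interest)
    denom = sum(w in interest for w in ref)
    return errors / max(denom, 1)

# Toy example: only the code-switched word counts toward the metric.
ref = "ich habe das gestern gedownloadet".split()
hyp = "ich habe das gestern geladen".split()
print(pier(ref, hyp, interest={"gedownloadet"}))  # 1.0
```

Note that the surrounding matrix-language words are correct, so plain WER would look decent (0.2) even though the point of interest is entirely wrong, which is exactly the mismatch the paper highlights.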
This study presents a comprehensive review of the potential of multimodal deep learning (DL) in medical diagnosis, using COVID-19 as a case example. Motivated by the success of artificial intelligence applications during the COVID-19 pandemic, this research aims to uncover the capabilities of DL in disease screening, prediction, and classification, and to derive insights that enhance the resilience, sustainability, and inclusiveness of science, technology, and innovation systems. Adopting a systematic approach, we investigate the fundamental methodologies, data sources, preprocessing steps, and challenges encountered in various studies and implementations. We explore the architecture of deep learning models, emphasising their data-specific structures and underlying algorithms. Subsequently, we compare different deep learning strategies utilised in COVID-19 analysis, evaluating them based on methodology, data, performance, and prerequisites for future research. By examining diverse data types and diagnostic modalities, this research contributes to scientific understanding and knowledge of the multimodal application of DL and its effectiveness in diagnosis. We have implemented and analysed 11 deep learning models using COVID-19 image, text, and speech (i.e., cough) data. Our analysis revealed that the MobileNet model achieved the highest accuracy of 99.97% for COVID-19 image data and 93.73% for speech data (i.e., cough). However, the BiGRU model demonstrated superior performance in COVID-19 text classification with an accuracy of 99.89%. The broader implications of this research suggest potential benefits for other domains and disciplines that could leverage deep learning techniques for image, text, and speech analysis.
https://arxiv.org/abs/2501.09506
Training deep neural networks requires datasets with a large number of annotated examples. The collection and annotation of these datasets is not only extremely expensive but also faces legal and privacy problems. These factors are a significant limitation for many real-world applications. To address this, we introduce HydraMix, a novel architecture that generates new image compositions by mixing multiple different images from the same class. HydraMix learns the fusion of the content of various images guided by a segmentation-based mixing mask in feature space and is optimized via a combination of unsupervised and adversarial training. Our data augmentation scheme allows the creation of models trained from scratch on very small datasets. We conduct extensive experiments on ciFAIR-10, STL-10, and ciFAIR-100. Additionally, we introduce a novel text-image metric to assess the generality of the augmented datasets. Our results show that HydraMix outperforms existing state-of-the-art methods for image classification on small datasets.
https://arxiv.org/abs/2501.09504
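A toy, pixel-space rendition of the mixing step, combining several same-class images through soft spatial masks that sum to one at every pixel, is sketched below. The actual HydraMix operates in feature space with learned segmentation-based masks and adversarial training, none of which is reproduced here.

```python
import torch

def mix_images(images, masks):
    """Combine K same-class images via per-pixel soft masks.
    images: (K, C, H, W); masks: (K, 1, H, W) unnormalized logits.
    Softmax over K makes the mask weights sum to one at each pixel."""
    weights = torch.softmax(masks, dim=0)
    return (weights * images).sum(dim=0)  # (C, H, W)

imgs = torch.rand(3, 3, 32, 32)    # three images from one class (toy data)
mask_logits = torch.randn(3, 1, 32, 32)  # hypothetical mask logits
mixed = mix_images(imgs, mask_logits)
```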
Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: this https URL.
https://arxiv.org/abs/2501.09436
The proliferation of Internet of Things (IoT) devices equipped with acoustic sensors necessitates robust acoustic scene classification (ASC) capabilities, even in noisy and data-limited environments. Traditional machine learning methods often struggle to generalize effectively under such conditions. To address this, we introduce Q-ASC, a novel Quantum-Inspired Acoustic Scene Classifier that leverages the power of quantum-inspired transformers. By integrating quantum concepts like superposition and entanglement, Q-ASC achieves superior feature learning and enhanced noise resilience compared to classical models. Furthermore, we introduce a Quantum Variational Autoencoder (QVAE) based data augmentation technique to mitigate the challenge of limited labeled data in IoT deployments. Extensive evaluations on the Tampere University of Technology (TUT) Acoustic Scenes 2016 benchmark dataset demonstrate that Q-ASC achieves remarkable accuracy between 68.3% and 88.5% under challenging conditions, outperforming state-of-the-art methods by over 5% in the best case. This research paves the way for deploying intelligent acoustic sensing in IoT networks, with potential applications in smart homes, industrial monitoring, and environmental surveillance, even in adverse acoustic environments.
https://arxiv.org/abs/2501.09394
We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as different bird species or dog breeds. Pre-trained ViTs such as DINO have shown remarkable capabilities to extract localized, informative features. However, saliency maps like Grad-CAM can hardly point out the traits: they often locate the whole object with a blurred, coarse heatmap, not the traits. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to the rescue. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes' images, i.e., traits. As such, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch, requiring only a modification of the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, in sharp contrast to other interpretable methods that design specific models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate Prompt-CAM's superior interpretation capability.
https://arxiv.org/abs/2501.09333
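The prompt-as-classifier mechanism can be sketched as follows, assuming a generic frozen encoder that maps token sequences (B, N, D) to (B, N, D). One learnable prompt per class is prepended, and class c is scored from prompt c's output; the attention of prompt c over patch tokens then serves as the class-specific trait map. The real Prompt-CAM builds on Visual Prompt Tuning with per-layer prompts, which this single-pass toy omits.

```python
import torch
import torch.nn as nn

class PromptCAMHead(nn.Module):
    """Minimal sketch of the Prompt-CAM idea (illustrative, not the
    authors' code). `encoder` is an assumed frozen token-sequence
    encoder: (B, N, D) -> (B, N, D)."""

    def __init__(self, encoder: nn.Module, num_classes: int, dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.head = nn.Linear(dim, 1)  # shared scorer over prompt outputs

    def forward(self, patch_tokens):                 # (B, N, D)
        B = patch_tokens.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)   # (B, C, D)
        out = self.encoder(torch.cat([prompts, patch_tokens], dim=1))
        prompt_out = out[:, : self.prompts.shape[0]]            # (B, C, D)
        return self.head(prompt_out).squeeze(-1)                # (B, C) logits
```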
In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at this https URL.
https://arxiv.org/abs/2501.09327
Nowadays, more and more images are available. Annotation and retrieval of these images pose classification problems, where each class is defined as the group of database images labelled with a common semantic label. Various systems have been proposed for content-based retrieval, as well as for image classification and indexing. In this paper, a hierarchical classification framework is proposed for bridging the semantic gap effectively and achieving multi-category image classification. A well-known pre-processing and post-processing method was used and applied to three problems: image segmentation, object identification, and image classification. The method was applied to classify single-object images from the Amazon and Google datasets. The classification was tested with four different classifiers: Bayes Network (BN), Random Forest (RF), Bagging, and Vote. The estimated classification accuracies ranged from 20% to 99% (using 10-fold cross-validation). The Bagging classifier presented the best performance, followed by the Random Forest classifier.
https://arxiv.org/abs/2501.09311
Few-shot learning in medical image classification presents a significant challenge due to the limited availability of annotated data and the complex nature of medical imagery. In this work, we propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework that leverages the capabilities of Large Vision-Language Models (LVLMs) for medical image analysis. HiCA introduces a two-stage fine-tuning strategy, combining domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels. We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound, achieving state-of-the-art performance in both few-shot and zero-shot settings. Further analyses demonstrate the robustness, generalizability, and interpretability of our method, with substantial improvements in performance compared to existing baselines. Our work highlights the potential of hierarchical contrastive strategies in adapting LVLMs to the unique challenges of medical imaging tasks.
https://arxiv.org/abs/2501.09294
Zero-shot recognition models require extensive training data for generalization. However, in zero-shot 3D classification, collecting 3D data and captions is costly and labor-intensive, posing a significant barrier compared to 2D vision. Recent advances in generative models have achieved unprecedented realism in synthetic data production, and recent research shows the potential for using generated data as training data. This naturally raises the question: can synthetic 3D data generated by generative models be used to expand limited 3D datasets? In response, we present a synthetic 3D dataset expansion method, Text-guided Geometric Augmentation (TeGA). TeGA is tailored for language-image-3D pretraining, which achieves state-of-the-art results in zero-shot 3D classification, and uses a generative text-to-3D model to enhance and extend limited 3D datasets. Specifically, we automatically generate text-guided synthetic 3D data and introduce a consistency filtering strategy to discard noisy samples whose semantics and geometric shapes do not match the text. In an experiment doubling the original dataset size with TeGA, our approach demonstrates improvements over the baselines, achieving zero-shot performance gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40. These results demonstrate that TeGA effectively bridges the 3D data gap, enabling robust zero-shot 3D classification even with limited real training data and paving the way for zero-shot 3D vision applications.
https://arxiv.org/abs/2501.09278
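The consistency-filtering step can be sketched as a similarity threshold between the prompt embedding and an embedding of the generated shape (e.g., of its renderings) in a shared text-3D space. The encoders and the threshold value below are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def consistency_filter(text_embs, shape_embs, threshold=0.25):
    """Keep a generated 3D sample only if its embedding is close enough
    to the embedding of the prompt that produced it. Both inputs are
    assumed to come from a shared CLIP-like text/3D embedding space;
    the threshold is a hypothetical, validation-tuned value."""
    sims = F.cosine_similarity(text_embs, shape_embs, dim=-1)
    keep = sims >= threshold
    return keep, sims

# Toy usage with random embeddings standing in for real encoders.
keep, sims = consistency_filter(torch.randn(8, 512), torch.randn(8, 512))
print(f"kept {int(keep.sum())} / {len(keep)} synthetic samples")
```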
Model compression through knowledge distillation has seen extensive application in classification and segmentation tasks. However, its potential in image-to-image translation, particularly in image restoration, remains underexplored. To address this gap, we propose a Simultaneous Learning Knowledge Distillation (SLKD) framework tailored for model compression in image restoration tasks. SLKD employs a dual-teacher, single-student architecture with two distinct learning strategies applied simultaneously: Degradation Removal Learning (DRL) and Image Reconstruction Learning (IRL). In DRL, the student encoder learns from Teacher A to focus on removing degradation factors, guided by a novel BRISQUE extractor. In IRL, the student decoder learns from Teacher B to reconstruct clean images, with the assistance of a proposed PIQE extractor. These strategies enable the student to learn from degraded and clean images simultaneously, ensuring high-quality compression of image restoration models. Experimental results across five datasets and three tasks demonstrate that SLKD achieves substantial reductions in FLOPs and parameters, exceeding 80%, while maintaining strong image restoration performance.
https://arxiv.org/abs/2501.09268
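The shape of the simultaneous dual-teacher objective might look like the sketch below, with plain MSE standing in for the feature- and output-matching terms; the BRISQUE- and PIQE-guided extractors from the paper are omitted, and the weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def slkd_loss(student_feats, teacher_a_feats, student_out, teacher_b_out,
              alpha=0.5):
    """Toy SLKD-style objective: the student encoder mimics Teacher A's
    degradation-removal features (DRL side) while the student decoder
    mimics Teacher B's clean reconstruction (IRL side), optimized
    simultaneously."""
    drl = F.mse_loss(student_feats, teacher_a_feats)  # encoder-side matching
    irl = F.mse_loss(student_out, teacher_b_out)      # decoder-side matching
    return alpha * drl + (1 - alpha) * irl

# Toy usage with random tensors standing in for real features/outputs.
loss = slkd_loss(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32),
                 torch.randn(2, 3, 128, 128), torch.randn(2, 3, 128, 128))
```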
Short text classification has gained significant attention in the information age due to its prevalence and real-world applications. Recent advancements in graph learning combined with contrastive learning have shown promising results in addressing the challenges of semantic sparsity and limited labeled data in short text classification. However, existing models have certain limitations. They rely on explicit data augmentation techniques to generate contrastive views, resulting in semantic corruption and noise. Additionally, these models only focus on learning the intrinsic consistency between the generated views, neglecting valuable discriminative information from other potential views. To address these issues, we propose a Simple graph contrastive learning framework for Short Text Classification (SimSTC). Our approach involves performing graph learning on multiple text-related component graphs to obtain multi-view text embeddings. Subsequently, we directly apply contrastive learning on these embeddings. Notably, our method eliminates the need for data augmentation operations to generate contrastive views while still leveraging the benefits of multi-view contrastive learning. Despite its simplicity, our model achieves outstanding performance, surpassing large language models on various datasets.
https://arxiv.org/abs/2501.09219
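The augmentation-free multi-view contrastive step can be sketched as a symmetric InfoNCE loss between embeddings of the same texts obtained from two different component graphs: row i in each view forms the positive pair, and all other rows serve as negatives. The temperature is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def multiview_info_nce(view_a, view_b, temperature=0.1):
    """Symmetric InfoNCE between two views' embeddings of the same batch
    of short texts (e.g., pooled from two component graphs, so no data
    augmentation is needed to form the contrastive views)."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for graph-derived views.
loss = multiview_info_nce(torch.randn(16, 128), torch.randn(16, 128))
```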
Accurate molecular quantification is essential for advancing research and diagnostics in fields such as infectious diseases, cancer biology, and genetic disorders. Droplet digital PCR (ddPCR) has emerged as a gold standard for achieving absolute quantification. While computational ddPCR technologies have advanced significantly, achieving automatic interpretation and consistent adaptability across diverse operational environments remains a challenge. To address these limitations, we introduce the intelligent interpretable droplet digital PCR (I2ddPCR) assay, a comprehensive framework integrating front-end predictive models (for droplet segmentation and classification) with the GPT-4o multimodal large language model (MLLM, for context-aware explanations and recommendations) to automate and enhance ddPCR image analysis. This approach surpasses state-of-the-art models, affording 99.05% accuracy in processing complex ddPCR images containing over 300 droplets per image with varying signal-to-noise ratios (SNRs). By combining specialized neural networks and large language models, the I2ddPCR assay offers a robust and adaptable solution for absolute molecular quantification, achieving a sensitivity capable of detecting low-abundance targets as low as 90.32 copies/µL. Furthermore, it improves the model's transparency through detailed explanations and troubleshooting guidance, empowering users to make informed decisions. This innovative framework has the potential to benefit molecular diagnostics, disease research, and clinical applications, especially in resource-constrained settings.
https://arxiv.org/abs/2501.09218
Time series classification (TSC) is fundamental in numerous domains, including finance, healthcare, and environmental monitoring. However, traditional TSC methods often struggle with the inherent complexity and variability of time series data. Building on our previous work with the linear law-based transformation (LLT) - which improved classification accuracy by transforming the feature space based on key data patterns - we introduce adaptive law-based transformation (ALT). ALT enhances LLT by incorporating variable-length shifted time windows, enabling it to capture distinguishing patterns of various lengths and thereby handle complex time series more effectively. By mapping features into a linearly separable space, ALT provides a fast, robust, and transparent solution that achieves state-of-the-art performance with only a few hyperparameters.
https://arxiv.org/abs/2501.09217
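The variable-length shifted-window step that distinguishes ALT from LLT can be sketched as below: each window length yields its own z-normalized view of the series, from which a law-based transformation would then extract patterns. Lengths and stride here are illustrative, not the paper's settings.

```python
import numpy as np

def variable_length_windows(series, lengths=(8, 16, 32), stride=4):
    """Slice a 1-D series into shifted windows of several lengths so that
    patterns of different scales each get their own view. Each window is
    z-normalized; parameters are illustrative placeholders."""
    views = {}
    for L in lengths:
        wins = np.stack([series[s:s + L]
                         for s in range(0, len(series) - L + 1, stride)])
        wins = (wins - wins.mean(axis=1, keepdims=True)) / (
            wins.std(axis=1, keepdims=True) + 1e-8)
        views[L] = wins                            # (num_windows, L)
    return views

views = variable_length_windows(np.sin(np.linspace(0, 12, 200)))
print({L: v.shape for L, v in views.items()})
```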
Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and insufficient labeled samples in practical scenarios. We propose a novel model named MI-DELIGHT for short text classification in this work. Specifically, it first performs multi-source information (i.e., statistical information, linguistic information, and factual information) exploration to alleviate the sparsity issues. Then, the graph learning approach is adopted to learn the representation of short texts, which are presented in graph forms. Moreover, we introduce a dual-level (i.e., instance-level and cluster-level) contrastive learning auxiliary task to effectively capture different-grained contrastive information within massive unlabeled data. Meanwhile, previous models merely perform the main task and auxiliary tasks in parallel, without considering the relationship among tasks. Therefore, we introduce a hierarchical architecture to explicitly model the correlations between tasks. We conduct extensive experiments across various benchmark datasets, demonstrating that MI-DELIGHT significantly surpasses previous competitive models. It even outperforms popular large language models on several datasets.
https://arxiv.org/abs/2501.09214