The appearance of surface impurities (e.g., water stains, fingerprints, stickers) is a frequently cited cause of performance degradation in automated visual inspection systems. At the same time, synthetic data generation techniques for visual surface inspection have focused primarily on generating perfect examples and defects, disregarding impurities. This study highlights the importance of considering impurities when generating synthetic data. We introduce a procedural method for including photorealistic water stains in synthetic data. The synthetic datasets are generated to correspond to real datasets and are then used to train an anomaly detection model and investigate the influence of water stains. The high-resolution images used for surface inspection lead to memory bottlenecks during anomaly detection training. To address this, we introduce Sequential PatchCore, a method that builds coresets sequentially and makes training on large images tractable on consumer-grade hardware. This also allows us to perform transfer learning using coresets pre-trained on different dataset versions. Our results show the benefits of using synthetic data to pre-train an explicit coreset anomaly model, and the further performance gains obtained by finetuning the coreset on real data. We observe how impurities and labelling ambiguity lower model performance, and we additionally report defect-wise recall to provide an industrially relevant perspective on model performance.
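As an illustration of the coreset idea, here is a minimal sketch of sequential greedy (k-center) coreset construction in the spirit of Sequential PatchCore; the function name, the per-chunk budget, and the random features are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def update_coreset(coreset, chunk, budget):
    """Greedily add up to `budget` rows of `chunk` to `coreset`,
    always picking the patch farthest from the current coreset (k-center)."""
    if coreset.size == 0:
        coreset, chunk = chunk[:1], chunk[1:]
    # distance of each candidate patch to its nearest coreset member
    d = np.min(np.linalg.norm(chunk[:, None, :] - coreset[None, :, :], axis=-1), axis=1)
    for _ in range(budget):
        far = int(np.argmax(d))                      # farthest-point choice
        coreset = np.vstack([coreset, chunk[far]])
        # only the newly added member can tighten nearest-member distances
        d = np.minimum(d, np.linalg.norm(chunk - chunk[far], axis=1))
    return coreset

coreset = np.empty((0, 64))
for _ in range(10):                # stream feature chunks, e.g. one image at a time
    chunk = np.random.randn(300, 64).astype(np.float32)
    coreset = update_coreset(coreset, chunk, budget=20)
print(coreset.shape)               # (201, 64): 1 seed + 10 * 20 additions
```

Because each chunk is discarded after its contribution is folded into the coreset, peak memory scales with the chunk size rather than with the full patch bank, which is what makes large images tractable.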
https://arxiv.org/abs/2501.09579
AI algorithms have become valuable aids to healthcare professionals, and the growing reliability of these models makes them useful in critical decision-making. In clinical dermatology, classification models can detect malignant skin lesions using only RGB images as input. However, most learning-based methods are trained on dermoscopic datasets, which are large and validated against a gold standard. Clinical models, in contrast, must classify images from users' smartphone cameras, which lack the resolution that dermoscopy provides. Clinical applications also bring new challenges: captures from uncontrolled environments, skin tone variations, viewpoint changes, noise in data and labels, and unbalanced classes. A possible alternative is to apply transfer learning to the clinical images; however, because the number of clinical samples is low and the source distribution used in training differs from the test set, the model's performance can degrade. This work evaluates the gap between dermoscopic and clinical samples and examines how these dataset variations affect training. It identifies the main differences between the distributions that disturb the model's predictions. Finally, drawing on experiments with different architectures, we discuss how to combine data from divergent distributions while limiting the impact on the model's final accuracy.
https://arxiv.org/abs/2501.08962
The population of our agricultural nation, surrounded by lush greenery, is growing daily. As a result, arable land is shrinking as it gives way to residential housing and industrial factories. A food crisis is becoming the main threat in the coming days: on the one hand, the population is increasing; on the other, food crop production is decreasing due to disease. Rice is one of the most significant cultivated crops, since it provides food for more than half of the world's population. Bangladesh depends on rice (Oryza sativa) as a vital agricultural crop, but the ongoing decline in rice yield caused by common diseases poses a significant problem. Early disease detection is the main difficulty in rice cultivation. In this paper, we present our own dataset, collected in the field in Bangladesh, and apply deep learning and transfer learning models to evaluate it. We describe the dataset in detail and give directions for further research that could serve society using it. We applied a light CNN model and pre-trained InceptionNet-V2, EfficientNet-V2, and MobileNet-V2 models; EfficientNet-V2 achieved the best performance in this work, at 91.5%. These results surpass the other models tested and even exceed approaches considered state of the art. This study demonstrates that diseases affecting rice leaves can be identified precisely and effectively using this unbiased dataset. After analyzing the performance of the different models, we conclude that the proposed dataset is a significant resource for research toward reducing rice leaf disease.
https://arxiv.org/abs/2501.08912
Autonomous unmanned aerial vehicles (UAVs) integrated with edge computing capabilities empower real-time data processing directly on the device, dramatically reducing latency in critical scenarios such as wildfire detection. This study underscores Transfer Learning's (TL) significance in boosting the performance of object detectors for identifying wildfire smoke and flames, especially when trained on limited datasets, and investigates the impact TL has on edge computing metrics. The latter focuses on how TL-enhanced You Only Look Once (YOLO) models perform in terms of inference time, power usage, and energy consumption on edge computing devices. This study utilizes the Aerial Fire and Smoke Essential (AFSE) dataset as the target, with the Flame and Smoke Detection Dataset (FASDD) and the Microsoft Common Objects in Context (COCO) dataset serving as source datasets. We explore a two-stage cascaded TL method, utilizing D-Fire or FASDD as initial-stage target datasets and AFSE as the subsequent stage. Through fine-tuning, TL significantly enhances detection precision, achieving up to 79.2% mean Average Precision (mAP@0.5), reduces training time, and increases model generalizability across the AFSE dataset. However, cascaded TL yielded no notable improvements, and TL alone did not benefit the edge computing metrics evaluated. Lastly, this work found that YOLOv5n remains a powerful model when hardware acceleration is lacking, processing images nearly twice as fast as its newer counterpart, YOLO11n. Overall, the results affirm TL's role in augmenting the accuracy of object detectors while also illustrating that additional enhancements are needed to improve edge computing performance.
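For readers who want the general recipe, the following hedged sketch shows a two-stage cascaded fine-tune with the ultralytics API; `fasdd.yaml` and `afse.yaml` are hypothetical dataset configurations you would write yourself, the checkpoint path relies on ultralytics' default output layout, and the hyperparameters are illustrative.

```python
from ultralytics import YOLO

# Stage 1: start from COCO-pretrained weights, fine-tune on the
# intermediate fire/smoke dataset (FASDD or D-Fire in the paper).
model = YOLO("yolo11n.pt")
model.train(data="fasdd.yaml", epochs=50, imgsz=640)

# Stage 2: continue from the stage-1 checkpoint, fine-tune on the target AFSE set.
model = YOLO("runs/detect/train/weights/best.pt")
model.train(data="afse.yaml", epochs=50, imgsz=640)
model.val(data="afse.yaml")  # reports mAP@0.5 among other metrics
```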
https://arxiv.org/abs/2501.08639
In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it streamlines transfer learning costs and optimizes hardware utilization. However, current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they remain at the level of aligned encoders (e.g., CLIP) and do not address misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks. Our project is available at \url{this https URL}.
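The dense interconnection idea can be illustrated with a small PyTorch sketch in which each low-rank adapter consumes the outputs of all preceding adapters; the dimensions and wiring below are assumptions for illustration, not DETRIS itself.

```python
import torch
import torch.nn as nn

class DenseLowRankAdapter(nn.Module):
    def __init__(self, dim, rank, n_prev):
        super().__init__()
        self.down = nn.Linear(dim * (n_prev + 1), rank)  # fuse current + all earlier states
        self.up = nn.Linear(rank, dim)
        self.act = nn.GELU()

    def forward(self, x, prev_states):
        fused = torch.cat([x] + prev_states, dim=-1)
        return x + self.up(self.act(self.down(fused)))   # residual low-rank update

dim, rank, depth = 768, 16, 4
adapters = nn.ModuleList(DenseLowRankAdapter(dim, rank, i) for i in range(depth))

h = torch.randn(2, 197, dim)   # e.g. ViT tokens from a frozen backbone block
states = []
for adapter in adapters:
    h = adapter(h, states)     # in practice h would also pass through frozen blocks
    states.append(h)
print(h.shape)                 # torch.Size([2, 197, 768])
```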
https://arxiv.org/abs/2501.08580
Deep Learning for medical imaging faces challenges in adapting and generalizing to new contexts. Additionally, it often lacks sufficient labeled data for specific tasks, which require significant annotation effort. Continual Learning (CL) tackles adaptability and generalizability by enabling lifelong learning from a data stream while mitigating the forgetting of previously learned knowledge. Active Learning (AL) reduces the number of annotations required for effective training. This work combines both approaches (CAL) to develop a novel framework for robust medical image analysis. Based on the automatic recognition of shifts in image characteristics, the Replay-Base Architecture for Context Adaptation (RBACA) employs a CL rehearsal method to continually learn from diverse contexts, and an AL component to select the most informative instances for annotation. A novel approach to evaluating CAL methods is established using a metric termed the IL-Score, which allows for the simultaneous assessment of transfer learning, forgetting, and final model performance. We show that RBACA works in domain- and class-incremental learning scenarios by assessing its IL-Score on the segmentation and diagnosis of cardiac images. The results show that RBACA outperforms a baseline framework without CAL, as well as a state-of-the-art CAL method, across various memory sizes and annotation budgets. Our code is available at this https URL.
https://arxiv.org/abs/2501.08245
Accurate human posture classification in images and videos is crucial for automated applications across various fields, including work safety, physical rehabilitation, sports training, and daily assisted living. Recently, multimodal learning methods, such as Contrastive Language-Image Pretraining (CLIP), have advanced significantly in jointly understanding images and text. This study aims to assess the effectiveness of CLIP in classifying human postures, focusing on its application in yoga. Despite the initial limitations of the zero-shot approach, applying transfer learning on 15,301 images (real and synthetic) with 82 classes has shown promising results. The article describes the full fine-tuning procedure, including the choice of image description syntax and the adjustment of models and hyperparameters. The fine-tuned CLIP model, tested on 3826 images, achieves an accuracy of over 85%, surpassing the current state of the art reported on the same dataset by approximately 6%, while its training time is 3.5 times lower than that needed to fine-tune a YOLOv8-based model. For more application-oriented scenarios, with smaller datasets of six postures each, containing 1301 and 401 training images, the fine-tuned models attain accuracies of 98.8% and 99.1%, respectively. Furthermore, our experiments indicate that training with as few as 20 images per pose can yield around 90% accuracy in a six-class dataset. This study demonstrates that this multimodal technique can be effectively used for yoga pose classification, and possibly for human posture classification in general. Additionally, CLIP inference time (around 7 ms) supports integrating the model into automated systems for posture evaluation, e.g., for developing a real-time personal yoga assistant for performance assessment.
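A minimal sketch of the CLIP classification setup, using the Hugging Face transformers API; the prompt template and class names are illustrative assumptions, and fine-tuning would simply backpropagate a loss through these logits instead of using them zero-shot.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["downward dog", "tree pose", "warrior II"]   # the paper uses 82 classes
prompts = [f"a photo of a person doing the {c} yoga pose" for c in classes]

image = Image.new("RGB", (224, 224))                    # stand-in for a real photo
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image           # scaled cosine similarities
pred = classes[logits.argmax(-1).item()]
```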
https://arxiv.org/abs/2501.07221
Predicting the impact of single-point amino acid mutations on protein stability is essential for understanding disease mechanisms and advancing drug development. Protein stability, quantified by changes in Gibbs free energy ($\Delta\Delta G$), is influenced by these mutations. However, the scarcity of data and the complexity of model interpretation pose challenges in accurately predicting stability changes. This study proposes the application of deep neural networks, leveraging transfer learning and fusing complementary information from different models, to create a feature-rich representation of the protein stability landscape. We developed four models, with our third model, ThermoMPNN+, demonstrating the best performance in predicting $\Delta\Delta G$ values. This approach, which integrates diverse feature sets and embeddings through latent transfusion techniques, aims to refine $\Delta\Delta G$ predictions and contribute to a deeper understanding of protein dynamics, potentially leading to advancements in disease research and drug discovery.
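The fusion idea can be sketched as a small regression head over concatenated embeddings from two frozen models; the embedding sizes and head architecture below are assumptions for illustration, not ThermoMPNN+ itself.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, dim_a, dim_b, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                 # scalar ddG prediction
        )

    def forward(self, emb_a, emb_b):
        return self.mlp(torch.cat([emb_a, emb_b], dim=-1)).squeeze(-1)

head = FusionHead(dim_a=128, dim_b=1280)          # e.g. structure model + protein LM
emb_a = torch.randn(8, 128)                       # per-mutation embedding, frozen model A
emb_b = torch.randn(8, 1280)                      # per-mutation embedding, frozen model B
ddg_pred = head(emb_a, emb_b)                     # shape (8,)
loss = nn.functional.mse_loss(ddg_pred, torch.randn(8))
```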
https://arxiv.org/abs/2501.07014
Despite the artificial intelligence (AI) revolution, deep learning has yet to achieve much success with tabular data, owing to heterogeneous feature spaces and limited sample sizes without viable transfer learning. The new era of generative AI, powered by large language models (LLMs), brings unprecedented learning opportunities to diverse data and domains. This paper investigates the effectiveness of an LLM application programming interface (API) and of LLM transfer learning in tabular data classification. LLM APIs respond to input text prompts with tokenized data and instructions, whereas transfer learning finetunes an LLM for a target classification task. This paper proposes end-to-end finetuning of an LLM to demonstrate cross-data transfer learning on ten benchmark datasets in the absence of large pre-trained tabular data models that could facilitate transfer learning. The proposed LLM finetuning method outperforms state-of-the-art machine and deep learning methods on tabular data with fewer than ten features, a standard feature size for tabular datasets. The transfer learning approach uses a fraction of the computational cost of other deep learning or API-based solutions while ensuring competitive or superior classification performance.
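Both routes rest on serializing each tabular row into a text prompt; the template below is an illustrative assumption, not the paper's exact protocol. For the finetuning route, an LLM would then be trained on such prompt/label pairs with a standard language-modelling loss.

```python
def row_to_prompt(row: dict, target_name: str) -> str:
    # flatten one tabular record into natural-language feature statements
    features = "; ".join(f"{k} is {v}" for k, v in row.items())
    return f"Record: {features}. Question: what is the {target_name}? Answer:"

row = {"age": 52, "cholesterol": 233, "max heart rate": 150}
print(row_to_prompt(row, "heart disease diagnosis"))
# -> Record: age is 52; cholesterol is 233; max heart rate is 150.
#    Question: what is the heart disease diagnosis? Answer:
```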
https://arxiv.org/abs/2501.06863
In nations such as Bangladesh, agriculture plays a vital role in providing livelihoods for a significant portion of the population. Identifying and classifying plant diseases early is critical to prevent their spread and minimize their impact on crop yield and quality. Various computer vision techniques can be used for such detection and classification. While CNNs have dominated such image classification tasks, vision transformers have recently become equally competitive. In this paper, we study various computer vision techniques for Bangladeshi rice leaf disease detection. We use Dhan-Shomadhan, a Bangladeshi rice leaf disease dataset, to experiment with various CNN and ViT models. We also compare the performance of these deep neural network architectures with a traditional machine learning architecture, the Support Vector Machine (SVM). We leverage transfer learning for better generalization with smaller amounts of training data. Among the models tested, ResNet50 exhibited the best performance over the other CNN and transformer-based models, making it the optimal choice for this task.
https://arxiv.org/abs/2501.06740
Large-N nationally representative surveys, which have profoundly shaped American politics scholarship, represent related but distinct domains, a key condition for transfer learning applications. These surveys are related through their shared demographic, party identification, and ideological variables, yet differ in that individual surveys often lack specific policy preference questions that researchers require. Our study introduces a novel application of transfer learning (TL) to address these gaps, marking the first systematic use of TL paradigms in the context of survey data. Specifically, models pre-trained on the Cooperative Election Study (CES) dataset are fine-tuned for use on the American National Election Studies (ANES) dataset to predict policy questions from demographic variables. Even with a naive architecture, our transfer learning approach achieves approximately 92 percent accuracy in predicting missing variables across surveys, demonstrating the robust potential of this method. Beyond this specific application, our paper argues that transfer learning is a promising framework for maximizing the utility of existing survey data. We contend that artificial intelligence, particularly transfer learning, opens new frontiers in social science methodology by enabling systematic knowledge transfer between well-administered surveys that share common variables but differ in their outcomes of interest.
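A minimal sketch of the cross-survey setup: pretrain a classifier on the large source survey, then fine-tune only the output head on the target survey. All tensors below are synthetic stand-ins for CES and ANES, and the architecture is deliberately naive.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 2))

def fit(model, x, y, epochs=100, lr=1e-3):
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()

# Pretrain on the large source survey (CES stand-in): demographics -> policy answer.
x_ces, y_ces = torch.randn(5000, 12), torch.randint(0, 2, (5000,))
fit(net, x_ces, y_ces)

# Freeze the shared demographic encoder, fine-tune the head on the target (ANES stand-in).
for p in net[0].parameters():
    p.requires_grad = False
x_anes, y_anes = torch.randn(300, 12), torch.randint(0, 2, (300,))
fit(net, x_anes, y_anes, epochs=50)
```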
https://arxiv.org/abs/2501.06577
Cloud removal plays a crucial role in enhancing remote sensing image analysis, yet accurately reconstructing cloud-obscured regions remains a significant challenge. Recent advancements in generative models have made the generation of realistic images increasingly accessible, offering new opportunities for this task. Given the conceptual alignment between image generation and cloud removal tasks, generative models present a promising approach for addressing cloud removal in remote sensing. In this work, we propose a deep transfer learning approach built on a generative adversarial network (GAN) framework to explore the potential of the novel masked autoencoder (MAE) image reconstruction model in cloud removal. Due to the complexity of remote sensing imagery, we further propose using a patch-wise discriminator to determine whether each patch of the image is real or not. The proposed reconstructive transfer learning approach demonstrates significant improvements in cloud removal performance compared to other GAN-based methods. Additionally, whilst direct comparisons with some of the state-of-the-art cloud removal techniques are limited due to unclear details regarding their train/test data splits, the proposed model achieves competitive results based on available benchmarks.
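One common way to realize a patch-wise discriminator is the PatchGAN design, sketched below; the channel widths and depth are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, norm=True):
    layers = [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(c_out))
    layers.append(nn.LeakyReLU(0.2))
    return layers

patch_disc = nn.Sequential(
    *conv_block(3, 64, norm=False),
    *conv_block(64, 128),
    *conv_block(128, 256),
    nn.Conv2d(256, 1, 4, padding=1),   # one real/fake logit per receptive-field patch
)

x = torch.randn(1, 3, 256, 256)        # a candidate cloud-free reconstruction
logits = patch_disc(x)                 # grid of per-patch logits, not one global score
print(logits.shape)                    # torch.Size([1, 1, 31, 31])
```

Scoring patches rather than whole images gives the generator a denser training signal, which suits the locally varying texture of remote sensing imagery.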
https://arxiv.org/abs/2501.05265
In the medical field, accurate diagnosis of lung cancer is crucial for treatment. Traditional manual analysis methods have significant limitations in terms of accuracy and efficiency. To address this issue, this paper proposes a deep learning network framework based on the pre-trained MobileNetV2 model, initialized with weights from the ImageNet-1K dataset (version 2). The last layer of the model (the fully connected layer) is replaced with a new fully connected layer, and a softmax activation function is added to efficiently classify three types of lung cancer CT scan images. Experimental results show that the model achieves an accuracy of 99.6% on the test set, with significant improvements in feature extraction compared to traditional methods. With the rapid development of artificial intelligence technologies, deep learning applications in medical image processing are bringing revolutionary changes to the healthcare industry. AI-based lung cancer detection systems can significantly improve diagnostic efficiency, reduce the workload of doctors, and occupy an important position in the global healthcare market. The potential of AI to improve diagnostic accuracy, reduce medical costs, and promote precision medicine will have a profound impact on the future development of the healthcare industry.
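The described head replacement is straightforward in torchvision; the sketch below assumes the standard `IMAGENET1K_V2` weights (matching the "version 2" in the abstract) and leaves the freezing policy and input pipeline unspecified.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

model = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V2)  # ImageNet-1K, version 2
model.classifier[1] = nn.Linear(model.last_channel, 3)            # new 3-way FC head

x = torch.randn(1, 3, 224, 224)                  # a preprocessed CT slice
probs = torch.softmax(model(x), dim=-1)          # softmax over the 3 classes at inference
```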
https://arxiv.org/abs/2501.04996
As opposed to human drivers, current autonomous driving systems still require vast amounts of labeled data to train. Recently, world models have been proposed to simultaneously enhance autonomous driving capabilities by improving the way these systems understand complex real-world environments and reduce their data demands via self-supervised pre-training. In this paper, we present AD-L-JEPA (aka Autonomous Driving with LiDAR data via a Joint Embedding Predictive Architecture), a novel self-supervised pre-training framework for autonomous driving with LiDAR data that, as opposed to existing methods, is neither generative nor contrastive. Our method learns spatial world models with a joint embedding predictive architecture. Instead of explicitly generating masked unknown regions, our self-supervised world models predict Bird's Eye View (BEV) embeddings to represent the diverse nature of autonomous driving scenes. Our approach furthermore eliminates the need to manually create positive and negative pairs, as is the case in contrastive learning. AD-L-JEPA leads to simpler implementation and enhanced learned representations. We qualitatively and quantitatively demonstrate the high quality of the embeddings learned with AD-L-JEPA. We furthermore evaluate the accuracy and label efficiency of AD-L-JEPA on popular downstream tasks such as LiDAR 3D object detection and associated transfer learning. Our experimental evaluation demonstrates that AD-L-JEPA is a plausible approach for self-supervised pre-training in autonomous driving applications and the best available approach, outperforming state-of-the-art methods including the recently proposed Occupancy-MAE [1] and ALSO [2]. The source code of AD-L-JEPA is available at this https URL.
https://arxiv.org/abs/2501.04969
This paper presents a novel approach to automatic Cued Speech generation (ACSG). Cued Speech (CS) is a visual communication system used by people with hearing impairment to better elicit spoken language. We explore transfer learning strategies by leveraging a pre-trained audiovisual autoregressive text-to-speech model (AVTacotron2), which is reprogrammed to infer CS hand and lip movements from text input. Experiments are conducted on two publicly available datasets, including one recorded specifically for this study. Performance is assessed using an automatic CS recognition system. With decoding accuracy at the phonetic level reaching approximately 77%, the results demonstrate the effectiveness of our approach.
https://arxiv.org/abs/2501.04799
Despite widespread adoption of deep learning models to address a variety of computer vision tasks, planetary science has yet to see extensive utilization of such tools to address its unique problems. On Titan, the largest moon of Saturn, tracking seasonal trends and weather patterns of clouds provides crucial insights into one of the most complex climates in the Solar System, yet much of the available image data are still analyzed in a conventional way. In this work, we apply a Mask R-CNN trained via transfer learning to perform instance segmentation of clouds in Titan images acquired by the Cassini spacecraft - a previously unexplored approach to a big data problem in planetary science. We demonstrate that an automated technique can provide quantitative measures for clouds, such as areas and centroids, that may otherwise be prohibitively time-intensive to produce by human mapping. Furthermore, despite Titan specific challenges, our approach yields accuracy comparable to contemporary cloud identification studies on Earth and other worlds. We compare the efficiencies of human-driven versus algorithmic approaches, showing that transfer learning provides speed-ups that may open new horizons for data investigation for Titan. Moreover, we suggest that such approaches have broad potential for application to similar problems in planetary science where they are currently under-utilized. Future planned missions to the planets and remote sensing initiatives for the Earth promise to provide a deluge of image data in the coming years that will benefit strongly from leveraging machine learning approaches to perform the analysis.
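Adapting torchvision's COCO-pretrained Mask R-CNN to a new instance-segmentation problem follows a standard recipe, sketched below for a hypothetical background-plus-cloud label set; this mirrors the torchvision fine-tuning tutorial, not the authors' exact code.

```python
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + cloud
model = maskrcnn_resnet50_fpn(weights="DEFAULT")

# Swap the box head for the new class count.
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes)

# Swap the mask head likewise.
in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256, num_classes)
# From here, train on (image, {"boxes", "labels", "masks"}) targets as usual;
# per-instance masks directly yield cloud areas and centroids.
```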
https://arxiv.org/abs/2501.04459
Recognizing the same faces with and without masks is important for ensuring consistent identification in security, access control, and public safety. This capability is crucial in scenarios like law enforcement, healthcare, and surveillance, where accurate recognition must be maintained despite facial occlusion. This research focuses on the challenge of recognizing the same faces with and without masks by employing cosine similarity as the primary technique. With the increased use of masks, traditional facial recognition systems face significant accuracy issues, making it crucial to develop methods that can reliably identify individuals in masked conditions. For that reason, this study proposes the Masked-Unmasked Face Matching Model (MUFM). This model employs transfer learning, using the Visual Geometry Group (VGG16) model to extract significant facial features, which are subsequently classified with the K-Nearest Neighbors (K-NN) algorithm. The cosine similarity metric is employed to compare masked and unmasked faces of the same individuals. This approach represents a novel contribution, as the task of recognizing the same individual with and without a mask using cosine similarity has not been previously addressed. By integrating these methodologies, the research demonstrates effective identification of individuals despite the presence of masks, addressing a significant limitation of traditional systems. Data collection is another essential part of this work: we assembled and prepared an image dataset from three different sources, part of which consists of real-world captures, giving the research comprehensive coverage. The images were drawn from three existing datasets of masked and unmasked photographs of the same faces.
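A minimal sketch of the matching pipeline as described: VGG16 penultimate-layer features compared by cosine similarity; the decision threshold and preprocessing details are illustrative assumptions, and a K-NN search over a gallery of embeddings would replace the pairwise check in the full model.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

weights = VGG16_Weights.IMAGENET1K_V1
backbone = vgg16(weights=weights).eval()
backbone.classifier = backbone.classifier[:-1]   # keep 4096-d penultimate features
preprocess = weights.transforms()

def embed(img):                                   # img: a PIL.Image face crop
    with torch.no_grad():
        return backbone(preprocess(img).unsqueeze(0)).squeeze(0)

def same_person(face_masked, face_unmasked, threshold=0.8):
    # cosine similarity between the two face embeddings; threshold is illustrative
    sim = F.cosine_similarity(embed(face_masked), embed(face_unmasked), dim=0)
    return sim.item() >= threshold
```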
https://arxiv.org/abs/2501.04444
The transfer learning paradigm has driven substantial advancements in various vision tasks. However, as state-of-the-art models continue to grow, classical full fine-tuning often becomes computationally impractical, particularly in the multi-task learning (MTL) setup, where training complexity increases in proportion to the number of tasks. Consequently, recent studies have explored Parameter-Efficient Fine-Tuning (PEFT) for MTL architectures. Despite some progress, these approaches still exhibit limitations in capturing the fine-grained, task-specific features that are crucial to MTL. In this paper, we introduce the Task-Adaptive Dynamic transFormer, termed TADFormer, a novel PEFT framework that performs task-aware feature adaptation in a fine-grained manner by dynamically considering task-specific input contexts. TADFormer proposes parameter-efficient prompting for task adaptation and a Dynamic Task Filter (DTF) to capture task information conditioned on input contexts. Experiments on the PASCAL-Context benchmark demonstrate that the proposed method achieves higher accuracy on dense scene understanding tasks while reducing the number of trainable parameters by up to 8.4 times compared with full fine-tuning of MTL models. TADFormer also demonstrates superior parameter efficiency and accuracy compared to recent PEFT methods.
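TADFormer's prompting component belongs to the broader prompt-tuning family, which can be sketched as learnable per-task tokens prepended to a frozen transformer; the sketch below is that generic family member under assumed dimensions, and it does not reproduce the input-conditioned Dynamic Task Filter.

```python
import torch
import torch.nn as nn

n_tasks, n_prompt, dim = 3, 8, 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
for p in encoder.parameters():
    p.requires_grad = False                       # backbone stays frozen

task_prompts = nn.Parameter(torch.randn(n_tasks, n_prompt, dim) * 0.02)  # only trainable part

def forward_task(tokens, task_id):
    prompt = task_prompts[task_id].expand(tokens.size(0), -1, -1)
    out = encoder(torch.cat([prompt, tokens], dim=1))
    return out[:, n_prompt:]                      # drop prompt positions for dense heads

feats = forward_task(torch.randn(2, 196, dim), task_id=1)  # e.g. 14x14 patch tokens
print(feats.shape)                                          # torch.Size([2, 196, 256])
```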
https://arxiv.org/abs/2501.04293
Reliable slot and intent detection (SID) is crucial in natural language understanding for applications like digital assistants. Encoder-only transformer models fine-tuned on high-resource languages generally perform well on SID. However, they struggle with dialectal data, where no standardized form exists and training data is scarce and costly to produce. We explore zero-shot transfer learning for SID, focusing on multiple Bavarian dialects, for which we release a new dataset for the Munich dialect. We evaluate models trained on auxiliary tasks in Bavarian, and compare joint multi-task learning with intermediate-task training. We also compare three types of auxiliary tasks: token-level syntactic tasks, named entity recognition (NER), and language modelling. We find that the included auxiliary tasks have a more positive effect on slot filling than intent classification (with NER having the most positive effect), and that intermediate-task training yields more consistent performance gains. Our best-performing approach improves intent classification performance on Bavarian dialects by 5.1 and slot filling F1 by 8.4 percentage points.
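Intermediate-task training can be sketched with the Hugging Face transformers API: fine-tune an encoder on the auxiliary task, save it, and re-load it with a fresh head for slot filling. The model name and label counts below are illustrative assumptions, not the paper's exact setup, and the training loops are elided.

```python
from transformers import AutoModelForTokenClassification

base = "deepset/gbert-base"   # an illustrative German encoder, not the paper's exact choice

# Step 1: intermediate task -- NER on dialect data (both tasks are token classification).
ner_model = AutoModelForTokenClassification.from_pretrained(base, num_labels=9)
# ... fine-tune ner_model on Bavarian NER with a standard Trainer loop ...
ner_model.save_pretrained("bavarian-ner-checkpoint")

# Step 2: target task -- slot filling, initialized from the NER-tuned encoder;
# the mismatched classifier head is re-initialized for the new label set.
sid_model = AutoModelForTokenClassification.from_pretrained(
    "bavarian-ner-checkpoint", num_labels=40, ignore_mismatched_sizes=True)
```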
https://arxiv.org/abs/2501.03863
In practical sleep stage classification, a key challenge is the variability of EEG data across different subjects and environments. Differences in physiology, age, health status, and recording conditions can lead to domain shifts between data. These domain shifts often result in decreased model accuracy and reliability, particularly when the model is applied to new data with characteristics different from those it was originally trained on, which is a typical manifestation of negative transfer. To address this, we propose SelectiveFinetuning in this paper. Our method utilizes a pretrained Multi Resolution Convolutional Neural Network (MRCNN) to extract EEG features, capturing the distinctive characteristics of different sleep stages. To mitigate the effect of domain shifts, we introduce a domain-aligning mechanism that employs the Earth Mover Distance (EMD) to evaluate and select source domain data closely matching the target domain. By finetuning the model with selectively chosen source data, our SelectiveFinetuning enhances the model's performance on a target domain that exhibits domain shifts relative to the data used for training. Experimental results show that our method outperforms existing baselines, offering greater robustness and adaptability in practical scenarios where data distributions are often unpredictable.
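The source-selection step can be sketched with SciPy's one-dimensional Wasserstein distance as a stand-in for the paper's EMD computation; the features below are random stand-ins for MRCNN outputs, and the selection size is an arbitrary choice.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
target_feats = rng.normal(0.0, 1.0, 2000)                 # pooled target-domain features
source_subjects = {f"s{i}": rng.normal(i * 0.3, 1.0, 2000) for i in range(10)}

# Score each source subject by distributional distance to the target domain.
scores = {sid: wasserstein_distance(feats, target_feats)
          for sid, feats in source_subjects.items()}
selected = sorted(scores, key=scores.get)[:3]             # keep best-matching subjects
print(selected)                                           # e.g. ['s0', 's1', 's2']
# Fine-tune the pretrained MRCNN on data from `selected` only.
```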
https://arxiv.org/abs/2501.03764