Tables, figures, and listings (TFLs) are essential tools for summarizing clinical trial data. Creating TFLs for reporting activities is a time-consuming task encountered routinely during the execution of clinical trials. This study explored the use of large language models (LLMs) to automate the generation of TFLs through prompt engineering and few-shot transfer learning. Using public clinical trial data in ADaM format, our results demonstrated that LLMs can efficiently generate TFLs from prompt instructions, showcasing their potential in this domain. Furthermore, we developed a conversational agent named Clinical Trial TFL Generation Agent: an app that matches user queries to predefined prompts, which in turn produce customized programs that generate specific predefined TFLs.
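The query-to-prompt matching step of such an agent can be sketched with a simple lexical-similarity lookup. The prompt names and descriptions below are hypothetical placeholders, not the paper's actual prompt library:

```python
# Minimal sketch of matching a user query to a predefined TFL prompt.
# Prompt library contents are invented for illustration.

def tokenize(text):
    return set(text.lower().replace(",", " ").split())

def match_query(query, prompt_library):
    """Return the name of the predefined prompt whose description best
    overlaps the user query (Jaccard similarity over word sets)."""
    q = tokenize(query)
    best_name, best_score = None, -1.0
    for name, description in prompt_library.items():
        d = tokenize(description)
        score = len(q & d) / max(len(q | d), 1)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

prompt_library = {
    "demog_table": "summary table of baseline demographics from ADSL",
    "ae_table": "table of adverse events by system organ class from ADAE",
    "km_figure": "Kaplan-Meier survival figure from ADTTE",
}

print(match_query("generate the adverse events table", prompt_library))
```

A production agent would use embedding-based similarity rather than token overlap, but the control flow is the same: score every predefined prompt against the query and dispatch the best match to the program generator.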
https://arxiv.org/abs/2409.12046
This work presents Spacecraft Pose Network v3 (SPNv3), a Neural Network (NN) for monocular pose estimation of a known, non-cooperative target spacecraft. As opposed to existing literature, SPNv3 is designed and trained to be computationally efficient while providing robustness to spaceborne images that have not been observed during offline training and validation on the ground. These characteristics are essential to deploying NNs on space-grade edge devices. They are achieved through careful NN design choices, and an extensive trade-off analysis reveals data augmentation, transfer learning and a vision transformer architecture as among the features that contribute to simultaneously maximizing robustness and minimizing computational overhead. Experiments demonstrate that the final SPNv3 achieves state-of-the-art pose accuracy on hardware-in-the-loop images from a robotic testbed despite having been trained exclusively on computer-generated synthetic images, effectively bridging the domain gap between synthetic and real imagery. At the same time, SPNv3 runs well above the update frequency of modern satellite navigation filters when tested on a representative graphics processing unit system with flight heritage. Overall, SPNv3 is an efficient, flight-ready NN model readily applicable to a wide range of close-range rendezvous and proximity operations with target resident space objects. The code implementation of SPNv3 will be made publicly available.
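Data augmentation, one of the features the trade-off analysis highlights, can be illustrated with a minimal flip-plus-photometric-jitter pipeline; the specific transforms and ranges below are illustrative assumptions, not SPNv3's actual augmentation recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, rng):
    """Random horizontal flip plus brightness/contrast jitter on an
    image with values in [0, 1]. Transform ranges are arbitrary."""
    if rng.random() < 0.5:
        img = img[:, ::-1]               # horizontal flip
    gain = rng.uniform(0.8, 1.2)         # contrast jitter
    bias = rng.uniform(-0.1, 0.1)        # brightness jitter
    return np.clip(gain * img + bias, 0.0, 1.0)

img = rng.uniform(0.0, 1.0, (8, 8))      # toy stand-in for a spacecraft image
out = augment(img, rng)
```

Applying such perturbations at training time exposes the network to variation it will only see in orbit, which is one way the synthetic-to-real domain gap is narrowed.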
https://arxiv.org/abs/2409.11661
This paper introduces a novel reference-free (RF) audio quality metric called the RF-Generative Machine Listener (RF-GML), designed to evaluate coded mono, stereo, and binaural audio at a 48 kHz sample rate. RF-GML leverages transfer learning from a state-of-the-art full-reference (FR) Generative Machine Listener (GML) with minimal architectural modifications. The term "generative" refers to the model's ability to generate an arbitrary number of simulated listening scores. Unlike existing RF models, RF-GML accurately predicts subjective quality scores across diverse content types and codecs. Extensive evaluations demonstrate its superiority in rating unencoded audio and distinguishing different levels of coding artifacts. RF-GML's performance and versatility make it a valuable tool for coded audio quality assessment and monitoring in various applications, all without the need for a reference signal.
https://arxiv.org/abs/2409.10210
In the realm of industrial manufacturing, Artificial Intelligence (AI) is playing an increasing role, from automating existing processes to aiding in the development of new materials and techniques. However, a significant challenge arises in smaller, experimental processes characterized by limited training data availability, calling into question whether AI models can be trained in such small-data contexts. In this work, we explore the potential of Transfer Learning to address this challenge, specifically investigating the minimum amount of data required to develop a functional AI model. For this purpose, we consider the use case of quality control of Carbon Fiber Reinforced Polymer (CFRP) tape laying in aerospace manufacturing using optical sensors. We investigate the behavior of different open-source computer vision models under a continuous reduction of the training data. Our results show that the amount of data required to successfully train an AI model can be drastically reduced, and that the use of smaller models does not necessarily lead to a loss of performance.
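The data-reduction protocol described above — fix a test split, then retrain on progressively smaller training subsets — can be sketched on synthetic data, with a nearest-centroid classifier standing in for the computer vision models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the optical-sensor data: two well-separated classes in 5-D.
X = np.vstack([rng.normal(0.0, 1.0, (200, 5)), rng.normal(3.0, 1.0, (200, 5))])
y = np.array([0] * 200 + [1] * 200)

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Train a nearest-centroid classifier and return test accuracy."""
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    dists = np.stack([np.linalg.norm(X_te - c, axis=1) for c in centroids])
    return float((dists.argmin(axis=0) == y_te).mean())

# Fixed test split; the training pool shrinks, stratified by class.
test = np.r_[0:50, 200:250]
pool0, pool1 = np.arange(50, 200), np.arange(250, 400)
results = {}
for frac in (1.0, 0.5, 0.1, 0.02):
    n = max(2, int(frac * len(pool0)))
    tr = np.r_[pool0[:n], pool1[:n]]
    results[frac] = nearest_centroid_accuracy(X[tr], y[tr], X[test], y[test])
```

On easy synthetic data accuracy barely drops even at 2% of the training pool; the study's contribution is measuring where that floor sits for real CFRP imagery and real pre-trained models.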
https://arxiv.org/abs/2409.10104
Foundation models pre-trained using self-supervised and weakly-supervised learning have shown powerful transfer learning capabilities on various downstream tasks, including language understanding, text generation, and image recognition. Recently, the Earth observation (EO) field has produced several foundation models pre-trained directly on multispectral satellite imagery (e.g., Sentinel-2) for applications like precision agriculture, wildfire and drought monitoring, and natural disaster response. However, few studies have investigated the ability of these models to generalize to new geographic locations, and potential concerns of geospatial bias -- models trained on data-rich developed countries not transferring well to data-scarce developing countries -- remain. We investigate the ability of popular EO foundation models to transfer to new geographic regions in the agricultural domain, where differences in farming practices and class imbalance make transfer learning particularly challenging. We first select six crop classification datasets across five continents, normalizing for dataset size and harmonizing classes to focus on four major cereal grains: maize, soybean, rice, and wheat. We then compare three popular foundation models, pre-trained on SSL4EO-S12, SatlasPretrain, and ImageNet, using in-distribution (ID) and out-of-distribution (OOD) evaluation. Experiments show that pre-trained weights designed explicitly for Sentinel-2, such as SSL4EO-S12, outperform general pre-trained weights like ImageNet. Furthermore, the benefits of pre-training on OOD data are the most significant when only 10--100 ID training samples are used. Transfer learning and pre-training with OOD and limited ID data show promising applications, as many developing regions have scarce crop type labels. All harmonized datasets and experimental code are open-source and available for download.
https://arxiv.org/abs/2409.09451
We present our system (denoted as T05) for the VoiceMOS Challenge (VMC) 2024. Our system was designed for VMC 2024 Track 1, which focused on the accurate prediction of the naturalness mean opinion score (MOS) of high-quality synthetic speech. In addition to a pretrained self-supervised learning (SSL)-based speech feature extractor, our system incorporates a pretrained image feature extractor to capture the differences among synthetic speech samples observable in speech spectrograms. We first separately train two MOS predictors, one using the SSL-based feature and the other the spectrogram-based feature. Then, we fine-tune the two predictors for better MOS prediction using a fusion of the two extracted features. In VMC 2024 Track 1, our T05 system achieved first place in 7 out of 16 evaluation metrics and second place in the remaining 9 metrics, with a significant margin over the systems ranked third and below. We also report the results of an ablation study investigating the essential factors of our system.
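The benefit of fusing two feature views can be sketched with a linear toy model: two noisy views of the same underlying score, each fit alone and then concatenated. The features and noise levels are invented; the actual system fuses SSL and spectrogram embeddings with neural predictors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true_mos = rng.uniform(1, 5, n)  # ground-truth listening scores

# Two hypothetical feature views: each sees the true score through
# independent noise, mimicking SSL-based vs. spectrogram-based features.
feat_ssl = np.c_[true_mos + 0.6 * rng.normal(size=n), rng.normal(size=n)]
feat_spec = np.c_[true_mos + 0.6 * rng.normal(size=n), rng.normal(size=n)]

def fit_linear(X, y):
    """Least-squares fit with a bias column; returns a predictor."""
    X1 = np.c_[X, np.ones(len(X))]
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return lambda Z: np.c_[Z, np.ones(len(Z))] @ w

tr, te = np.arange(600), np.arange(600, 1000)
mse = lambda pred: float(np.mean((pred - true_mos[te]) ** 2))

single = fit_linear(feat_ssl[tr], true_mos[tr])
fused = fit_linear(np.c_[feat_ssl, feat_spec][tr], true_mos[tr])

mse_single = mse(single(feat_ssl[te]))
mse_fused = mse(fused(np.c_[feat_ssl, feat_spec][te]))
```

Because the two views carry independent noise, the fused predictor roughly halves the residual error — the same intuition that motivates combining the two extractors before fine-tuning.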
https://arxiv.org/abs/2409.09305
This study investigates the efficacy of data augmentation techniques for low-resource automatic speech recognition (ASR), focusing on two endangered Austronesian languages, Amis and Seediq. Recognizing the potential of self-supervised learning (SSL) in low-resource settings, we explore the impact of data volume on the continued pre-training of SSL models. We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data. This scheme utilizes a language classifier to extract utterance embeddings and employs one-class classifiers to identify utterances phonetically and phonologically proximate to the target languages. Utterances are ranked and selected based on their decision scores, ensuring the inclusion of highly relevant data in the SSL-ASR pipeline. Our experimental results demonstrate the effectiveness of this approach, yielding substantial improvements in ASR performance for both Amis and Seediq. These findings underscore the feasibility and promise of data augmentation through cross-lingual transfer learning for low-resource language ASR.
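The data-selection idea — embed candidate utterances and keep those closest to the target language — can be sketched as follows; the centroid-distance score below is a simple stand-in for the paper's one-class classifiers, and all embeddings are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic utterance embeddings (in practice, from a language classifier).
target = rng.normal(0.0, 1.0, (50, 8))      # target-language utterances
pool_near = rng.normal(0.5, 1.0, (30, 8))   # phonetically close candidates
pool_far = rng.normal(5.0, 1.0, (30, 8))    # unrelated candidates
pool = np.vstack([pool_near, pool_far])

# One-class-style scoring: negative distance to the target centroid,
# so a higher score means "more like the target language".
centroid = target.mean(axis=0)
scores = -np.linalg.norm(pool - centroid, axis=1)

# Rank candidates by decision score and keep the top k for augmentation.
top_k = 20
selected = np.argsort(scores)[::-1][:top_k]
```

The selected utterances would then be added to the continued pre-training corpus of the SSL model, which is where the paper reports the ASR gains for Amis and Seediq.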
https://arxiv.org/abs/2409.08872
Foundation models like ChatGPT and Sora, trained on data at a huge scale, have made a revolutionary social impact. However, for sensors in many different fields it is extremely challenging to collect image data at a scale comparable to natural images for training strong foundation models. To this end, this work presents SimMAT, a simple and effective framework for studying an open problem: the transferability of vision foundation models trained on natural RGB images to other image modalities with different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model, the Segment Anything Model (SAM), to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance other sensors' performance. Specifically, SimMAT improves segmentation performance (mIoU) from 22.15% to 53.88% on average across the evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields seeking better results with vision foundation models.
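A modality-agnostic transfer layer can be sketched as a learnable 1x1 convolution that projects an arbitrary number of input channels to the three channels an RGB-pretrained backbone expects; this is a minimal reading of MAT, with shapes and initialization chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mat_layer(x, W, b):
    """1x1 convolution: (C_in, H, W) -> (3, H, W), so the output can be
    fed to a frozen RGB-pretrained backbone such as SAM's image encoder."""
    return np.tensordot(W, x, axes=([1], [0])) + b[:, None, None]

c_in = 4                                   # e.g., a 4-channel polarization image
x = rng.normal(size=(c_in, 16, 16))
W = rng.normal(scale=0.1, size=(3, c_in))  # learnable projection weights
b = np.zeros(3)                            # learnable bias

rgb_like = mat_layer(x, W, b)
print(rgb_like.shape)                      # (3, 16, 16)
```

During transfer, only this thin adapter (and possibly a task head) needs to learn the new modality; the pretrained backbone behind it is reused as-is.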
https://arxiv.org/abs/2409.08083
Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state-of-the-art approaches can regress parametric 3D face models in real-time across a wide range of identities, lighting conditions and poses by leveraging large image datasets of human faces. These methods however suffer from clear limitations in that the underlying parametric face model only provides a coarse estimation of the face shape, thereby limiting their practical applicability in tasks that require precise 3D reconstruction (aging, face swapping, digital make-up, etc.). In this paper, we propose a method for high-precision 3D face capture that takes advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two-stage approach. We start with the reconstruction of a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then use the encoder from a pre-trained monocular face reconstruction method, substituting its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. The result is a trained encoder capable of efficiently regressing pose and expression parameters in real-time from previously unseen images, which, combined with our personalized geometry model, yields more accurate and higher-fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we showcase the superiority of our final model over state-of-the-art baselines, and demonstrate its generalization to unseen poses, expressions and lighting.
https://arxiv.org/abs/2409.07984
The field of Computer Vision (CV) has faced challenges. Initially, it relied on handcrafted features and rule-based algorithms, resulting in limited accuracy. The introduction of machine learning (ML) has brought progress, particularly Transfer Learning (TL), which addresses various CV problems by reusing pre-trained models. TL requires less data and computing while delivering nearly equal accuracy, making it a prominent technique in the CV landscape. Our research focuses on TL development and how CV applications use it to solve real-world problems. We discuss recent developments, limitations, and opportunities.
https://arxiv.org/abs/2409.07736
In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfying accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable solution to expand beyond this small set of predefined tags by enabling models to learn from only a few human-provided examples to understand tag meanings and subsequently apply these tags autonomously. We propose to integrate few-shot learning methodology into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate different popular pre-trained features, as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer learning-based few-shot approach could effectively address the issue of automatically assigning long-tail tags with only limited labeled data.
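A linear probe for few-shot tagging is straightforward to sketch: frozen embeddings in, a one-vs-rest linear classifier out. The synthetic embeddings below replace real pre-trained audio features, and ridge regression stands in for the paper's linear classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

n_tags, dim, shots = 3, 16, 20
# Synthetic "pre-trained embeddings": each active tag adds its own direction.
directions = rng.normal(size=(n_tags, dim))

def make_batch(n):
    Y = rng.integers(0, 2, size=(n, n_tags))            # multi-hot tag labels
    X = Y @ directions + 0.3 * rng.normal(size=(n, dim))
    return X, Y

X_tr, Y_tr = make_batch(shots * n_tags)   # roughly 20 shots per tag
X_te, Y_te = make_batch(200)

# Linear probe: one-vs-rest ridge regression on the frozen features.
lam = 1e-2
W = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(dim), X_tr.T @ Y_tr)
pred = (X_te @ W > 0.5).astype(int)
acc = float((pred == Y_te).mean())
```

When the backbone embeddings already separate the tag concepts, a probe this small generalizes well from a handful of examples — which is the paper's argument for handling long-tail tags with limited labels.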
https://arxiv.org/abs/2409.07730
To promote inclusion and ensure effective communication for those who rely on sign language as their main form of communication, sign language recognition (SLR) is crucial. SLR integrates seamlessly with diverse technologies, enhancing accessibility for the deaf community by facilitating their use of digital platforms, video calls, and communication devices. To effectively solve this problem, we suggest a novel solution that uses a deep neural network to fully automate sign language recognition. This methodology integrates sophisticated preprocessing methodologies to optimise the overall performance. The ResNet, Inception, Xception, and VGG architectures are utilised to selectively categorise images of sign language. We prepared a DNN architecture and merged it with the preprocessing architectures. In the post-processing phase, we utilised the SHAP deep explainer, which is based on cooperative game theory, to quantify the influence of specific features on the output of the machine learning model. The Bhutanese-Sign-Language (BSL) dataset was used for training and testing the suggested technique. When trained on the BSL dataset, ResNet50 combined with the DNN model achieved the best overall accuracy, 98.90%. Our model's ability to provide informational clarity was assessed using the SHAP (SHapley Additive exPlanations) method. Owing in part to its considerable robustness and reliability, the proposed methodological approach can be used to develop a fully automated system for sign language recognition.
https://arxiv.org/abs/2409.07426
Biometric authentication has garnered significant attention as a secure and efficient method of identity verification. Among the various modalities, hand vein biometrics, including finger vein, palm vein, and dorsal hand vein recognition, offer unique advantages due to their high accuracy, low susceptibility to forgery, and non-intrusiveness. The vein patterns within the hand are highly complex and distinct for each individual, making them an ideal biometric identifier. Additionally, hand vein recognition is contactless, enhancing user convenience and hygiene compared to other modalities such as fingerprint or iris recognition. Furthermore, the veins are internally located, rendering them less susceptible to damage or alteration, thus enhancing the security and reliability of the biometric system. The combination of these factors makes hand vein biometrics a highly effective and secure method for identity verification. This review paper delves into the latest advancements in deep learning techniques applied to finger vein, palm vein, and dorsal hand vein recognition. It encompasses all essential fundamentals of hand vein biometrics, summarizes publicly available datasets, and discusses state-of-the-art metrics used for evaluating the three modes. Moreover, it provides a comprehensive overview of suggested approaches for finger, palm, dorsal, and multimodal vein techniques, offering insights into the best performance achieved, data augmentation techniques, and effective transfer learning methods, along with associated pretrained deep learning models. Additionally, the review addresses research challenges faced and outlines future directions and perspectives, encouraging researchers to enhance existing methods and propose innovative techniques.
https://arxiv.org/abs/2409.07128
As humans can explore and understand the world through the sense of touch, tactile sensing is also an important aspect of robotic perception. In unstructured environments, robots can encounter both known and novel objects, which calls for a method that addresses both. In this study, we combine a particle filter (PF) and a Gaussian process implicit surface (GPIS) in a unified Bayesian framework. The framework can differentiate between known and novel objects, perform object recognition, estimate pose for known objects, and reconstruct shapes for unknown objects, in an active-learning fashion. By grounding the selection of the GPIS prior with the maximum-likelihood-estimation (MLE) shape from the PF, knowledge about known objects' shapes can be transferred to learn novel shapes. An exploration procedure with global shape estimation is proposed to guide active data acquisition and conclude the exploration when sufficient information is obtained. The performance of the proposed Bayesian framework is evaluated through simulations on known and novel objects initialized with random poses, and is compared with a rapidly-exploring random tree (RRT). The results show that the proposed exploration procedure, utilizing global shape estimation, achieves faster exploration than the RRT-based local exploration procedure. Overall, the results indicate that the proposed framework is effective and efficient in object recognition, pose estimation and shape reconstruction. Moreover, we show that a learned shape can be included as a new prior and used effectively for future object recognition and pose estimation of novel objects.
https://arxiv.org/abs/2409.06912
This paper presents Adaptive Meta-Domain Transfer Learning (AMDTL), a novel methodology that combines principles of meta-learning with domain-specific adaptations to enhance the transferability of artificial intelligence models across diverse and unknown domains. AMDTL aims to address the main challenges of transfer learning, such as domain misalignment, negative transfer, and catastrophic forgetting, through a hybrid framework that emphasizes both generalization and contextual specialization. The framework integrates a meta-learner trained on a diverse distribution of tasks, adversarial training techniques for aligning domain feature distributions, and dynamic feature regulation mechanisms based on contextual domain embeddings. Experimental results on benchmark datasets demonstrate that AMDTL outperforms existing transfer learning methodologies in terms of accuracy, adaptation efficiency, and robustness. This research provides a solid theoretical and practical foundation for the application of AMDTL in various fields, opening new perspectives for the development of more adaptable and inclusive AI systems.
https://arxiv.org/abs/2409.06800
In deep learning, transfer learning and ensemble models have shown promise in improving computer-aided disease diagnosis. However, the application of transfer learning and ensemble models is still relatively limited. Moreover, ensemble model development is often ad hoc, overlooks redundant layers, and suffers from imbalanced datasets and inadequate augmentation. Finally, although several Deep Convolutional Neural Networks (D-CNNs) have been introduced to detect and classify breast cancer, very few comparative studies have investigated the accuracy and efficiency of existing CNN architectures. Recognising these gaps, this study compares the performance of D-CNNs, including original CNNs, transfer learning, and an ensemble model, in detecting breast cancer. The comparison in this paper covers six CNN-based deep learning architectures (SE-ResNet152, MobileNetV2, VGG19, ResNet18, InceptionV3, and DenseNet-121), transfer learning, and an ensemble model for breast cancer detection. Among these models, the ensemble model provides the highest detection and classification accuracy, 99.94%. However, this study also reports a negative result for transfer learning, as transfer learning did not increase the accuracy of the original SE-ResNet152, MobileNetV2, VGG19, ResNet18, InceptionV3, and DenseNet-121 models. The high accuracy in detecting and categorising breast cancer using CNNs suggests that the CNN model is promising for breast cancer disease detection. This research is significant for biomedical engineering, computer-aided disease diagnosis, and ML-based disease detection.
https://arxiv.org/abs/2409.06699
Over the years, several efficient Convolutional Neural Network (CNN) architectures for object detection, such as DenseNet201, InceptionV3, ResNet152v2, SEResNet152, VGG19 and Xception, have gained significant attention due to their performance. Moreover, CNN paradigms have expanded to transfer learning and ensemble models built from original CNN architectures. Research studies suggest that transfer learning and ensemble models are capable of increasing the accuracy of deep learning (DL) models. However, very few studies have conducted comprehensive experiments utilizing these techniques for detecting and localizing blood malignancies. Recognizing this gap, this study conducted three experiments: in the first, six original CNNs were used; in the second, transfer learning was applied; and in the third, a novel ensemble model, DIX (DenseNet201, InceptionV3, and Xception), was developed to detect and classify blood cancer. The statistical results suggest that DIX outperformed both the original CNNs and transfer learning, providing an accuracy of 99.12%. However, this study also reports a negative result for transfer learning, as transfer learning did not increase the accuracy of the original CNNs. Like many other cancers, blood cancers require timely identification for effective treatment plans and increased survival possibilities. The high accuracy in detecting and categorizing blood cancer using CNNs suggests that the CNN model is promising for blood cancer disease detection. This research is significant in the fields of biomedical engineering, computer-aided disease diagnosis, and ML-based disease detection.
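The ensemble step in a DIX-style model can be sketched as probability averaging over the base networks' outputs; the logits below are made-up stand-ins for the three CNNs' actual predictions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical per-model logits for 4 samples over 2 classes: each base
# model is noisy on its own; averaging their probabilities stabilizes
# the final decision.
logits = {
    "DenseNet201": np.array([[2.0, 0.1], [0.2, 1.5], [1.9, 0.3], [0.1, 2.2]]),
    "InceptionV3": np.array([[1.2, 0.8], [0.9, 1.1], [0.4, 1.0], [0.3, 1.8]]),
    "Xception":    np.array([[1.8, 0.2], [0.1, 1.9], [1.5, 0.2], [0.5, 1.6]]),
}

probs = np.mean([softmax(l) for l in logits.values()], axis=0)
ensemble_pred = probs.argmax(axis=1)
print(ensemble_pred)  # prints [0 1 0 1]
```

On the third sample one base model disagrees, but the averaged probabilities still recover the majority view — the basic mechanism by which the ensemble outperforms its members.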
https://arxiv.org/abs/2409.06689
In recent years, robots have become an important part of our day-to-day lives, with various applications. Human-robot interaction (HRI) creates a positive impact in the field of robotics by enabling people to interact and communicate with robots. Gesture recognition techniques combined with machine learning algorithms have shown remarkable progress in recent years, particularly in HRI. This paper comprehensively reviews the latest advancements in gesture recognition methods and their integration with machine learning approaches to enhance HRI. Furthermore, it presents vision-based gesture recognition for safe and reliable human-robot interaction with a depth-sensing system, and analyses the role of machine learning algorithms such as deep learning, reinforcement learning, and transfer learning in improving the accuracy and robustness of gesture recognition systems for effective communication between humans and robots.
https://arxiv.org/abs/2409.06503
Traditional dialogue state tracking approaches heavily rely on extensive training data and handcrafted features, limiting their scalability and adaptability to new domains. In this paper, we propose a novel method that leverages inference and in-context learning with ChatGPT for domain transfer in dialogue state tracking, without any parameter updates. By guiding ChatGPT's chain of thought, we enable it to retrieve relevant examples and generalize knowledge to accurately infer dialogue states, solely through inference. Experimental results on the MultiWOZ dataset demonstrate competitive performance and promising generalization across domains. Our parameter-free approach offers a scalable and adaptable solution, opening new research directions in domain transfer learning.
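The retrieve-then-prompt step can be sketched in a few lines: rank annotated examples by similarity to the current user turn and pack the best ones into the in-context prompt. The example bank, slot names, and token-overlap retriever below are invented for illustration; the paper works with MultiWOZ and ChatGPT:

```python
# Sketch of in-context example retrieval for dialogue state tracking.
# No parameters are updated: everything happens in the prompt.

def overlap(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / max(len(a | b), 1)

def build_prompt(query, example_bank, k=2):
    """Select the k most similar annotated turns and format the prompt."""
    ranked = sorted(example_bank,
                    key=lambda ex: overlap(query, ex["utterance"]),
                    reverse=True)
    lines = ["Track the dialogue state as slot=value pairs.", ""]
    for ex in ranked[:k]:
        lines.append(f"User: {ex['utterance']}")
        lines.append(f"State: {ex['state']}")
        lines.append("")
    lines.append(f"User: {query}")
    lines.append("State:")
    return "\n".join(lines)

bank = [
    {"utterance": "i need a cheap hotel in the north",
     "state": "hotel-price=cheap; hotel-area=north"},
    {"utterance": "i want an italian restaurant in the centre",
     "state": "restaurant-food=italian; restaurant-area=centre"},
    {"utterance": "find me a train to cambridge on monday",
     "state": "train-dest=cambridge; train-day=monday"},
]

prompt = build_prompt("i want a cheap italian restaurant", bank, k=1)
print(prompt)
```

The completed prompt is sent to the LLM, whose continuation after `State:` is parsed as the inferred dialogue state; swapping the example bank is all that domain transfer requires.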
https://arxiv.org/abs/2409.06243
Cherenkov imaging enables real-time visualization of megavoltage X-ray or electron beam delivery to the patient during Radiation Therapy (RT). Bio-morphological features seen in these images, such as vasculature, are patient-specific signatures that can be used for the verification of positioning and motion management essential to precise RT treatment. Until now, however, no concerted analysis of this biological-feature-based tracking had been undertaken, because of the slow speed and limited accuracy of conventional image processing for feature segmentation. This study demonstrates the first deep learning framework for such an application, achieving video-frame-rate processing. To address the challenge of limited annotation of these features in Cherenkov images, a transfer learning strategy was applied. A fundus photography dataset including 20,529 patch retina images with ground-truth vessel annotation was used to pre-train a ResNet segmentation framework. Subsequently, a small Cherenkov dataset (1,483 images from 212 treatment fractions of 19 breast cancer patients) with known annotated vasculature masks was used to fine-tune the model for accurate segmentation prediction. This deep learning framework achieved consistent and rapid segmentation of Cherenkov-imaged bio-morphological features on another 19 patients, including subcutaneous veins, scars, and pigmented skin. On average, segmentation by the model achieved a Dice score of 0.85 and required less than 0.7 milliseconds of processing time per instance. Compared to conventional manual segmentation methods, the model demonstrated outstanding consistency against input image variances and superior speed, laying the foundation for online segmentation and real-time monitoring in a prospective setting.
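The pre-train-then-fine-tune pattern can be reduced to a linear toy: fit on a plentiful, related source task, then use that solution as the starting point (here, a ridge prior) when only a handful of target samples exist. This is a sketch of the transfer idea only, not the paper's ResNet pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 20
w_true = rng.normal(size=dim)             # target-task ground truth

def make(n, w, noise):
    X = rng.normal(size=(n, dim))
    return X, X @ w + noise * rng.normal(size=n)

# Source task (plentiful, like the fundus data) is related but not
# identical to the target task (scarce, like the Cherenkov data).
w_src = w_true + 0.1 * rng.normal(size=dim)
Xs, ys = make(2000, w_src, 0.1)           # large source training set
Xt, yt = make(15, w_true, 0.1)            # only 15 target samples
Xv, yv = make(500, w_true, 0.1)           # target validation set

def ridge(X, y, lam, w0=None):
    """Ridge regression shrunk toward w0 (zero = training from scratch)."""
    if w0 is None:
        w0 = np.zeros(dim)
    return np.linalg.solve(X.T @ X + lam * np.eye(dim), X.T @ y + lam * w0)

w_pre = ridge(Xs, ys, 1.0)                # "pre-training" on the source task
w_scratch = ridge(Xt, yt, 1.0)            # target data only, from zero
w_finetune = ridge(Xt, yt, 1.0, w0=w_pre) # fine-tune from the pre-trained fit

mse = lambda w: float(np.mean((Xv @ w - yv) ** 2))
print(f"scratch: {mse(w_scratch):.3f}  fine-tuned: {mse(w_finetune):.3f}")
```

With 15 samples in 20 dimensions the from-scratch fit is badly underdetermined, while the fine-tuned fit inherits what the source task already learned — the same leverage the paper gets from 20,529 retina patches against 1,483 Cherenkov images.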
https://arxiv.org/abs/2409.05666