Extracting medication names from handwritten doctor prescriptions is challenging due to the wide variability in handwriting styles and prescription formats. This paper presents a robust method for extracting medicine names using a combination of Mask R-CNN and Transformer-based Optical Character Recognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A novel dataset, featuring diverse handwritten prescriptions from various regions of Pakistan, was utilized to fine-tune the model on different handwriting styles. The Mask R-CNN model segments the prescription images to focus on the medicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and Positional Embeddings, transcribes the isolated text. The transcribed text is then matched against a pre-existing database for accurate identification. The proposed approach achieved a character error rate (CER) of 1.4% on standard benchmarks, highlighting its potential as a reliable and efficient tool for automating medicine name extraction.
https://arxiv.org/abs/2412.18199
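The two evaluation-side steps named in the abstract above (character error rate and matching a transcript against a medicine database) can be illustrated with a minimal sketch. It is not the authors' implementation: the `MEDICINES` list, the `match_medicine` helper, and the 0.6 similarity cutoff are assumptions made for illustration, and the fuzzy matching uses Python's standard `difflib` rather than whatever lookup the paper uses.

```python
from difflib import get_close_matches


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def match_medicine(transcript: str, database: list[str], cutoff: float = 0.6):
    """Return the closest database entry, or None if nothing clears the cutoff."""
    lowered = {name.lower(): name for name in database}
    hits = get_close_matches(transcript.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


# Hypothetical database entries, for illustration only.
MEDICINES = ["Paracetamol", "Amoxicillin", "Omeprazole"]
```

A slightly misspelled OCR output such as `"Paracetamal"` would still resolve to `"Paracetamol"` under these assumptions, while an unrelated string returns `None`.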
Generating realistic-looking, readable images of handwritten text is a challenging task referred to as handwritten text generation (HTG). Given a string and examples from a writer, the goal is to synthesize an image depicting the correctly spelled word in handwriting with the calligraphic style of the desired writer. An important application of HTG is the generation of training images in order to adapt downstream models to new data sets. With their success in natural image generation, diffusion models (DMs) have become the state-of-the-art approach in HTG. In this work, we present an extension of a latent DM for HTG that enables generation of writing styles not seen during training by learning style conditioning with a masked autoencoder. Our proposed content encoder allows for different ways of conditioning the DM on textual and calligraphic features. Additionally, we employ classifier-free guidance and explore its influence on the quality of the generated training images. For adapting the model to a new unlabeled data set, we propose a semi-supervised training scheme. We evaluate our approach on the IAM database and use the RIMES database to examine the generation of data not seen during training, achieving improvements in this particularly promising application of DMs for HTG.
https://arxiv.org/abs/2412.15853
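For readers unfamiliar with the classifier-free guidance mentioned in the abstract above, one widely used formulation is given below. The symbols are assumptions rather than the paper's notation: $\epsilon_\theta$ is the denoiser, $x_t$ the noisy latent at step $t$, $c$ the conditioning (here, text content and writer style), $\varnothing$ the dropped (null) condition, and $w$ the guidance scale.

```latex
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```

In this convention, $w = 1$ recovers the purely conditional prediction and $w > 1$ pushes samples further toward the condition. Other papers parameterize the scale differently, so the abstract's exact form may differ.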
Currently, the prevalence of online handwriting has spurred a critical need for effective retrieval systems to accurately search relevant handwriting instances from specific writers, known as online writer retrieval. Despite the growing demand, this field suffers from a scarcity of well-established methodologies and public large-scale datasets. This paper tackles these challenges with a focus on Chinese handwritten phrases. First, we propose DOLPHIN, a novel retrieval model designed to enhance handwriting representations through synergistic temporal-frequency analysis. For frequency feature learning, we propose the HFGA block, which performs gated cross-attention between the vanilla temporal handwriting sequence and its high-frequency sub-bands to amplify salient writing details. For temporal feature learning, we propose the CAIR block, tailored to promote channel interaction and reduce channel redundancy. Second, to address data deficit, we introduce OLIWER, a large-scale online writer retrieval dataset encompassing over 670,000 Chinese handwritten phrases from 1,731 individuals. Through extensive evaluations, we demonstrate the superior performance of DOLPHIN over existing methods. In addition, we explore cross-domain writer retrieval and reveal the pivotal role of increasing feature alignment in bridging the distributional gap between different handwriting data. Our findings emphasize the significance of point sampling frequency and pressure features in improving handwriting representation quality and retrieval performance. The code and dataset have been released by the authors.
https://arxiv.org/abs/2412.11668
The problem of converting images of text into plain text is widely researched in both academia and industry. Arabic Handwritten Text Recognition (AHTR) poses additional challenges due to diverse handwriting styles and limited labeled data. In this paper, we present a complete OCR pipeline that starts with line segmentation using Differentiable Binarization and Adaptive Scale Fusion techniques to ensure accurate detection of text lines. Following segmentation, a CNN-BiLSTM-CTC architecture is applied to recognize characters. Our system, trained on the Arabic Multi-Fonts Dataset (AMFDS), achieves a Character Recognition Rate (CRR) of 99.20% and a Word Recognition Rate (WRR) of 93.75% on single-word samples containing 7 to 10 characters, along with a CRR of 83.76% for sentences. These results demonstrate the system's strong performance in handling Arabic scripts, establishing a new benchmark for AHTR systems.
https://arxiv.org/abs/2412.01601
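The CTC output stage and the word-level metric named in the abstract above can be sketched as follows. This is a generic best-path (greedy) CTC decoder and a plain exact-match word rate, not the authors' code; the choice of label `0` as the blank symbol and the function names are assumptions.

```python
def ctc_greedy_decode(frame_ids: list[int], blank: int = 0) -> list[int]:
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    decoded = []
    prev = None
    for idx in frame_ids:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


def word_recognition_rate(refs: list[str], hyps: list[str]) -> float:
    """Fraction of words whose transcription matches the reference exactly."""
    if len(refs) != len(hyps) or not refs:
        raise ValueError("refs and hyps must be non-empty and the same length")
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

A blank between two identical labels is what lets CTC emit a doubled character: the frame sequence `[0, 1, 1, 0, 1, 2, 2, 0]` decodes to `[1, 1, 2]`.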
Hand preference and degree of handedness (DoH) are two different aspects of human behavior that are often conflated. DoH is an inherent capability of a person's brain, affected by both nature and nurture. In this study, we used dominant and non-dominant handwriting traits to assess DoH for the first time, on 43 subjects in three categories: Unidextrous, Partially Unidextrous, and Ambidextrous. Features extracted from the segmented handwriting signals, called strokes, were used for DoH quantification. The Davies-Bouldin Index, a multilayer perceptron, and a Convolutional Neural Network (CNN) were used for automated grading of DoH. The outcomes of these methods were compared with the widely used DoH assessment questionnaires from the Edinburgh Inventory (EI). The CNN-based automated grading outperformed the other computational methods with an average classification accuracy of 95.06% under stratified 10-fold cross-validation. The leave-one-subject-out strategy on this CNN resulted in a test individual's DoH score, which was converted into a 4-point score. Around 90% of the scores obtained from all the implemented computational methods were found to be in accordance with the EI scores under a 95% confidence interval. Automated grading of degree of handedness using handwriting signals can provide more resolution to the Edinburgh Inventory scores. This could be used in multiple applications concerned with neuroscience, rehabilitation, physiology, psychometry, behavioral sciences, and forensics.
https://arxiv.org/abs/2412.01587
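The leave-one-subject-out strategy mentioned in the abstract above holds out every sample from one subject at a time, so the model is never tested on a writer it has seen. A minimal sketch of the split logic follows; the function name and the per-sample `subject_ids` list are illustrative assumptions, not the study's code.

```python
def leave_one_subject_out(subject_ids: list[str]):
    """Yield (held_out_subject, train_indices, test_indices) per subject."""
    for subject in sorted(set(subject_ids)):
        # All samples of the held-out subject form the test set.
        test = [i for i, s in enumerate(subject_ids) if s == subject]
        train = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train, test
```

With 43 subjects this produces 43 train/test splits, one per held-out individual.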
The Virginia Tech University Libraries (VTUL) Digital Library Platform (DLP) hosts digital collections that offer our users access to a wide variety of documents of historical and cultural importance. These collections are not only of academic importance but also provide our users with a glance at local historical events. Our DLP contains collections comprising digital objects featuring complex layouts, faded imagery, and hard-to-read handwritten text, which makes providing online access to these materials challenging. To address these issues, we integrate AI into our DLP workflow and convert the text in the digital objects into a machine-readable format. To enhance the user experience with our historical collections, we use custom AI agents for handwriting recognition, text extraction, and large language models (LLMs) for summarization. This poster highlights three collections focusing on handwritten letters, newspapers, and digitized topographic maps. We discuss the challenges with each collection and detail our approaches to address them. Our proposed methods aim to enhance the user experience by making the contents in these collections easier to search and navigate.
https://arxiv.org/abs/2411.17600
The generation of handwritten music sheets is a crucial step toward enhancing Optical Music Recognition (OMR) systems, which rely on large and diverse datasets for optimal performance. However, handwritten music sheets, often found in archives, present challenges for digitisation due to their fragility, varied handwriting styles, and image quality. This paper addresses the data scarcity problem by applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheets. We provide a comprehensive evaluation of three GAN models - DCGAN, ProGAN, and CycleWGAN - comparing their ability to generate diverse and high-quality handwritten music images. The proposed CycleWGAN model, which enhances style transfer and training stability, significantly outperforms DCGAN and ProGAN in both qualitative and quantitative evaluations. CycleWGAN achieves superior performance, with an FID score of 41.87, an IS of 2.29, and a KID of 0.05, making it a promising solution for improving OMR systems.
https://arxiv.org/abs/2411.16405
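As background for the FID score reported in the abstract above, the Fréchet Inception Distance compares Gaussian fits to feature embeddings of real and generated images. In the standard definition below, $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of the real and generated feature distributions (the subscripts are notation chosen here, not taken from the paper).

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\bigl( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \bigr)
```

Lower FID and KID indicate generated images closer to the real distribution, whereas a higher Inception Score (IS) is better.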
Ge'ez, an ancient Ethiopic script of cultural and historical significance, has been largely neglected in handwriting recognition research, hindering the digitization of valuable manuscripts. Our study addresses this gap by developing a state-of-the-art Ge'ez handwriting recognition system using Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. Our approach uses a two-stage recognition process. First, a CNN is trained to recognize individual characters; it then acts as a feature extractor for an LSTM-based word recognition system. This two-stage approach achieves new top scores in Ge'ez handwriting recognition, outperforming eight state-of-the-art methods, including SVTR and ASTER, as well as human performance as reported in the HHD-Ethiopic dataset work. This research significantly advances the preservation and accessibility of Ge'ez cultural heritage, with implications for historical document digitization, educational tools, and cultural preservation. The code will be released upon acceptance.
https://arxiv.org/abs/2411.13350
Dysgraphia is a learning disorder that affects handwriting abilities, making it challenging for children to write legibly and consistently. Early detection and monitoring are crucial for providing timely support and interventions. This study applies deep learning techniques to address the dual tasks of dysgraphia detection and optical character recognition (OCR) on handwriting samples from children with potential dysgraphic symptoms. Using a dataset of handwritten samples from Malaysian schoolchildren, we developed a custom Convolutional Neural Network (CNN) model, alongside VGG16 and ResNet50, to classify handwriting as dysgraphic or non-dysgraphic. The custom CNN model outperformed the pre-trained models, achieving a test accuracy of 91.8% with high precision, recall, and AUC, demonstrating its robustness in identifying dysgraphic handwriting features. Additionally, an OCR pipeline was created to segment and recognize individual characters in dysgraphic handwriting, achieving a character recognition accuracy of approximately 43.5%. This research highlights the potential of deep learning in supporting dysgraphia assessment, laying a foundation for tools that could assist educators and clinicians in identifying dysgraphia and tracking handwriting progress over time. The findings contribute to advancements in assistive technologies for learning disabilities, offering hope for more accessible and accurate diagnostic tools in educational and clinical settings.
https://arxiv.org/abs/2411.13595
In recent years, brain-computer interfaces have made advances in decoding various motor-related tasks, including gesture recognition and movement classification, utilizing electroencephalogram (EEG) data. These developments are fundamental in exploring how neural signals can be interpreted to recognize specific physical actions. This study centers on a written alphabet classification task, where we aim to decode EEG signals associated with handwriting. To achieve this, we incorporate hand kinematics to guide the extraction of consistent embeddings of high-dimensional neural recordings using auxiliary variables (CEBRA). These CEBRA embeddings, along with the EEG, are processed by a parallel convolutional neural network model that extracts features from both data sources simultaneously. The model classifies nine different handwritten characters, including symbols such as exclamation marks and commas. We evaluate the model using five-fold cross-validation and explore the structure of the embedding space through visualizations. Our approach achieves a classification accuracy of 91% for the nine-class task, demonstrating the feasibility of fine-grained handwriting decoding from EEG.
https://arxiv.org/abs/2411.09170
Objective: We present the PaHaW Parkinson's disease handwriting database, consisting of handwriting samples from Parkinson's disease (PD) patients and healthy controls. Our goal is to show that kinematic features and pressure features in handwriting can be used for the differential diagnosis of PD. Methods and Material: The database contains records from 37 PD patients and 38 healthy controls performing eight different handwriting tasks. The tasks include drawing an Archimedean spiral, repetitively writing orthographically simple syllables and words, and writing of a sentence. In addition to the conventional kinematic features related to the dynamics of handwriting, we investigated new pressure features based on the pressure exerted on the writing surface. To discriminate between PD patients and healthy subjects, three different classifiers were compared: K-nearest neighbors (K-NN), ensemble AdaBoost classifier, and support vector machines (SVM). Results: For predicting PD based on kinematic and pressure features of handwriting, the best performing model was SVM with classification accuracy of Pacc = 81.3% (sensitivity Psen = 87.4% and specificity of Pspe = 80.9%). When evaluated separately, pressure features proved to be relevant for PD diagnosis, yielding Pacc = 82.5% compared to Pacc = 75.4% using kinematic features. Conclusion: Experimental results showed that an analysis of kinematic and pressure features during handwriting can help assess subtle characteristics of handwriting and discriminate between PD patients and healthy controls.
https://arxiv.org/abs/2411.03044
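The three figures quoted in the PaHaW abstract above (accuracy, sensitivity, specificity) all derive from the binary confusion matrix. The sketch below shows the standard definitions; the label convention (1 = PD patient, 0 = healthy control) and the function name are assumptions for illustration.

```python
def diagnostic_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # patients correctly flagged
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # controls correctly cleared
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # controls wrongly flagged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # patients missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }
```

Sensitivity answers "how many patients were caught", specificity "how many healthy controls were correctly left alone". Reporting both matters because the two groups (37 vs. 38 subjects) are nearly, but not exactly, balanced.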
Generating context-adaptive manipulation and grasping actions is a challenging problem in robotics. Classical planning and control algorithms tend to be inflexible with regard to parameterization by external variables such as object shapes. In contrast, Learning from Demonstration (LfD) approaches, due to their nature as function approximators, allow for introducing external variables to modulate policies in response to the environment. In this paper, we utilize this property by introducing an LfD approach to acquire context-dependent grasping and manipulation strategies. We treat the problem as a kernel-based function approximation, where the kernel inputs include generic context variables describing task-dependent parameters such as the object shape. We build on existing work on policy fusion with uncertainty quantification to propose a state-dependent approach that automatically returns to demonstrations, avoiding unpredictable behavior while smoothly adapting to context changes. The approach is evaluated against the LASA handwriting dataset and on a real 7-DoF robot in two scenarios: adaptation to slippage while grasping and manipulating a deformable food item.
https://arxiv.org/abs/2410.24035
There is an immense quantity of historical and cultural documentation that exists only as handwritten manuscripts. At the same time, performing OCR across scripts and different handwriting styles has proven to be an enormously difficult problem relative to digitizing print. While recent Transformer-based models have achieved relatively strong performance, they rely heavily on manually transcribed training data and have difficulty generalizing across writers. Multimodal LLMs, such as GPT-4v and Gemini, have demonstrated effectiveness in performing OCR and computer vision tasks with few-shot prompting. In this paper, I evaluate the accuracy of handwritten document transcriptions generated by Gemini against the current state-of-the-art Transformer-based methods. Keywords: Optical Character Recognition, Multimodal Language Models, Cultural Preservation, Mass Digitization, Handwriting Recognition
https://arxiv.org/abs/2410.24034
Cognitive decline is a natural part of aging, often resulting in reduced cognitive abilities. In some cases, however, this decline is more pronounced, typically due to disorders such as Alzheimer's disease. Early detection of anomalous cognitive decline is crucial, as it can facilitate timely professional intervention. While medical data can help in this detection, it often involves invasive procedures. An alternative approach is to employ non-intrusive techniques such as speech or handwriting analysis, which do not necessarily affect daily activities. This survey reviews the most relevant methodologies that use deep learning techniques to automate the cognitive decline estimation task, including audio, text, and visual processing. We discuss the key features and advantages of each modality and methodology, including state-of-the-art approaches like Transformer architecture and foundation models. In addition, we present works that integrate different modalities to develop multimodal models. We also highlight the most significant datasets and the quantitative results from studies using these resources. From this review, several conclusions emerge. In most cases, the textual modality achieves the best results and is the most relevant for detecting cognitive decline. Moreover, combining various approaches from individual modalities into a multimodal model consistently enhances performance across nearly all scenarios.
https://arxiv.org/abs/2410.18972
Dyslexia is one of the most common learning disorders, often characterized by distinct features in handwriting. Early detection is essential for effective intervention. In this paper, we propose an explainable AI (XAI) framework for dyslexia detection through handwriting analysis, utilizing transfer learning and transformer-based models. Our approach surpasses state-of-the-art methods, achieving a test accuracy of 0.9958, while ensuring model interpretability through Grad-CAM visualizations that highlight the critical handwriting features influencing model decisions. The main contributions of this work include the integration of XAI for enhanced interpretability, adaptation to diverse languages and writing systems, and demonstration of the method's global applicability. This framework not only improves diagnostic accuracy but also fosters trust and understanding among educators, clinicians, and parents, supporting earlier diagnoses and the development of personalized educational strategies.
https://arxiv.org/abs/2410.19821
Alzheimer's Disease (AD) is a prevalent neurodegenerative condition where early detection is vital. Handwriting, often affected early in AD, offers a non-invasive and cost-effective way to capture subtle motor changes. State-of-the-art research on handwriting-based AD detection, mostly using online handwriting, has predominantly relied on manually extracted features fed as input to shallow machine learning models. Some recent works have proposed deep learning (DL)-based models, with either 1D-CNN or 2D-CNN architectures, whose performance compares favorably to handcrafted schemes. These approaches, however, overlook the intrinsic relationship between the 2D spatial patterns of handwriting strokes and their 1D dynamic characteristics, thus limiting their capacity to capture the multimodal nature of handwriting data. Moreover, the application of Transformer models remains largely unexplored. To address these limitations, we propose a novel approach for AD detection, consisting of a learnable multimodal hybrid attention model that simultaneously integrates 2D handwriting images with 1D dynamic handwriting signals. Our model leverages a gated mechanism to combine similarity and difference attention, blending the two modalities and learning robust features by incorporating information at different scales. Our model achieved state-of-the-art performance on the DARWIN dataset, with an F1-score of 90.32% and accuracy of 90.91% in Task 8 ('L' writing), surpassing the previous best by 4.61% and 6.06%, respectively.
https://arxiv.org/abs/2410.10547
Hospitals generate thousands of handwritten prescriptions, a practice that remains prevalent despite the availability of Electronic Medical Records (EMR). This method of record-keeping hinders the examination of long-term medication effects, impedes statistical analysis, and makes the retrieval of records challenging. Handwritten prescriptions pose a unique challenge, requiring specialized data for training models to recognize medications and their patterns of recommendation. While current handwriting recognition approaches typically employ 2-D LSTMs, recent studies have explored the use of Large Language Models (LLMs) for Optical Character Recognition (OCR). Building on this approach, we focus on extracting medication names from medical records. Our methodology MIRAGE (Multimodal Identification and Recognition of Annotations in indian GEneral prescriptions) involves fine-tuning the LLaVA 1.6 and Idefics2 models. Our research utilizes a dataset provided by Medyug Technology, consisting of 743,118 fully annotated high-resolution simulated medical records from 1,133 doctors across India. We demonstrate that our methodology exhibits 82% accuracy in medication name and dosage extraction. We provide a detailed account of our research methodology and results, notes about HWR with Multimodal LLMs, and release a small dataset of 100 medical records with labels.
https://arxiv.org/abs/2410.09729
Text plays a crucial role in the transmission of human civilization, and teaching machines to generate online handwritten text in various styles presents an interesting and significant challenge. However, most prior work has concentrated on generating individual Chinese fonts, leaving complete text line generation largely unexplored. In this paper, we identify that text lines can naturally be divided into two components: layout and glyphs. Based on this division, we design a text line layout generator coupled with a diffusion-based stylized font synthesizer to address this challenge hierarchically. More concretely, the layout generator performs in-context-like learning based on the text content and the provided style references to generate positions for each glyph autoregressively. Meanwhile, the font synthesizer, which consists of a character embedding dictionary, a multi-scale calligraphy style encoder, and a 1D U-Net-based diffusion denoiser, generates each font at its position while imitating the calligraphy style extracted from the given style references. Qualitative and quantitative experiments on the CASIA-OLHWDB demonstrate that our method is capable of generating structurally correct and indistinguishable imitation samples.
https://arxiv.org/abs/2410.02309
Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.
https://arxiv.org/abs/2410.02179
Learning from Demonstration (LfD) is a useful paradigm for training policies that solve tasks involving complex motions. In practice, the successful application of LfD requires overcoming error accumulation during policy execution, i.e., the problem of drift due to errors compounding over time and the consequent out-of-distribution behaviours. Existing works seek to address this problem by scaling data collection, correcting policy errors with a human-in-the-loop, temporally ensembling policy predictions, or learning the parameters of a dynamical system model. In this work, we propose and validate an alternative approach to overcoming this issue. Inspired by reservoir computing, we develop a novel neural network layer that includes a fixed nonlinear dynamical system with tunable dynamical properties. We validate the efficacy of our neural network layer on the task of reproducing human handwriting motions using the LASA Human Handwriting Dataset. Through empirical experiments, we demonstrate that incorporating our layer into existing neural network architectures addresses the issue of compounding errors in LfD. Furthermore, we perform a comparative evaluation against existing approaches, including a temporal ensemble of policy predictions and an Echo State Network (ESN) implementation. We find that our approach yields greater policy precision and robustness on the handwriting task while also generalising to multiple dynamics regimes and maintaining competitive latency scores.
https://arxiv.org/abs/2409.18768
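The reservoir-computing idea that the abstract above builds on can be illustrated with a textbook leaky-integrator echo state update: the recurrent and input weights are drawn once and never trained, and only a readout on top of the state would be learned. This sketch is not the paper's layer. The weight scaling, the `leak` rate, and the function names are assumptions chosen to keep the example small and dependency-free.

```python
import math
import random


def make_reservoir(n_units: int, n_inputs: int, scale: float = 0.5, seed: int = 0):
    """Draw fixed (untrained) recurrent weights W and input weights W_in."""
    rng = random.Random(seed)
    w = [[rng.uniform(-scale, scale) / math.sqrt(n_units) for _ in range(n_units)]
         for _ in range(n_units)]
    w_in = [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
            for _ in range(n_units)]
    return w, w_in


def reservoir_step(state, u, w, w_in, leak: float = 0.3):
    """Leaky update: x <- (1 - leak) * x + leak * tanh(W x + W_in u)."""
    new_state = []
    for i, x_i in enumerate(state):
        pre = sum(w_ij * x_j for w_ij, x_j in zip(w[i], state))
        pre += sum(w_ik * u_k for w_ik, u_k in zip(w_in[i], u))
        new_state.append((1.0 - leak) * x_i + leak * math.tanh(pre))
    return new_state
```

The `leak` parameter is one example of a tunable dynamical property: smaller values give the state a longer memory of past inputs, larger values make it track the current input more closely.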