Large Language Models (LLMs) have been extensively applied in time series analysis. Yet their utility in few-shot classification of multivariate time series, a crucial training scenario given the limited training data available in industrial applications, remains underexplored. We aim to leverage the extensive pre-trained knowledge in LLMs to overcome the data scarcity problem in multivariate time series. Specifically, we propose LLMFew, an LLM-enhanced framework that investigates the feasibility and capacity of LLMs for few-shot multivariate time series classification (MTSC). The model introduces a Patch-wise Temporal Convolution Encoder (PTCEnc) to align time series data with the textual embedding input of LLMs. We further fine-tune the pre-trained LLM decoder with Low-Rank Adaptation (LoRA) to enhance its feature representation learning on time series data. Experimental results show that our model outperforms state-of-the-art baselines by a large margin, achieving 125.2% and 50.2% improvements in classification accuracy on the Handwriting and EthanolConcentration datasets, respectively. Moreover, our experiments demonstrate that LLM-based methods perform well across a variety of datasets in few-shot MTSC, delivering reliable results compared to traditional models. This success paves the way for their deployment in industrial environments where data are limited.
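To make the encoder idea concrete, here is a minimal sketch of a patch-wise temporal convolution encoder in PyTorch. This is not the authors' implementation; the patch length and LLM embedding width (`patch_len`, `llm_dim`) are illustrative assumptions.

```python
# A minimal sketch of a patch-wise temporal convolution encoder; `patch_len`
# and `llm_dim` are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class PTCEnc(nn.Module):
    def __init__(self, n_vars, patch_len=16, llm_dim=4096):
        super().__init__()
        # one convolution per non-overlapping temporal patch
        self.conv = nn.Conv1d(n_vars, llm_dim, kernel_size=patch_len,
                              stride=patch_len)

    def forward(self, x):                  # x: (batch, n_vars, seq_len)
        tokens = self.conv(x)              # (batch, llm_dim, n_patches)
        return tokens.transpose(1, 2)      # (batch, n_patches, llm_dim)

x = torch.randn(8, 3, 128)                 # 8 series, 3 variables, 128 steps
print(PTCEnc(n_vars=3)(x).shape)           # -> torch.Size([8, 8, 4096])
```

The LoRA stage would then typically be attached with a library such as `peft` (a `LoraConfig` passed to `get_peft_model`), so that only low-rank adapter weights in the LLM decoder are updated.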
https://arxiv.org/abs/2502.00059
Brain-computer interfaces (BCIs) present a promising avenue by translating neural activity directly into text, eliminating the need for physical actions. However, existing non-invasive BCI systems have not successfully covered the entire alphabet, limiting their practicality. In this paper, we propose a novel non-invasive EEG-based BCI system with a Curriculum-based Neural Spelling Framework, which first recognizes all 26 alphabet letters by decoding neural signals associated with handwriting and then applies Generative AI (GenAI) to enhance spelling-based neural language decoding. Our approach combines the ease of handwriting with the accessibility of EEG technology, utilizing advanced neural decoding algorithms and pre-trained large language models (LLMs) to translate EEG patterns into text with high accuracy. The system shows how GenAI can improve the performance of a typical spelling-based neural language decoding task and addresses the limitations of previous methods, offering a scalable and user-friendly solution for individuals with communication impairments, thereby enhancing inclusive communication options.
https://arxiv.org/abs/2501.17489
Dyslexia affects reading and writing skills across many languages. This work describes a new application of YOLO-based object detection to isolate and label handwriting patterns (Normal, Reversal, Corrected) within synthetic images that resemble real words. Individual letters are first collected and preprocessed into 32x32 samples, then assembled into larger synthetic 'words' to simulate realistic handwriting. Our YOLOv11 framework simultaneously localizes each letter and classifies it into one of three categories, reflecting key dyslexia traits. Empirically, we achieve near-perfect performance, with precision, recall, and F1 metrics typically exceeding 0.999. This surpasses earlier single-letter approaches that rely on conventional CNNs or transfer-learning classifiers (for example, the MobileNet-based methods of Robaa et al., arXiv:2410.19821). Unlike simpler pipelines that consider each letter in isolation, our solution processes complete word images, resulting in more authentic representations of handwriting. Although relying on synthetic data raises concerns about domain gaps, these experiments highlight the promise of YOLO-based detection for faster and more interpretable dyslexia screening. Future work will expand to real-world handwriting, other languages, and deeper explainability methods to build confidence among educators, clinicians, and families.
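A plausible sketch of the synthetic-word assembly step, assuming 32x32 letter crops and YOLO-style normalized labels (class, x-center, y-center, width, height); the padding and canvas size are invented for illustration.

```python
# Illustrative assembly of 32x32 letter crops into a synthetic "word" image
# with YOLO-format labels, all normalized to the canvas; padding and canvas
# height are invented here, not taken from the paper.
import numpy as np

def make_word(letters, classes, canvas_h=64, pad=4):
    canvas_w = pad + len(letters) * (32 + pad)
    canvas = np.full((canvas_h, canvas_w), 255, dtype=np.uint8)
    labels, x, y = [], pad, (canvas_h - 32) // 2
    for glyph, cls in zip(letters, classes):
        canvas[y:y + 32, x:x + 32] = glyph
        labels.append((cls, (x + 16) / canvas_w, (y + 16) / canvas_h,
                       32 / canvas_w, 32 / canvas_h))
        x += 32 + pad
    return canvas, labels

glyphs = [np.random.randint(0, 256, (32, 32), dtype=np.uint8) for _ in range(4)]
img, lbls = make_word(glyphs, classes=[0, 2, 1, 0])  # 0 Normal, 1 Reversal, 2 Corrected
```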
https://arxiv.org/abs/2501.15263
Despite recent significant advancements in Handwritten Document Recognition (HDR), the efficient and accurate recognition of text against complex backgrounds, diverse handwriting styles, and varying document layouts remains a practical challenge. Moreover, this issue is seldom addressed in academic research, particularly in scenarios with minimal annotated data available. In this paper, we introduce the DocTTT framework to address these challenges. The key innovation of our approach is that it uses test-time training to adapt the model to each specific input during testing. We propose a novel Meta-Auxiliary learning approach that combines Meta-learning and self-supervised Masked Autoencoder (MAE). During testing, we adapt the visual representation parameters using a self-supervised MAE loss. During training, we learn the model parameters using a meta-learning framework, so that the model parameters are learned to adapt to a new input effectively. Experimental results show that our proposed method significantly outperforms existing state-of-the-art approaches on benchmark datasets.
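The test-time training step can be pictured as a few self-supervised gradient updates per test input before recognition. The sketch below uses a toy random mask and toy modules; the paper's actual MAE masking scheme and architecture are not reproduced here.

```python
# Conceptual test-time adaptation: a few gradient updates on a
# masked-reconstruction loss per test input. Mask, modules, and step
# count are toy stand-ins, not the paper's configuration.
import copy
import torch
import torch.nn as nn

def random_mask(x, ratio=0.6):
    keep = (torch.rand_like(x) > ratio).float()
    return x * keep, x                     # masked input, reconstruction target

def test_time_adapt(encoder, mae_head, x, steps=3, lr=1e-4):
    enc = copy.deepcopy(encoder)           # adapt a fresh copy per test input
    opt = torch.optim.SGD(enc.parameters(), lr=lr)
    for _ in range(steps):
        x_masked, target = random_mask(x)
        loss = nn.functional.mse_loss(mae_head(enc(x_masked)), target)
        opt.zero_grad(); loss.backward(); opt.step()
    return enc                             # adapted visual representation

adapted = test_time_adapt(nn.Linear(16, 8), nn.Linear(8, 16), torch.randn(4, 16))
```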
https://arxiv.org/abs/2501.12898
In the realm of digital forensics and document authentication, writer identification plays a crucial role in determining the authors of documents based on handwriting styles. The primary challenge in writer identification (writer-id) is the "open-set scenario", where the goal is to accurately recognize writers unseen during model training. To overcome this challenge, representation learning is key: it can capture unique handwriting features, enabling the model to recognize styles not encountered during training. Building on this concept, this paper introduces Contrastive Masked Auto-Encoders (CMAE) for character-level open-set writer identification. We merge Masked Auto-Encoders (MAE) with Contrastive Learning (CL) to capture sequential information and to distinguish diverse handwriting styles, respectively. Demonstrating its effectiveness, our model achieves state-of-the-art (SOTA) results on the CASIA online handwriting dataset, reaching an impressive precision of 89.7%. Our study advances universal writer-id with a sophisticated representation learning approach, contributing substantially to the ever-evolving landscape of digital handwriting analysis and catering to the demands of an increasingly interconnected world.
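At a high level, such an objective pairs a reconstruction term with a contrastive term. A minimal sketch follows, using an NT-Xent contrastive loss over two views; the weighting `lam` and temperature are assumed hyperparameters, not the paper's values.

```python
# Sketch of a combined masked-reconstruction + contrastive objective.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.T / tau
    sim.fill_diagonal_(float('-inf'))      # a sample is not its own positive
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)   # pull paired views together

def cmae_loss(recon, target, z1, z2, lam=0.5):
    return F.mse_loss(recon, target) + lam * nt_xent(z1, z2)

loss = cmae_loss(torch.randn(16, 64), torch.randn(16, 64),
                 torch.randn(16, 128), torch.randn(16, 128))
```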
https://arxiv.org/abs/2501.11895
This paper introduces a cost-effective robotic handwriting system designed to replicate human-like handwriting with high precision. Combining a Raspberry Pi Pico microcontroller, 3D-printed components, and a machine learning-based handwriting generation model implemented via this http URL, the system converts user-supplied text into realistic stroke trajectories. By leveraging lightweight 3D-printed materials and efficient mechanical designs, the system achieves a total hardware cost of approximately \$56, significantly undercutting commercial alternatives. Experimental evaluations demonstrate handwriting precision within $\pm$0.3 millimeters and a writing speed of approximately 200 mm/min, positioning the system as a viable solution for educational, research, and assistive applications. This study seeks to lower the barriers to personalized handwriting technologies, making them accessible to a broader audience.
https://arxiv.org/abs/2501.06783
Extracting medication names from handwritten doctor prescriptions is challenging due to the wide variability in handwriting styles and prescription formats. This paper presents a robust method for extracting medicine names using a combination of Mask R-CNN and Transformer-based Optical Character Recognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A novel dataset, featuring diverse handwritten prescriptions from various regions of Pakistan, was utilized to fine-tune the model on different handwriting styles. The Mask R-CNN model segments the prescription images to focus on the medicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and Positional Embeddings, transcribes the isolated text. The transcribed text is then matched against a pre-existing database for accurate identification. The proposed approach achieved a character error rate (CER) of 1.4% on standard benchmarks, highlighting its potential as a reliable and efficient tool for automating medicine name extraction.
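A rough sketch of such a pipeline built from public building blocks: torchvision's COCO-pretrained Mask R-CNN for region proposals and Microsoft's `trocr-base-handwritten` checkpoint for transcription, with a closest-string match against a drug list. A stock COCO detector would of course require the paper's fine-tuning before it can find text regions; this only shows the data flow.

```python
# Pipeline sketch: detect regions -> transcribe crops -> match against a DB.
import difflib
import torch
import torchvision.transforms.functional as TF
from torchvision.models.detection import maskrcnn_resnet50_fpn
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

detector = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
trocr = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

def read_medicines(image, drug_db, score_thresh=0.8):
    with torch.no_grad():
        det = detector([TF.to_tensor(image)])[0]
    names = []
    for box, score in zip(det["boxes"], det["scores"]):
        if score < score_thresh:
            continue
        crop = image.crop(tuple(int(v) for v in box))
        pixels = processor(images=crop, return_tensors="pt").pixel_values
        text = processor.batch_decode(trocr.generate(pixels),
                                      skip_special_tokens=True)[0]
        match = difflib.get_close_matches(text, drug_db, n=1)  # DB lookup
        names.append(match[0] if match else text)
    return names
```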
https://arxiv.org/abs/2412.18199
The generation of realistic-looking, readable images of handwritten text is a challenging task referred to as handwritten text generation (HTG). Given a string and examples from a writer, the goal is to synthesize an image depicting the correctly spelled word in handwriting with the calligraphic style of the desired writer. An important application of HTG is the generation of training images to adapt downstream models to new data sets. With their success in natural image generation, diffusion models (DMs) have become the state-of-the-art approach in HTG. In this work, we present an extension of a latent DM for HTG that enables the generation of writing styles not seen during training by learning style conditioning with a masked autoencoder. Our proposed content encoder allows for different ways of conditioning the DM on textual and calligraphic features. Additionally, we employ classifier-free guidance and explore its influence on the quality of the generated training images. For adapting the model to a new unlabeled data set, we propose a semi-supervised training scheme. We evaluate our approach on the IAM database and use the RIMES database to examine the generation of data not seen during training, achieving improvements in this particularly promising application of DMs for HTG.
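For reference, the standard form of classifier-free guidance mixes the conditional and unconditional noise predictions of the diffusion model with a guidance weight $w$; here $c$ loosely denotes the combined textual and calligraphic condition and $\varnothing$ the dropped condition, since the abstract does not spell out the exact formulation:

\[ \hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \, \bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr) \]

Setting $w = 1$ recovers the purely conditional model, while larger $w$ trades diversity for stronger adherence to the condition, which is what the quality exploration above varies.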
https://arxiv.org/abs/2412.15853
The growing prevalence of online handwriting has created a critical need for effective retrieval systems that can accurately search for handwriting instances from specific writers, a task known as online writer retrieval. Despite the growing demand, this field suffers from a scarcity of well-established methodologies and public large-scale datasets. This paper tackles these challenges with a focus on Chinese handwritten phrases. First, we propose DOLPHIN, a novel retrieval model designed to enhance handwriting representations through synergistic temporal-frequency analysis. For frequency feature learning, we propose the HFGA block, which performs gated cross-attention between the vanilla temporal handwriting sequence and its high-frequency sub-bands to amplify salient writing details. For temporal feature learning, we propose the CAIR block, tailored to promote channel interaction and reduce channel redundancy. Second, to address the data deficit, we introduce OLIWER, a large-scale online writer retrieval dataset encompassing over 670,000 Chinese handwritten phrases from 1,731 individuals. Through extensive evaluations, we demonstrate the superior performance of DOLPHIN over existing methods. In addition, we explore cross-domain writer retrieval and reveal the pivotal role of increasing feature alignment in bridging the distributional gap between different handwriting data. Our findings emphasize the significance of point sampling frequency and pressure features in improving handwriting representation quality and retrieval performance. Code and dataset are available at this https URL.
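The HFGA idea, gated cross-attention from the temporal sequence onto its high-frequency sub-bands, might look roughly like the following sketch; all shapes and the gating form are assumptions, not the paper's exact block.

```python
# Loose sketch of gated cross-attention: the temporal handwriting sequence
# queries high-frequency sub-band features (e.g. from a wavelet decomposition),
# and a learned gate controls how much attended detail is mixed back in.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, temporal, high_freq):
        detail, _ = self.attn(temporal, high_freq, high_freq)
        g = self.gate(torch.cat([temporal, detail], dim=-1))
        return temporal + g * detail       # gated residual fusion

seq = torch.randn(2, 100, 64)              # (batch, points, dim)
hf = torch.randn(2, 100, 64)               # high-frequency sub-band features
out = GatedCrossAttention(64)(seq, hf)     # same shape as `seq`
```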
https://arxiv.org/abs/2412.11668
The problem of converting images of text into plain text is a widely researched topic in both academia and industry. Arabic Handwritten Text Recognition (AHTR) poses additional challenges due to diverse handwriting styles and limited labeled data. In this paper, we present a complete OCR pipeline that starts with line segmentation using Differentiable Binarization and Adaptive Scale Fusion techniques to ensure accurate detection of text lines. Following segmentation, a CNN-BiLSTM-CTC architecture is applied to recognize characters. Our system, trained on the Arabic Multi-Fonts Dataset (AMFDS), achieves a Character Recognition Rate (CRR) of 99.20% and a Word Recognition Rate (WRR) of 93.75% on single-word samples containing 7 to 10 characters, along with a CRR of 83.76% for sentences. These results demonstrate the system's strong performance in handling Arabic scripts, establishing a new benchmark for AHTR systems.
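For readers unfamiliar with the architecture, a schematic CNN-BiLSTM-CTC recognizer looks like the following; layer sizes and the alphabet size are placeholders rather than the paper's configuration.

```python
# Schematic CRNN: convolutional features, a bidirectional LSTM over the
# width axis, and per-timestep class logits trained with CTC.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_classes, img_h=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(128 * (img_h // 4), 256, bidirectional=True,
                           batch_first=True)
        self.fc = nn.Linear(512, n_classes + 1)      # +1 for the CTC blank

    def forward(self, x):                  # x: (batch, 1, H, W)
        f = self.cnn(x)                    # (batch, 128, H/4, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (batch, W/4, 128*H/4)
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)    # per-timestep log-probs

logits = CRNN(n_classes=40)(torch.randn(2, 1, 32, 128))
loss_fn = nn.CTCLoss(blank=40)             # blank index = last class slot
```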
https://arxiv.org/abs/2412.01601
Hand preference and degree of handedness (DoH) are two different aspects of human behavior that are often confused to be one. DoH is an inherent capability of the brain, shaped by both nature and nurture. In this study, we used dominant and non-dominant handwriting traits to assess DoH for the first time, on 43 subjects across three categories: Unidextrous, Partially Unidextrous, and Ambidextrous. Features extracted from the segmented handwriting signals, called strokes, were used for DoH quantification. The Davies-Bouldin Index, a multilayer perceptron, and a Convolutional Neural Network (CNN) were used for automated grading of DoH. The outcomes of these methods were compared with the widely used DoH assessment questionnaires from the Edinburgh Inventory (EI). The CNN-based automated grading outperformed the other computational methods, with an average classification accuracy of 95.06% under stratified 10-fold cross-validation. A leave-one-subject-out strategy on this CNN yielded a test individual's DoH score, which was converted into a 4-point score. Around 90% of the scores obtained from all the implemented computational methods were found to be in accordance with the EI scores at a 95% confidence interval. Automated grading of the degree of handedness using handwriting signals can provide more resolution to the Edinburgh Inventory scores. This could be used in multiple applications in neuroscience, rehabilitation, physiology, psychometry, behavioral sciences, and forensics.
https://arxiv.org/abs/2412.01587
The Virginia Tech University Libraries (VTUL) Digital Library Platform (DLP) hosts digital collections that offer our users access to a wide variety of documents of historical and cultural importance. These collections are not only of academic importance but also give our users a glimpse of local historical events. Our DLP contains collections comprising digital objects featuring complex layouts, faded imagery, and hard-to-read handwritten text, which makes providing online access to these materials challenging. To address these issues, we integrate AI into our DLP workflow and convert the text in the digital objects into a machine-readable format. To enhance the user experience with our historical collections, we use custom AI agents for handwriting recognition and text extraction, and large language models (LLMs) for summarization. This poster highlights three collections focusing on handwritten letters, newspapers, and digitized topographic maps. We discuss the challenges with each collection and detail our approaches to address them. Our proposed methods aim to enhance the user experience by making the contents of these collections easier to search and navigate.
https://arxiv.org/abs/2411.17600
The generation of handwritten music sheets is a crucial step toward enhancing Optical Music Recognition (OMR) systems, which rely on large and diverse datasets for optimal performance. However, handwritten music sheets, often found in archives, present challenges for digitisation due to their fragility, varied handwriting styles, and image quality. This paper addresses the data scarcity problem by applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheets. We provide a comprehensive evaluation of three GAN models - DCGAN, ProGAN, and CycleWGAN - comparing their ability to generate diverse and high-quality handwritten music images. The proposed CycleWGAN model, which enhances style transfer and training stability, significantly outperforms DCGAN and ProGAN in both qualitative and quantitative evaluations. CycleWGAN achieves superior performance, with an FID score of 41.87, an IS of 2.29, and a KID of 0.05, making it a promising solution for improving OMR systems.
https://arxiv.org/abs/2411.16405
Ge'ez, an ancient Ethiopic script of cultural and historical significance, has been largely neglected in handwriting recognition research, hindering the digitization of valuable manuscripts. Our study addresses this gap by developing a state-of-the-art Ge'ez handwriting recognition system using Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. Our approach uses a two-stage recognition process: first, a CNN is trained to recognize individual characters; it then acts as a feature extractor for an LSTM-based word recognition system. This two-stage approach achieves new top scores in Ge'ez handwriting recognition, outperforming eight state-of-the-art methods, including SVTR and ASTER, as well as human performance, as measured on the HHD-Ethiopic dataset. This research significantly advances the preservation and accessibility of Ge'ez cultural heritage, with implications for historical document digitization, educational tools, and cultural preservation. The code will be released upon acceptance.
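The two-stage idea, a character-level CNN reused as a feature extractor for a word-level LSTM, can be sketched as follows; all module shapes are illustrative only.

```python
# Stage 1: train a character CNN; stage 2: reuse its features per character
# window as inputs to an LSTM word recognizer. Shapes are placeholders.
import torch
import torch.nn as nn

char_cnn = nn.Sequential(                  # stage 1 backbone (head omitted)
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
    nn.Flatten(), nn.Linear(32 * 16, 128),
)
# ... train char_cnn with a classification head on individual characters ...

word_lstm = nn.LSTM(128, 256, batch_first=True)    # stage 2 word recognizer

def word_features(windows):                # windows: (batch, T, 1, 32, 32)
    b, t = windows.shape[:2]
    feats = char_cnn(windows.flatten(0, 1))        # (b*t, 128) CNN features
    return word_lstm(feats.view(b, t, 128))[0]     # (b, t, 256)

out = word_features(torch.randn(2, 10, 1, 32, 32))
```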
https://arxiv.org/abs/2411.13350
Dysgraphia is a learning disorder that affects handwriting abilities, making it challenging for children to write legibly and consistently. Early detection and monitoring are crucial for providing timely support and interventions. This study applies deep learning techniques to address the dual tasks of dysgraphia detection and optical character recognition (OCR) on handwriting samples from children with potential dysgraphic symptoms. Using a dataset of handwritten samples from Malaysian schoolchildren, we developed a custom Convolutional Neural Network (CNN) model, alongside VGG16 and ResNet50, to classify handwriting as dysgraphic or non-dysgraphic. The custom CNN model outperformed the pre-trained models, achieving a test accuracy of 91.8% with high precision, recall, and AUC, demonstrating its robustness in identifying dysgraphic handwriting features. Additionally, an OCR pipeline was created to segment and recognize individual characters in dysgraphic handwriting, achieving a character recognition accuracy of approximately 43.5%. This research highlights the potential of deep learning in supporting dysgraphia assessment, laying a foundation for tools that could assist educators and clinicians in identifying dysgraphia and tracking handwriting progress over time. The findings contribute to advancements in assistive technologies for learning disabilities, offering hope for more accessible and accurate diagnostic tools in educational and clinical settings.
https://arxiv.org/abs/2411.13595
In recent years, brain-computer interfaces have made advances in decoding various motor-related tasks, including gesture recognition and movement classification, utilizing electroencephalogram (EEG) data. These developments are fundamental in exploring how neural signals can be interpreted to recognize specific physical actions. This study centers on a written alphabet classification task, where we aim to decode EEG signals associated with handwriting. To achieve this, we incorporate hand kinematics to guide the extraction of consistent embeddings from high-dimensional neural recordings using auxiliary variables (CEBRA). These CEBRA embeddings, along with the EEG, are processed by a parallel convolutional neural network model that extracts features from both data sources simultaneously. The model classifies nine different handwritten characters, including symbols such as exclamation marks and commas, within the alphabet. We evaluate the model using a quantitative five-fold cross-validation approach and explore the structure of the embedding space through visualizations. Our approach achieves a classification accuracy of 91% for the nine-class task, demonstrating the feasibility of fine-grained handwriting decoding from EEG.
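CEBRA is available as an open-source library with a scikit-learn-style interface. A hedged sketch of the behavior-guided embedding step follows; data shapes and hyperparameters are placeholders, not the paper's configuration.

```python
# Behavior-guided embedding with the CEBRA library (sketch).
import numpy as np
from cebra import CEBRA

eeg = np.random.randn(5000, 64)            # (time, EEG channels), placeholder
kinematics = np.random.randn(5000, 2)      # assumed pen-tip kinematics

model = CEBRA(model_architecture="offset10-model", output_dimension=8,
              batch_size=512, max_iterations=2000)
model.fit(eeg, kinematics)                 # kinematics as auxiliary variable
embedding = model.transform(eeg)           # (time, 8) consistent embedding
# The embedding and the raw EEG then feed the two branches of the parallel CNN.
```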
https://arxiv.org/abs/2411.09170
Objective: We present the PaHaW Parkinson's disease handwriting database, consisting of handwriting samples from Parkinson's disease (PD) patients and healthy controls. Our goal is to show that kinematic features and pressure features in handwriting can be used for the differential diagnosis of PD.
Methods and Material: The database contains records from 37 PD patients and 38 healthy controls performing eight different handwriting tasks. The tasks include drawing an Archimedean spiral, repetitively writing orthographically simple syllables and words, and writing a sentence. In addition to the conventional kinematic features related to the dynamics of handwriting, we investigated new pressure features based on the pressure exerted on the writing surface. To discriminate between PD patients and healthy subjects, three different classifiers were compared: K-nearest neighbors (K-NN), an ensemble AdaBoost classifier, and support vector machines (SVM).
Results: For predicting PD based on kinematic and pressure features of handwriting, the best performing model was the SVM, with a classification accuracy of $P_{acc}$ = 81.3% (sensitivity $P_{sen}$ = 87.4% and specificity $P_{spe}$ = 80.9%). When evaluated separately, pressure features proved to be relevant for PD diagnosis, yielding $P_{acc}$ = 82.5% compared to $P_{acc}$ = 75.4% using kinematic features.
Conclusion: Experimental results showed that an analysis of kinematic and pressure features during handwriting can help assess subtle characteristics of handwriting and discriminate between PD patients and healthy controls.
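As a toy illustration of the classification setup (the paper's feature set is far richer), a handful of kinematic and pressure descriptors per recording can be fed to an SVM under cross-validation. The synthetic recordings and labels below are placeholders.

```python
# Kinematic + pressure descriptors per recording, classified with an SVM.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def features(x, y, p, dt=0.01):            # pen x/y positions and pressure
    vx, vy = np.gradient(x, dt), np.gradient(y, dt)
    speed = np.hypot(vx, vy)
    accel = np.gradient(speed, dt)
    return [speed.mean(), speed.std(), np.abs(accel).mean(), p.mean(), p.std()]

rng = np.random.default_rng(0)              # placeholder recordings and labels
recordings = [(rng.standard_normal(500), rng.standard_normal(500),
               rng.random(500)) for _ in range(75)]
labels = np.array([1] * 37 + [0] * 38)       # 37 PD patients, 38 controls
X = np.array([features(*rec) for rec in recordings])
scores = cross_val_score(SVC(kernel="rbf"), X, labels, cv=5)
```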
https://arxiv.org/abs/2411.03044
Generating context-adaptive manipulation and grasping actions is a challenging problem in robotics. Classical planning and control algorithms tend to be inflexible with regard to parameterization by external variables such as object shapes. In contrast, Learning from Demonstration (LfD) approaches, by their nature as function approximators, allow external variables to be introduced to modulate policies in response to the environment. In this paper, we exploit this property by introducing an LfD approach to acquire context-dependent grasping and manipulation strategies. We treat the problem as kernel-based function approximation, where the kernel inputs include generic context variables describing task-dependent parameters such as the object shape. We build on existing work on policy fusion with uncertainty quantification to propose a state-dependent approach that automatically returns to the demonstrations, avoiding unpredictable behavior while smoothly adapting to context changes. The approach is evaluated on the LASA handwriting dataset and on a real 7-DoF robot in two scenarios: adaptation to slippage while grasping, and manipulation of a deformable food item.
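The kernel-based view can be illustrated with a toy Nadaraya-Watson policy: demonstrations indexed by a context variable are blended with Gaussian weights, so the reproduced trajectory interpolates smoothly as the context changes. The context meaning and bandwidth below are invented for illustration.

```python
# Toy kernel blending of demonstrated trajectories by context value.
import numpy as np

def kernel_policy(context, demo_contexts, demo_trajs, bandwidth=0.1):
    # demo_trajs: (n_demos, T, dof); one trajectory per demonstrated context
    w = np.exp(-0.5 * ((context - demo_contexts) / bandwidth) ** 2)
    w = w / w.sum()                        # Nadaraya-Watson weights
    return np.tensordot(w, demo_trajs, axes=1)   # blended trajectory (T, dof)

demos_c = np.array([0.2, 0.5, 0.8])        # e.g. demonstrated object widths
demos_x = np.random.randn(3, 100, 7)       # three 7-DoF demo trajectories
traj = kernel_policy(0.6, demos_c, demos_x)
```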
https://arxiv.org/abs/2410.24035
There is an immense quantity of historical and cultural documentation that exists only as handwritten manuscripts. At the same time, performing OCR across scripts and different handwriting styles has proven to be an enormously difficult problem relative to the process of digitizing print. While recent Transformer-based models have achieved relatively strong performance, they rely heavily on manually transcribed training data and have difficulty generalizing across writers. Multimodal LLMs, such as GPT-4V and Gemini, have demonstrated effectiveness in performing OCR and computer vision tasks with few-shot prompting. In this paper, I evaluate the accuracy of handwritten document transcriptions generated by Gemini against current state-of-the-art Transformer-based methods. Keywords: Optical Character Recognition, Multimodal Language Models, Cultural Preservation, Mass Digitization, Handwriting Recognition
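A hedged sketch of few-shot transcription with the `google-generativeai` SDK; the model name, prompt wording, and file paths are illustrative, since the paper does not specify its exact setup.

```python
# Few-shot handwritten-page transcription via a multimodal LLM (sketch).
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_KEY")        # placeholder credential
model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model name

shots = [("example_page.png", "example transcription ...")]  # few-shot pairs
parts = ["Transcribe the handwritten text in the final image, preserving "
         "line breaks. Examples follow."]
for path, text in shots:
    parts += [Image.open(path), f"Transcription: {text}"]
parts.append(Image.open("target_page.png"))

print(model.generate_content(parts).text)  # model's transcription
```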
https://arxiv.org/abs/2410.24034
Cognitive decline is a natural part of aging, often resulting in reduced cognitive abilities. In some cases, however, this decline is more pronounced, typically due to disorders such as Alzheimer's disease. Early detection of anomalous cognitive decline is crucial, as it can facilitate timely professional intervention. While medical data can help in this detection, it often involves invasive procedures. An alternative approach is to employ non-intrusive techniques such as speech or handwriting analysis, which do not necessarily affect daily activities. This survey reviews the most relevant methodologies that use deep learning techniques to automate the cognitive decline estimation task, including audio, text, and visual processing. We discuss the key features and advantages of each modality and methodology, including state-of-the-art approaches like Transformer architecture and foundation models. In addition, we present works that integrate different modalities to develop multimodal models. We also highlight the most significant datasets and the quantitative results from studies using these resources. From this review, several conclusions emerge. In most cases, the textual modality achieves the best results and is the most relevant for detecting cognitive decline. Moreover, combining various approaches from individual modalities into a multimodal model consistently enhances performance across nearly all scenarios.
https://arxiv.org/abs/2410.18972