Language students can increase their effectiveness in learning written Japanese by mastering the visual structure and written technique of Japanese kanji. Yet existing kanji handwriting recognition systems do not assess written technique rigorously enough to discourage students from developing bad learning habits. In this paper, we describe our work on Hashigo, an interactive kanji sketching system that achieves human instructor-level critique and feedback on both the visual structure and written technique of students' sketched kanji. This type of automated critique and feedback allows students to target and correct specific deficiencies in their sketches that, if left uncorrected, are detrimental to effective long-term kanji learning.
https://arxiv.org/abs/2504.13940
Handwritten Text Recognition (HTR) is essential for document analysis and digitization. However, handwritten data often contains user-identifiable information, such as unique handwriting styles and personal lexicon choices, which can compromise privacy and erode trust in AI services. Legislation like the "right to be forgotten" underscores the necessity for methods that can expunge sensitive information from trained models. Machine unlearning addresses this by selectively removing specific data from models without necessitating complete retraining. Yet, it frequently encounters a privacy-accuracy tradeoff, where safeguarding privacy leads to diminished model performance. In this paper, we introduce a novel two-stage unlearning strategy for a multi-head transformer-based HTR model, integrating pruning and random labeling. Our proposed method utilizes a writer classification head both as an indicator and a trigger for unlearning, while maintaining the efficacy of the recognition head. To our knowledge, this represents the first comprehensive exploration of machine unlearning within HTR tasks. We further employ Membership Inference Attacks (MIA) to evaluate the effectiveness of unlearning user-identifiable information. Extensive experiments demonstrate that our approach effectively preserves privacy while maintaining model accuracy, paving the way for new research directions in the document analysis community. Our code will be publicly available upon acceptance.
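As a rough illustration of the random-labeling idea, the sketch below (PyTorch) pushes a writer-classification head toward random writer labels on the forget set while preserving recognition loss on retained data. The `encoder`, `writer_head`, and `rec_head` attributes are hypothetical names, the plain cross-entropy recognition loss is a simplification of what an HTR head would actually use (e.g., CTC), and the paper's pruning stage is not shown.

```python
import torch
import torch.nn.functional as F

def random_label_unlearn_step(model, forget_batch, retain_batch, optimizer, num_writers):
    """One hedged sketch of a random-labeling unlearning step.

    `model` is assumed to expose a shared `encoder` plus two heads:
    `writer_head` (writer classification) and `rec_head` (recognition);
    all three names are illustrative, not the paper's API.
    """
    model.train()
    optimizer.zero_grad()

    # Forget set: push the writer head toward uniformly random writer labels
    # so writer-identifiable features are unlearned.
    x_f, _ = forget_batch
    feats_f = model.encoder(x_f)
    rand_labels = torch.randint(0, num_writers, (x_f.size(0),), device=x_f.device)
    loss_forget = F.cross_entropy(model.writer_head(feats_f), rand_labels)

    # Retain set: keep the recognition head accurate (CTC in practice;
    # plain cross-entropy here for brevity).
    x_r, y_r = retain_batch
    feats_r = model.encoder(x_r)
    loss_retain = F.cross_entropy(model.rec_head(feats_r), y_r)

    (loss_forget + loss_retain).backward()
    optimizer.step()
```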
https://arxiv.org/abs/2504.08616
Physical human-robot interaction (pHRI) remains a key challenge for achieving intuitive and safe interaction with robots. Current advancements often rely on external tactile sensors as interface, which increase the complexity of robotic systems. In this study, we leverage the intrinsic tactile sensing capabilities of collaborative robots to recognize digits drawn by humans on an uninstrumented touchpad mounted to the robot's flange. We propose a dataset of robot joint torque signals along with corresponding end-effector (EEF) forces and moments, captured from the robot's integrated torque sensors in each joint, as users draw handwritten digits (0-9) on the touchpad. The pHRI-DIGI-TACT dataset was collected from different users to capture natural variations in handwriting. To enhance classification robustness, we developed a data augmentation technique to account for reversed and rotated digits inputs. A Bidirectional Long Short-Term Memory (Bi-LSTM) network, leveraging the spatiotemporal nature of the data, performs online digit classification with an overall accuracy of 94% across various test scenarios, including those involving users who did not participate in training the system. This methodology is implemented on a real robot in a fruit delivery task, demonstrating its potential to assist individuals in everyday life. Dataset and video demonstrations are available at: this https URL.
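A minimal sketch of the kind of Bi-LSTM classifier described, assuming 7 joint-torque channels plus 6 EEF force/moment values per timestep (13 input channels total); all sizes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TorqueDigitBiLSTM(nn.Module):
    """Hedged sketch: Bi-LSTM over per-timestep joint torques (e.g. 7 joints)
    concatenated with EEF forces/moments (6 values) -> 13 input channels."""

    def __init__(self, in_channels=13, hidden=128, num_digits=10):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_digits)

    def forward(self, x):                 # x: (batch, time, channels)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # classify from the last timestep

model = TorqueDigitBiLSTM()
logits = model(torch.randn(4, 300, 13))   # 4 clips, 300 timesteps each
```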
https://arxiv.org/abs/2504.00167
Transformers trained on tokenized text, audio, and images can generate high-quality autoregressive samples. But handwriting data, represented as sequences of pen coordinates, remains underexplored. We introduce a novel tokenization scheme that converts pen stroke offsets to polar coordinates, discretizes them into bins, and then turns them into sequences of tokens with which to train a standard GPT model. This allows us to capture complex stroke distributions without using any specialized architectures (e.g., the mixture density network or the self-advancing ASCII attention head from Graves 2014). With just 3,500 handwritten words and a few simple data augmentations, we are able to train a model that can generate realistic cursive handwriting. Our approach is simpler and more performant than previous RNN-based methods.
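A minimal sketch of the described tokenization, assuming raw per-step offsets and ignoring pen-up/pen-down events; the bin counts and radius cap are illustrative choices, not the paper's.

```python
import numpy as np

def tokenize_strokes(offsets, r_bins=32, theta_bins=64, r_max=30.0):
    """Hedged sketch: convert per-step pen offsets (dx, dy) to polar
    coordinates, quantize radius and angle into bins, and emit one
    integer token per step."""
    dx, dy = offsets[:, 0], offsets[:, 1]
    r = np.clip(np.hypot(dx, dy), 0, r_max)
    theta = np.arctan2(dy, dx)                                    # (-pi, pi]
    r_id = np.minimum((r / r_max * r_bins).astype(int), r_bins - 1)
    t_id = ((theta + np.pi) / (2 * np.pi) * theta_bins).astype(int) % theta_bins
    return r_id * theta_bins + t_id            # vocab size: r_bins * theta_bins

tokens = tokenize_strokes(np.random.randn(100, 2))   # ready for a GPT-style model
```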
https://arxiv.org/abs/2504.00051
Styled Handwritten Text Generation (HTG) has recently received attention from the computer vision and document analysis communities, which have developed several solutions, either GAN- or diffusion-based, that achieved promising results. Nonetheless, these strategies fail to generalize to novel styles and have technical constraints, particularly in terms of maximum output length and training efficiency. To overcome these limitations, in this work, we propose a novel framework for text image generation, dubbed Emuru. Our approach leverages a powerful text image representation model (a variational autoencoder) combined with an autoregressive Transformer, enabling the generation of styled text images conditioned on textual content and style examples, such as specific fonts or handwriting styles. We train our model solely on a diverse, synthetic dataset of English text rendered in over 100,000 typewritten and calligraphy fonts, which gives it the capability to reproduce unseen styles (both fonts and users' handwriting) in zero-shot. To the best of our knowledge, Emuru is the first autoregressive model for HTG, and the first designed specifically for generalization to novel styles. Moreover, our model generates images without background artifacts, which are easier to use for downstream applications. Extensive evaluation on both typewritten and handwritten, any-length text image generation scenarios demonstrates the effectiveness of our approach.
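As a loose sketch of the described architecture, the snippet below treats a sequence of VAE latents (the style prefix to continue) as the target of a causal Transformer decoder, with embedded text tokens as cross-attention memory; every module size and name here is illustrative, not Emuru's actual design.

```python
import torch
import torch.nn as nn

class EmuruSketch(nn.Module):
    """Hedged sketch of the described pipeline: a VAE compresses text-line
    images into a sequence of latents; an autoregressive Transformer
    continues that sequence conditioned on the target text."""

    def __init__(self, latent_dim=64, d_model=256, vocab=100):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d_model)
        self.lat_in = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lat_out = nn.Linear(d_model, latent_dim)

    def forward(self, text_ids, style_latents):
        mem = self.text_emb(text_ids)              # text as memory: (B, T_text, d)
        tgt = self.lat_in(style_latents)           # style prefix:   (B, T_lat, d)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.decoder(tgt, mem, tgt_mask=mask)  # causal self-attention
        return self.lat_out(h)                     # next-latent predictions

preds = EmuruSketch()(torch.randint(0, 100, (2, 20)), torch.randn(2, 12, 64))
```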
https://arxiv.org/abs/2503.17074
Traditional machine learning models for Handwritten Text Recognition (HTR) rely on supervised training, requiring extensive manual annotations, and often produce errors due to the separation between layout and text processing. In contrast, Multimodal Large Language Models (MLLMs) offer a general approach to recognizing diverse handwriting styles without the need for model-specific training. The study benchmarks various proprietary and open-source LLMs against Transkribus models, evaluating their performance on both modern and historical datasets written in English, French, German, and Italian. In addition, emphasis is placed on testing the models' ability to autonomously correct previously generated outputs. Findings indicate that proprietary models, especially Claude 3.5 Sonnet, outperform open-source alternatives in zero-shot settings. MLLMs achieve excellent results in recognizing modern handwriting and exhibit a preference for the English language due to their pre-training dataset composition. Comparisons with Transkribus show no consistent advantage for either approach. Moreover, LLMs demonstrate limited ability to autonomously correct errors in zero-shot transcriptions.
https://arxiv.org/abs/2503.15195
Handwritten Arabic script recognition is a challenging task due to the script's dynamic letter forms and contextual variations. This paper proposes a hybrid approach combining convolutional neural networks (CNNs) and Transformer-based architectures to address these complexities. We evaluated custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduced an ensemble model that leverages confidence-based fusion to integrate their strengths. Our ensemble achieves remarkable performance on the IFN/ENIT dataset, with 96.38% accuracy for letter classification and 97.22% for positional classification. The results highlight the complementary nature of CNNs and Transformers, demonstrating their combined potential for robust Arabic handwriting recognition. This work advances OCR systems, offering a scalable solution for real-world applications.
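A minimal sketch of confidence-based fusion under one common interpretation: per sample, keep whichever model's softmax is more confident. The paper's exact fusion rule may differ.

```python
import numpy as np

def confidence_fusion(probs_cnn, probs_vit):
    """Hedged sketch: for each sample, take the prediction of whichever
    model has the higher maximum softmax probability."""
    conf_cnn = probs_cnn.max(axis=1)
    conf_vit = probs_vit.max(axis=1)
    take_cnn = conf_cnn >= conf_vit
    fused = np.where(take_cnn[:, None], probs_cnn, probs_vit)
    return fused.argmax(axis=1)

# Example: 5 samples, 28 Arabic letter classes (random stand-in probabilities)
preds = confidence_fusion(np.random.dirichlet(np.ones(28), 5),
                          np.random.dirichlet(np.ones(28), 5))
```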
https://arxiv.org/abs/2503.15023
Handwriting verification has stood as a steadfast identity authentication method for decades. However, this technique risks potential privacy breaches due to the inclusion of personal information in handwritten biometrics such as signatures. To address this concern, we propose using the Random Digit String (RDS) for privacy-preserving handwriting verification. This approach allows users to authenticate themselves by writing an arbitrary digit sequence, effectively ensuring privacy protection. To evaluate the effectiveness of RDS, we construct a new HRDS4BV dataset composed of online naturally handwritten RDS. Unlike conventional handwriting, RDS encompasses unconstrained and variable content, posing significant challenges for modeling consistent personal writing style. To surmount this, we propose the Pattern Attentive VErification Network (PAVENet), along with a Discriminative Pattern Mining (DPM) module. DPM adaptively enhances the recognition of consistent and discriminative writing patterns, thus refining handwriting style representation. Through comprehensive evaluations, we scrutinize the applicability of online RDS verification and show that our model markedly outperforms existing methods. Furthermore, we discover a noteworthy forgery phenomenon that deviates from prior findings and discuss its positive impact in countering malicious impostor attacks. Overall, our work underscores the feasibility of privacy-preserving biometric verification and propels the prospects of its broader acceptance and application.
https://arxiv.org/abs/2503.12786
Handwritten digit recognition remains a fundamental challenge in computer vision, with applications ranging from postal code reading to document digitization. This paper presents an ensemble-based approach that combines Convolutional Neural Networks (CNNs) with traditional machine learning techniques to improve recognition accuracy and robustness. We evaluate our method on the MNIST dataset, comprising 70,000 handwritten digit images. Our hybrid model, which uses CNNs for feature extraction and Support Vector Machines (SVMs) for classification, achieves an accuracy of 99.30%. We also explore the effectiveness of data augmentation and various ensemble techniques in enhancing model performance. Our results demonstrate that this approach not only achieves high accuracy but also shows improved generalization across diverse handwriting styles. The findings contribute to the development of more reliable handwritten digit recognition systems and highlight the potential of combining deep learning with traditional machine learning methods in pattern recognition tasks.
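A minimal sketch of the hybrid CNN-feature/SVM-classifier pipeline, with an illustrative toy CNN and random stand-in data in place of MNIST; the architecture and hyperparameters are not the paper's.

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

# Hedged sketch: a small CNN as frozen feature extractor, an SVM as classifier.
cnn = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),                           # 64 * 7 * 7 = 3136-dim features
)

def extract(x):                             # x: (N, 1, 28, 28) image batch
    with torch.no_grad():
        return cnn(x).numpy()

X_train = extract(torch.randn(256, 1, 28, 28))   # stand-in for MNIST images
y_train = torch.randint(0, 10, (256,)).numpy()
svm = SVC(kernel="rbf", C=10.0).fit(X_train, y_train)
pred = svm.predict(extract(torch.randn(8, 1, 28, 28)))
```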
https://arxiv.org/abs/2503.06104
Handwritten text recognition (HTR) remains a challenging task, particularly for multi-page documents where pages share common formatting and contextual features. While modern optical character recognition (OCR) engines are proficient with printed text, their performance on handwriting is limited, often requiring costly labeled data for fine-tuning. In this paper, we explore the use of multi-modal large language models (MLLMs) for transcribing multi-page handwritten documents in a zero-shot setting. We investigate various configurations of commercial OCR engines and MLLMs, utilizing the latter both as end-to-end transcribers and as post-processors, with and without image components. We propose a novel method, '+first page', which enhances MLLM transcription by providing the OCR output of the entire document along with just the first page image. This approach leverages shared document features without incurring the high cost of processing all images. Experiments on a multi-page version of the IAM Handwriting Database demonstrate that '+first page' improves transcription accuracy, balances cost with performance, and even enhances results on out-of-sample text by extrapolating formatting and OCR error patterns from a single page.
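A rough sketch of how a '+first page' request could be assembled: OCR text for every page plus only the page-1 image. `call_mllm` is a hypothetical stand-in for an actual multimodal API, and the prompt wording is illustrative.

```python
def build_first_page_request(ocr_pages, first_page_image_bytes):
    """Hedged sketch of the '+first page' input described above: the OCR
    output of the whole document, accompanied by the first page image only."""
    ocr_text = "\n\n".join(
        f"--- Page {i + 1} (OCR) ---\n{page}" for i, page in enumerate(ocr_pages)
    )
    prompt = (
        "Below is noisy OCR output for a multi-page handwritten document, "
        "plus an image of page 1 only. Using the page-1 image to infer the "
        "writer's style and the OCR's systematic errors, produce a corrected "
        "transcription of ALL pages.\n\n" + ocr_text
    )
    return {"prompt": prompt, "images": [first_page_image_bytes]}

# request = build_first_page_request(pages, open("page1.png", "rb").read())
# transcript = call_mllm(**request)   # hypothetical multimodal LLM call
```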
https://arxiv.org/abs/2502.20295
Computer vision is a critical component in a wide range of real-world applications, including plant monitoring in agriculture and handwriting classification in digital systems. However, developing high-performance computer vision models traditionally demands both machine learning (ML) expertise and domain-specific knowledge, making the process costly, labor-intensive, and inaccessible to many. Large language model (LLM) agents have emerged as a promising solution to automate this workflow, but most existing methods share a common limitation: they attempt to optimize entire pipelines in a single step before evaluation, making it difficult to attribute improvements to specific changes. This lack of granularity leads to unstable optimization and slower convergence, limiting their effectiveness. To address this, we introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design inspired by how human ML experts iteratively refine models, focusing on one component at a time rather than making sweeping changes all at once. By systematically updating individual components based on real training feedback, Iterative Refinement improves stability, interpretability, and overall model performance. We implement this strategy in IMPROVE, an end-to-end LLM agent framework for automating and optimizing object classification pipelines. Through extensive evaluations across datasets of varying sizes and domains, including standard benchmarks and Kaggle competition datasets, we demonstrate that Iterative Refinement enables IMPROVE to consistently achieve better performance over existing zero-shot LLM-based approaches. These findings establish Iterative Refinement as an effective new strategy for LLM-driven ML automation and position IMPROVE as an accessible solution for building high-quality computer vision models without requiring ML expertise.
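A schematic sketch of the Iterative Refinement loop as described: one component is revised per round, and the change is kept only if validation accuracy improves, so gains are attributable to single changes. `llm_propose` and `train_and_eval` are hypothetical helpers, and the component list is illustrative.

```python
# Hedged sketch of Iterative Refinement over an ML pipeline represented
# as a dict mapping component name -> component spec.
COMPONENTS = ["augmentation", "architecture", "optimizer", "lr_schedule"]

def iterative_refinement(pipeline, rounds=8):
    best_acc = train_and_eval(pipeline)            # hypothetical: trains, returns val acc
    for r in range(rounds):
        component = COMPONENTS[r % len(COMPONENTS)]
        candidate = dict(pipeline)
        # Ask the LLM for a revision of just this one component,
        # given the current pipeline and training feedback.
        candidate[component] = llm_propose(component, pipeline, best_acc)  # hypothetical
        acc = train_and_eval(candidate)
        if acc > best_acc:                         # keep only verified improvements
            pipeline, best_acc = candidate, acc
    return pipeline, best_acc
```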
https://arxiv.org/abs/2502.18530
Handwritten Text Recognition (HTR) has become an essential field within pattern recognition and machine learning, with applications spanning historical document preservation to modern data entry and accessibility solutions. The complexity of HTR lies in the high variability of handwriting, which makes it challenging to develop robust recognition systems. This survey examines the evolution of HTR models, tracing their progression from early heuristic-based approaches to contemporary state-of-the-art neural models, which leverage deep learning techniques. The scope of the field has also expanded, with models initially capable of recognizing only word-level content progressing to recent end-to-end document-level approaches. Our paper categorizes existing work into two primary levels of recognition: (1) "up to line-level", encompassing word and line recognition, and (2) "beyond line-level", addressing paragraph- and document-level challenges. We provide a unified framework that examines research methodologies, recent advances in benchmarking, key datasets in the field, and a discussion of the results reported in the literature. Finally, we identify pressing research challenges and outline promising future directions, aiming to equip researchers and practitioners with a roadmap for advancing the field.
https://arxiv.org/abs/2502.08417
In this study, we explored the use of spectrograms to represent handwriting signals for assessing neurodegenerative diseases in a cohort of 42 healthy controls (CTL), 35 subjects with Parkinson's Disease (PD), 21 with Alzheimer's Disease (AD), and 15 with Parkinson's Disease Mimics (PDM). We applied CNN and CNN-BLSTM models for binary classification using both multi-channel fixed-size and frame-based spectrograms. Our results showed that handwriting tasks and spectrogram channel combinations significantly impacted classification performance. The highest F1-score (89.8%) was achieved for AD vs. CTL, while PD vs. CTL reached 74.5%, and PD vs. PDM scored 77.97%. CNN consistently outperformed CNN-BLSTM. Different sliding window lengths were tested for constructing frame-based spectrograms. A 1-second window worked best for AD, longer windows improved PD classification, and window length had little effect on PD vs. PDM.
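A minimal sketch of building a frame-based spectrogram from one handwriting channel with SciPy; the 200 Hz sampling rate and the 1-second window (the length reported to work best for AD) are illustrative assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

# Hedged sketch: one handwriting channel (e.g. pen-tip x-velocity),
# assumed sampled at 200 Hz; random data stands in for a 10 s recording.
fs = 200
signal = np.random.randn(10 * fs)

window_s = 1.0                                   # sliding-window length
f, t, Sxx = spectrogram(signal, fs=fs,
                        nperseg=int(window_s * fs),
                        noverlap=int(window_s * fs) // 2)
log_spec = np.log1p(Sxx)                         # (freq_bins, frames) CNN input
```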
https://arxiv.org/abs/2502.07025
Converting images of Arabic text into plain text is a widely researched topic in academia and industry. However, recognition of Arabic handwritten and printed text presents difficult challenges due to the complex nature and variations of the Arabic script. This work proposes an end-to-end solution for recognizing Arabic handwritten text, printed text, and Arabic numerals, and presents the data in a structured manner. We reached 81.66% precision, 78.82% recall, and 79.07% F-measure on the text detection task that powers the proposed solution. The proposed recognition model incorporates state-of-the-art CNN-based feature extraction and Transformer-based sequence modeling to accommodate variations in handwriting styles, stroke thicknesses, alignments, and noise conditions. The evaluation suggests strong performance on both printed and handwritten text, yielding 0.59% CER and 1.72% WER on printed text, and 7.91% CER and 31.41% WER on handwritten text. The overall solution has proven reliable in real-life OCR tasks: it pairs the detection and recognition models with supporting feature-extraction and matching algorithms, and its general-purpose implementation makes it applicable to any Arabic handwritten or printed document or receipt.
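For reference, the CER and WER figures above are edit-distance-based metrics; a self-contained sketch of their standard computation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def cer(ref, hyp):   # character error rate
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):   # word error rate
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```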
https://arxiv.org/abs/2502.05277
Large Language Models (LLMs) have been extensively applied in time series analysis. Yet their utility in few-shot classification of multivariate time series (a crucial scenario, given the limited training data available in industrial applications) remains underexplored. We aim to leverage the extensive pre-trained knowledge in LLMs to overcome the data scarcity problem within multivariate time series. Specifically, we propose LLMFew, an LLM-enhanced framework to investigate the feasibility and capacity of LLMs for few-shot multivariate time series classification. This model introduces a Patch-wise Temporal Convolution Encoder (PTCEnc) to align time series data with the textual embedding input of LLMs. We further fine-tune the pre-trained LLM decoder with Low-rank Adaptation (LoRA) to enhance its feature representation learning ability on time series data. Experimental results show that our model outperforms state-of-the-art baselines by a large margin, achieving 125.2% and 50.2% improvements in classification accuracy on the Handwriting and EthanolConcentration datasets, respectively. Moreover, our experiments demonstrate that LLM-based methods perform well across a variety of datasets in few-shot MTSC, delivering reliable results compared to traditional models. This success paves the way for their deployment in industrial environments where data are limited.
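A minimal sketch of what a patch-wise temporal convolution encoder could look like: a strided Conv1d that maps each patch of the multivariate series to one embedding in the LLM's input space. All sizes are illustrative, not LLMFew's actual configuration.

```python
import torch
import torch.nn as nn

class PTCEnc(nn.Module):
    """Hedged sketch of a patch-wise temporal convolution encoder: a strided
    Conv1d turns each temporal patch of the multivariate series into one
    token embedding aligned with the LLM's input dimension."""

    def __init__(self, n_vars=3, patch=16, llm_dim=768):
        super().__init__()
        self.proj = nn.Conv1d(n_vars, llm_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                     # x: (batch, time, n_vars)
        z = self.proj(x.transpose(1, 2))      # (batch, llm_dim, n_patches)
        return z.transpose(1, 2)              # (batch, n_patches, llm_dim)

emb = PTCEnc()(torch.randn(4, 160, 3))        # -> (4, 10, 768), fed to the LLM
```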
https://arxiv.org/abs/2502.00059
Brain-computer interfaces (BCIs) present a promising avenue by translating neural activity directly into text, eliminating the need for physical actions. However, existing non-invasive BCI systems have not successfully covered the entire alphabet, limiting their practicality. In this paper, we propose a novel non-invasive EEG-based BCI system with a Curriculum-based Neural Spelling Framework, which recognizes all 26 alphabet letters by first decoding neural signals associated with handwriting and then applying Generative AI (GenAI) to enhance spell-based neural language decoding. Our approach combines the ease of handwriting with the accessibility of EEG technology, utilizing advanced neural decoding algorithms and pre-trained large language models (LLMs) to translate EEG patterns into text with high accuracy. This system shows how GenAI can improve the performance of typical spelling-based neural language decoding tasks and addresses the limitations of previous methods, offering a scalable and user-friendly solution for individuals with communication impairments, thereby enhancing inclusive communication options.
https://arxiv.org/abs/2501.17489
Dyslexia affects reading and writing skills across many languages. This work describes a new application of YOLO-based object detection to isolate and label handwriting patterns (Normal, Reversal, Corrected) within synthetic images that resemble real words. Individual letters are first collected, preprocessed into 32x32 samples, then assembled into larger synthetic 'words' to simulate realistic handwriting. Our YOLOv11 framework simultaneously localizes each letter and classifies it into one of three categories, reflecting key dyslexia traits. Empirically, we achieve near-perfect performance, with precision, recall, and F1 metrics typically exceeding 0.999. This surpasses earlier single-letter approaches that rely on conventional CNNs or transfer-learning classifiers (for example, MobileNet-based methods in Robaa et al. arXiv:2410.19821). Unlike simpler pipelines that consider each letter in isolation, our solution processes complete word images, resulting in more authentic representations of handwriting. Although relying on synthetic data raises concerns about domain gaps, these experiments highlight the promise of YOLO-based detection for faster and more interpretable dyslexia screening. Future work will expand to real-world handwriting, other languages, and deeper explainability methods to build confidence among educators, clinicians, and families.
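A minimal sketch of the synthetic-word construction: 32x32 letter crops pasted side by side, each emitting a YOLO-format box. The class ids, canvas size, and padding are illustrative assumptions.

```python
import numpy as np

def assemble_word(letters, labels, canvas_h=48, pad=2):
    """Hedged sketch: paste 32x32 letter crops in a row and emit YOLO-format
    boxes (class, cx, cy, w, h, all normalized). Class ids: 0=Normal,
    1=Reversal, 2=Corrected (illustrative mapping)."""
    n, s = len(letters), 32
    canvas_w = n * (s + pad) + pad
    canvas = np.zeros((canvas_h, canvas_w), dtype=np.uint8)
    boxes = []
    y0 = (canvas_h - s) // 2
    for i, (img, cls) in enumerate(zip(letters, labels)):
        x0 = pad + i * (s + pad)
        canvas[y0:y0 + s, x0:x0 + s] = img
        boxes.append((cls, (x0 + s / 2) / canvas_w, (y0 + s / 2) / canvas_h,
                      s / canvas_w, s / canvas_h))
    return canvas, boxes

word, boxes = assemble_word([np.random.randint(0, 255, (32, 32), np.uint8)] * 4,
                            [0, 2, 0, 1])   # a 4-letter synthetic "word"
```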
https://arxiv.org/abs/2501.15263
Despite recent significant advancements in Handwritten Document Recognition (HDR), the efficient and accurate recognition of text against complex backgrounds, diverse handwriting styles, and varying document layouts remains a practical challenge. Moreover, this issue is seldom addressed in academic research, particularly in scenarios with minimal annotated data available. In this paper, we introduce the DocTTT framework to address these challenges. The key innovation of our approach is that it uses test-time training to adapt the model to each specific input during testing. We propose a novel Meta-Auxiliary learning approach that combines Meta-learning and self-supervised Masked Autoencoder (MAE). During testing, we adapt the visual representation parameters using a self-supervised MAE loss. During training, we learn the model parameters using a meta-learning framework, so that the model parameters are learned to adapt to a new input effectively. Experimental results show that our proposed method significantly outperforms existing state-of-the-art approaches on benchmark datasets.
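A schematic sketch of the test-time adaptation step as described: before transcribing one test input, only the visual-representation parameters are updated with a self-supervised MAE loss. `visual_encoder`, `mae_loss`, and `recognize` are hypothetical attribute names, and the optimizer choice is illustrative.

```python
import torch

def test_time_adapt(model, x, steps=1, lr=1e-4):
    """Hedged sketch of test-time training: adapt the visual encoder with a
    masked-reconstruction (MAE) loss on this one input, then transcribe."""
    opt = torch.optim.SGD(model.visual_encoder.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = model.mae_loss(x)        # hypothetical: reconstruct masked patches of x
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model.recognize(x)       # transcribe with the adapted encoder
```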
https://arxiv.org/abs/2501.12898
In the realm of digital forensics and document authentication, writer identification plays a crucial role in determining the authors of documents based on handwriting styles. The primary challenge in writer-id is the "open-set scenario", where the goal is to accurately recognize writers unseen during model training. Representation learning is key to overcoming this challenge: it can capture unique handwriting features, enabling the recognition of styles not previously encountered during training. Building on this concept, this paper introduces the Contrastive Masked Auto-Encoders (CMAE) for Character-level Open-Set Writer Identification. We merge Masked Auto-Encoders (MAE) with Contrastive Learning (CL) to capture sequential information and distinguish diverse handwriting styles, respectively. Demonstrating its effectiveness, our model achieves state-of-the-art (SOTA) results on the CASIA online handwriting dataset, reaching a precision rate of 89.7%. Our study advances universal writer-id with a sophisticated representation learning approach, contributing substantially to the ever-evolving landscape of digital handwriting analysis, and catering to the demands of an increasingly interconnected world.
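A minimal sketch of a combined objective in the spirit of CMAE: a masked-reconstruction term plus an InfoNCE contrastive term over two views of the same writer's characters. The weighting and exact formulation are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def cmae_loss(recon, target, z1, z2, temperature=0.1, alpha=0.5):
    """Hedged sketch: MAE branch (masked reconstruction) plus CL branch
    (InfoNCE over paired embeddings z1/z2 of two views per sample)."""
    rec = F.mse_loss(recon, target)                    # MAE branch

    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    nce = F.cross_entropy(logits, labels)              # CL branch

    return rec + alpha * nce
```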
https://arxiv.org/abs/2501.11895
This paper introduces a cost-effective robotic handwriting system designed to replicate human-like handwriting with high precision. Combining a Raspberry Pi Pico microcontroller, 3D-printed components, and a machine learning-based handwriting generation model implemented via this http URL, the system converts user-supplied text into realistic stroke trajectories. By leveraging lightweight 3D-printed materials and efficient mechanical designs, the system achieves a total hardware cost of approximately $56, significantly undercutting commercial alternatives. Experimental evaluations demonstrate handwriting precision within ±0.3 millimeters and a writing speed of approximately 200 mm/min, positioning the system as a viable solution for educational, research, and assistive applications. This study seeks to lower the barriers to personalized handwriting technologies, making them accessible to a broader audience.
https://arxiv.org/abs/2501.06783