Handwriting verification has stood as a steadfast identity authentication method for decades. However, this technique risks potential privacy breaches due to the inclusion of personal information in handwritten biometrics such as signatures. To address this concern, we propose using the Random Digit String (RDS) for privacy-preserving handwriting verification. This approach allows users to authenticate themselves by writing an arbitrary digit sequence, effectively ensuring privacy protection. To evaluate the effectiveness of RDS, we construct a new HRDS4BV dataset composed of online naturally handwritten RDS. Unlike conventional handwriting, RDS encompasses unconstrained and variable content, posing significant challenges for modeling consistent personal writing style. To surmount this, we propose the Pattern Attentive VErification Network (PAVENet), along with a Discriminative Pattern Mining (DPM) module. DPM adaptively enhances the recognition of consistent and discriminative writing patterns, thus refining handwriting style representation. Through comprehensive evaluations, we scrutinize the applicability of online RDS verification and showcase a pronounced outperformance of our model over existing methods. Furthermore, we discover a noteworthy forgery phenomenon that deviates from prior findings and discuss its positive impact in countering malicious impostor attacks. Substantially, our work underscores the feasibility of privacy-preserving biometric verification and propels the prospects of its broader acceptance and application.
几十年来,手写验证一直是一种可靠的身份认证方法。然而,由于在诸如签名等手写生物特征中包含个人信息,这种方法存在潜在的隐私泄露风险。为了解决这一问题,我们提出使用随机数字字符串(RDS)来进行保护隐私的手写验证。通过这种方式,用户可以凭书写任意一串数字来完成身份验证,从而有效地保障了个人隐私安全。为了评估RDS的有效性,我们构建了一个新的HRDS4BV数据集,该数据集中包含了在线自然手写的随机数字字符串。与传统手写不同的是,RDS包含不受约束且可变的内容,这对手写风格的一致性建模提出了重大挑战。 为克服这些困难,我们提出了一种模式注意验证网络(PAVENet),并引入了一个区别式模式挖掘模块(DPM)。该模块能够自适应地增强对一致性和区分性强的书写模式的识别能力,从而优化手写样式的表示。通过全面评估,在线RDS验证的应用性得到了检验,并且我们的模型在现有方法中展现了显著优势。 此外,我们还发现了一种值得注意的伪造现象,这一现象偏离了以往的研究结果,并讨论了它对恶意冒充攻击的正面影响。总体而言,这项工作强调了隐私保护生物特征认证的可能性,并推动了其更广泛接受和应用的发展前景。
https://arxiv.org/abs/2503.12786
Handwritten digit recognition remains a fundamental challenge in computer vision, with applications ranging from postal code reading to document digitization. This paper presents an ensemble-based approach that combines Convolutional Neural Networks (CNNs) with traditional machine learning techniques to improve recognition accuracy and robustness. We evaluate our method on the MNIST dataset, comprising 70,000 handwritten digit images. Our hybrid model, which uses CNNs for feature extraction and Support Vector Machines (SVMs) for classification, achieves an accuracy of 99.30%. We also explore the effectiveness of data augmentation and various ensemble techniques in enhancing model performance. Our results demonstrate that this approach not only achieves high accuracy but also shows improved generalization across diverse handwriting styles. The findings contribute to the development of more reliable handwritten digit recognition systems and highlight the potential of combining deep learning with traditional machine learning methods in pattern recognition tasks.
手写数字识别是计算机视觉中的一个基本挑战,其应用范围从邮政编码读取到文档数字化。本文提出了一种基于集成的方法,该方法结合了卷积神经网络(CNN)和传统机器学习技术,以提高识别的准确性和鲁棒性。我们在MNIST数据集上评估了我们的方法,该数据集包含70,000张手写数字图像。我们混合模型使用CNN进行特征提取,并用支持向量机(SVM)进行分类,在测试中达到了99.30%的准确率。此外,我们还探讨了数据增强和各种集成技术在提高模型性能方面的有效性。我们的结果表明,该方法不仅实现了高精度,而且对不同书写风格也表现出更好的泛化能力。这些发现为开发更可靠的手写数字识别系统做出了贡献,并突显了结合深度学习与传统机器学习方法在模式识别任务中的潜力。
https://arxiv.org/abs/2503.06104
Handwritten text recognition (HTR) remains a challenging task, particularly for multi-page documents where pages share common formatting and contextual features. While modern optical character recognition (OCR) engines are proficient with printed text, their performance on handwriting is limited, often requiring costly labeled data for fine-tuning. In this paper, we explore the use of multi-modal large language models (MLLMs) for transcribing multi-page handwritten documents in a zero-shot setting. We investigate various configurations of commercial OCR engines and MLLMs, utilizing the latter both as end-to-end transcribers and as post-processors, with and without image components. We propose a novel method, '+first page', which enhances MLLM transcription by providing the OCR output of the entire document along with just the first page image. This approach leverages shared document features without incurring the high cost of processing all images. Experiments on a multi-page version of the IAM Handwriting Database demonstrate that '+first page' improves transcription accuracy, balances cost with performance, and even enhances results on out-of-sample text by extrapolating formatting and OCR error patterns from a single page.
手写文本识别(HTR)仍然是一个具有挑战性的任务,特别是在多页文档中,这些文档共享相同的格式和上下文特征。尽管现代光学字符识别(OCR)引擎在处理印刷文本方面表现出色,但它们对手写文本的性能有限,并且通常需要昂贵的标注数据来进行微调。在这篇论文中,我们探讨了使用多模态大型语言模型(MLLMs)以零样本设置转录多页手写文档的方法。我们研究了各种商业OCR引擎和MLLM配置,利用后者作为端到端转录器以及有或没有图像组件的后处理工具。 我们提出了一种新颖的方法,“+首页面”,通过提供整个文档的OCR输出加上仅第一张图片来增强MLLM转录效果。这种方法利用了共享的文档特征,同时避免了处理所有图像所需的高昂成本。在IAM手写数据库的多页版本上的实验表明,“+首页面”方法提高了转录准确性,并且能够在不增加成本的情况下平衡性能;此外,在样本外文本上,该方法还能通过从单张图片中推断格式和OCR错误模式来提高结果质量。
https://arxiv.org/abs/2502.20295
Computer vision is a critical component in a wide range of real-world applications, including plant monitoring in agriculture and handwriting classification in digital systems. However, developing high-performance computer vision models traditionally demands both machine learning (ML) expertise and domain-specific knowledge, making the process costly, labor-intensive, and inaccessible to many. Large language model (LLM) agents have emerged as a promising solution to automate this workflow, but most existing methods share a common limitation: they attempt to optimize entire pipelines in a single step before evaluation, making it difficult to attribute improvements to specific changes. This lack of granularity leads to unstable optimization and slower convergence, limiting their effectiveness. To address this, we introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design inspired by how human ML experts iteratively refine models, focusing on one component at a time rather than making sweeping changes all at once. By systematically updating individual components based on real training feedback, Iterative Refinement improves stability, interpretability, and overall model performance. We implement this strategy in IMPROVE, an end-to-end LLM agent framework for automating and optimizing object classification pipelines. Through extensive evaluations across datasets of varying sizes and domains, including standard benchmarks and Kaggle competition datasets, we demonstrate that Iterative Refinement enables IMPROVE to consistently achieve better performance over existing zero-shot LLM-based approaches. These findings establish Iterative Refinement as an effective new strategy for LLM-driven ML automation and position IMPROVE as an accessible solution for building high-quality computer vision models without requiring ML expertise.
计算机视觉是农业植物监测和数字系统手写识别等众多现实世界应用中的关键组成部分。然而,开发高性能的计算机视觉模型通常需要机器学习(ML)专业知识和特定领域的知识,这使得这一过程成本高昂、劳动密集且对许多人来说难以企及。大型语言模型(LLM)代理作为自动化此工作流程的一种有前途的方法应运而生,但大多数现有的方法都存在一个共同的局限性:它们试图在评估前一次性优化整个管道,因此很难将改进归因于特定的变化。这种缺乏细节导致了不稳定的优化和较慢的收敛速度,从而限制了其有效性。 为了解决这个问题,我们引入了一种名为迭代细化的新策略,这是一种由LLM驱动的机器学习管道设计方法,灵感来自人类ML专家如何逐步完善模型——专注于一次改进一个组件而不是一次性做出重大改变。通过基于实际训练反馈系统地更新单个组件,迭代细化增强了稳定性、可解释性和整体模型性能。 我们在IMPROVE框架中实现了这一策略,这是一个端到端的LLM代理框架,用于自动化和优化对象分类管道。通过在不同大小和领域数据集上的广泛评估,包括标准基准测试和Kaggle竞赛数据集,我们证明了迭代细化使IMPROVE能够持续优于现有的零样本LLM方法。 这些发现确立了迭代细化作为LLM驱动的ML自动化的一种有效新策略,并将IMPROVE定位为无需机器学习专业知识即可构建高质量计算机视觉模型的一个可行解决方案。
https://arxiv.org/abs/2502.18530
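The one-component-at-a-time idea in the IMPROVE abstract above can be illustrated with a small greedy loop. This is only a sketch under stated assumptions, not the IMPROVE implementation: the names `iterative_refinement` and `toy_evaluate`, the component names, and the scores are all hypothetical, and `toy_evaluate` stands in for a real train-and-validate run.

```python
def iterative_refinement(config, candidates, evaluate, rounds=3):
    """Change one pipeline component at a time; keep a change only if the score improves."""
    best = dict(config)
    best_score = evaluate(best)
    history = []  # (component, chosen option, resulting score) for each accepted change
    for _ in range(rounds):
        for component, options in candidates.items():
            for option in options:
                trial = dict(best)
                trial[component] = option  # exactly one component differs from `best`
                score = evaluate(trial)
                if score > best_score:
                    best, best_score = trial, score
                    history.append((component, option, score))
    return best, best_score, history


def toy_evaluate(cfg):
    # Hypothetical stand-in for real training feedback (validation accuracy).
    score = 0.5
    score += 0.2 if cfg["augmentation"] == "flip+crop" else 0.0
    score += 0.1 if cfg["optimizer"] == "adamw" else 0.0
    score += {"resnet18": 0.0, "resnet50": 0.15}[cfg["backbone"]]
    return score


start = {"augmentation": "none", "optimizer": "sgd", "backbone": "resnet18"}
candidates = {
    "augmentation": ["none", "flip+crop"],
    "optimizer": ["sgd", "adamw"],
    "backbone": ["resnet18", "resnet50"],
}
best, best_score, history = iterative_refinement(start, candidates, toy_evaluate)
```

Because each trial differs from the current best configuration in a single component, every entry in `history` attributes a score gain to one specific change, which is the granularity the abstract argues single-step pipeline optimization lacks.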
Handwritten Text Recognition (HTR) has become an essential field within pattern recognition and machine learning, with applications spanning historical document preservation to modern data entry and accessibility solutions. The complexity of HTR lies in the high variability of handwriting, which makes it challenging to develop robust recognition systems. This survey examines the evolution of HTR models, tracing their progression from early heuristic-based approaches to contemporary state-of-the-art neural models, which leverage deep learning techniques. The scope of the field has also expanded, with models initially capable of recognizing only word-level content progressing to recent end-to-end document-level approaches. Our paper categorizes existing work into two primary levels of recognition: (1) "up to line-level", encompassing word and line recognition, and (2) "beyond line-level", addressing paragraph- and document-level challenges. We provide a unified framework that examines research methodologies, recent advances in benchmarking, key datasets in the field, and a discussion of the results reported in the literature. Finally, we identify pressing research challenges and outline promising future directions, aiming to equip researchers and practitioners with a roadmap for advancing the field.
手写文本识别(HTR)已成为模式识别和机器学习领域的一个重要分支,其应用范围从历史文档的保存到现代数据录入及无障碍解决方案。HTR 的复杂性在于手写风格的高度变异性,这使得开发稳健的识别系统具有挑战性。本文综述了 HTR 模型的发展历程,追溯其从早期基于启发式的方法演进至当前最先进的神经网络模型的过程,后者利用深度学习技术来提升性能。该领域的研究范围也已扩大,初期只能识别单词级内容的模型逐渐发展为现今涵盖整个文档级别的端到端方法。我们在论文中将现有的工作分类为两个主要的识别层次:(1)**线级别及以下**,包括词和行的识别;以及 (2)**超出行级别**,解决段落和整篇文档层面的问题。我们提供了一个统一的研究框架,涵盖了研究方法、近期基准测试的进步、领域中的关键数据集,以及对文献中报告结果的讨论。最后,我们指出了亟待解决的研究挑战,并概述了未来有前景的发展方向,旨在为研究人员及从业者提供一份推进该领域的路线图。
https://arxiv.org/abs/2502.08417
In this study, we explored the use of spectrograms to represent handwriting signals for assessing neurodegenerative diseases in a cohort of 42 healthy controls (CTL), 35 subjects with Parkinson's Disease (PD), 21 with Alzheimer's Disease (AD), and 15 with Parkinson's Disease Mimics (PDM). We applied CNN and CNN-BLSTM models for binary classification using both multi-channel fixed-size and frame-based spectrograms. Our results showed that handwriting tasks and spectrogram channel combinations significantly impacted classification performance. The highest F1-score (89.8%) was achieved for AD vs. CTL, while PD vs. CTL reached 74.5%, and PD vs. PDM scored 77.97%. CNN consistently outperformed CNN-BLSTM. Different sliding window lengths were tested for constructing frame-based spectrograms. A 1-second window worked best for AD, longer windows improved PD classification, and window length had little effect on PD vs. PDM.
在这项研究中,我们探索了使用频谱图来表示手写信号以评估神经退行性疾病的方法。参与本研究的有42名健康对照组(CTL),35名帕金森病患者(PD),21名阿尔茨海默病患者(AD)和15名帕金森病模拟病例(PDM)。我们使用了卷积神经网络(CNN)和结合了长短时记忆层的卷积神经网络(CNN-BLSTM)模型,采用多通道固定大小频谱图及基于帧的频谱图进行二元分类。我们的结果表明,手写任务以及频谱图信道组合对分类性能有显著影响。在AD vs. CTL 的分类中取得了最高的F1分数(89.8%),PD vs. CTL 分类得分为74.5%,而 PD vs. PDM 的分类得分则为77.97%。CNN模型始终优于CNN-BLSTM模型。 我们测试了不同的滑动窗口长度,以构建基于帧的频谱图。对于AD病例,1秒的窗口表现最佳;而对于PD病例,较长的窗口可以改善其分类效果;至于 PD vs. PDM 的分类任务,则窗口长度对结果影响不大。
https://arxiv.org/abs/2502.07025
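The sliding-window framing mentioned in the spectrogram abstract above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing: the function name `frame_signal`, the sampling rate, and the hop length are assumptions (the abstract states only that different window lengths, including 1 second, were tested).

```python
def frame_signal(samples, sample_rate_hz, window_s, hop_s):
    """Split a 1-D signal into fixed-length, possibly overlapping frames."""
    win = int(round(window_s * sample_rate_hz))  # samples per window
    hop = int(round(hop_s * sample_rate_hz))     # samples between window starts
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must cover at least one sample")
    # Only full windows are kept; a trailing partial window is dropped.
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]


# Toy signal: 10 samples at an assumed 2 Hz, 2-second windows, 1-second hop.
frames = frame_signal(list(range(10)), sample_rate_hz=2, window_s=2.0, hop_s=1.0)
```

Each frame would then be turned into its own spectrogram, so a longer `window_s` gives fewer frames that each cover more of the handwriting movement.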
Converting images of Arabic text into plain text is a widely researched topic in academia and industry. However, recognition of Arabic handwritten and printed text presents difficult challenges due to the complex variations of the Arabic script. This work proposes an end-to-end solution for recognizing Arabic handwritten text, printed text, and Arabic numbers, and presents the data in a structured manner. We reached 81.66% precision, 78.82% recall, and 79.07% F-measure on the text detection task that powers the proposed solution. The proposed recognition model incorporates state-of-the-art CNN-based feature extraction and Transformer-based sequence modeling to accommodate variations in handwriting styles, stroke thicknesses, alignments, and noise conditions. The evaluation suggests strong performance on both printed and handwritten text, yielding 0.59% CER and 1.72% WER on printed text, and 7.91% CER and 31.41% WER on handwritten text. The overall solution has proven reliable in real-life OCR tasks. It is equipped with both detection and recognition models as well as supporting feature extraction and matching algorithms. Its general-purpose implementation makes the solution applicable to any Arabic handwritten or printed document or receipt, and thus practical across contexts.
将阿拉伯文图像转换为纯文本是学术界和工业界广泛研究的主题。然而,由于阿拉伯文字体的复杂性及其变体,识别手写和印刷的阿拉伯文文本面临着巨大的挑战。本文提出了一种端到端解决方案,用于识别阿拉伯手写、印刷及阿拉伯数字,并以结构化的方式呈现数据。在支撑该方案的文字检测任务上,我们达到了81.66%的精度(Precision)、78.82%的召回率(Recall)和79.07%的F值(F-measure)。提出的识别模型结合了最先进的基于CNN的特征提取技术和基于Transformer的序列建模技术,以适应书写风格、笔画粗细、对齐方式以及噪音条件的变化。模型评估显示其在印刷文本上表现良好,CER(字符错误率)为0.59%,WER(词错误率)为1.72%;在手写文本上,CER达到7.91%,WER则为31.41%。整体解决方案已被证明在实际OCR任务中可靠有效,并配备了检测和识别模型以及其他特征提取与匹配算法。通过通用实施方式,使其适用于任何给定的阿拉伯文手写或印刷文档。因此,在任何情境下都具有实用性和价值。
https://arxiv.org/abs/2502.05277
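The CER and WER figures reported in the abstract above are conventionally computed from the Levenshtein edit distance between a reference transcription and the recognizer output. The sketch below shows that standard definition; the function names are illustrative, and the paper may use a different normalization or tokenization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

A single wrong character makes its whole word wrong, which is one reason WER (31.41% on handwritten text) is typically much higher than CER (7.91%) for the same output.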
Large Language Models (LLMs) have been extensively applied in time series analysis. Yet, their utility in the few-shot classification (i.e., a crucial training scenario due to the limited training data available in industrial applications) concerning multivariate time series data remains underexplored. We aim to leverage the extensive pre-trained knowledge in LLMs to overcome the data scarcity problem within multivariate time series. Specifically, we propose LLMFew, an LLM-enhanced framework to investigate the feasibility and capacity of LLMs for few-shot multivariate time series classification. This model introduces a Patch-wise Temporal Convolution Encoder (PTCEnc) to align time series data with the textual embedding input of LLMs. We further fine-tune the pre-trained LLM decoder with Low-rank Adaptations (LoRA) to enhance its feature representation learning ability in time series data. Experimental results show that our model outperformed state-of-the-art baselines by a large margin, achieving 125.2% and 50.2% improvement in classification accuracy on Handwriting and EthanolConcentration datasets, respectively. Moreover, our experimental results demonstrate that LLM-based methods perform well across a variety of datasets in few-shot MTSC, delivering reliable results compared to traditional models. This success paves the way for their deployment in industrial environments where data are limited.
大型语言模型(LLMs)已被广泛应用于时间序列分析中。然而,它们在少量样本分类中的效用(即由于工业应用中可用的训练数据有限,这是一个至关重要的训练场景),尤其是在处理多变量时间序列数据时,仍然有待探索。我们旨在利用大规模预训练的语言模型的知识来解决多变量时间序列数据中存在的数据稀缺问题。为此,我们提出了LLMFew框架,这是基于大型语言模型增强的一种方法,用于研究大型语言模型在少量样本的多变量时间序列分类中的可行性和能力。 该模型引入了“Patch-wise Temporal Convolution Encoder (PTCEnc)”来对齐时间序列数据与大型语言模型输入文本嵌入。此外,我们还通过低秩适应(LoRA)微调预训练的语言模型解码器,以增强其在处理时间序列数据时的特征表示学习能力。 实验结果表明,我们的模型相比于最先进的基准方法,在分类准确度上分别提高了125.2%和50.2%,具体体现在Handwriting和EthanolConcentration数据集上的表现。此外,我们的实验证明了基于大型语言模型的方法在少量样本多变量时间序列分类的各种数据集中都表现出色,并且与传统模型相比提供了可靠的结果。 这一成功为这些方法在工业环境中部署铺平了道路,在这些环境中由于数据有限而难以采用传统的机器学习技术。
https://arxiv.org/abs/2502.00059
Brain-computer interfaces (BCIs) present a promising avenue by translating neural activity directly into text, eliminating the need for physical actions. However, existing non-invasive BCI systems have not successfully covered the entire alphabet, limiting their practicality. In this paper, we propose a novel non-invasive EEG-based BCI system with a Curriculum-based Neural Spelling Framework, which first recognizes all 26 letters of the alphabet by decoding neural signals associated with handwriting, and then applies Generative AI (GenAI) to enhance spelling-based neural language decoding tasks. Our approach combines the ease of handwriting with the accessibility of EEG technology, utilizing advanced neural decoding algorithms and pre-trained large language models (LLMs) to translate EEG patterns into text with high accuracy. This system shows how GenAI can improve the performance of typical spelling-based neural language decoding tasks and addresses the limitations of previous methods, offering a scalable and user-friendly solution for individuals with communication impairments, thereby enhancing inclusive communication options.
脑机接口(BCI)通过直接将神经活动转换为文本,无需物理动作,提供了一种有前景的方法。然而,现有的非侵入性 BCI 系统尚未成功覆盖整个字母表,限制了其实用性。在本文中,我们提出了一种基于课程学习的新型非侵入式 EEG(脑电图)BCI 系统,该系统首先通过解码与手写相关的神经信号来识别所有 26 个英文字母,然后应用生成人工智能(GenAI)来增强基于拼写的神经语言解码任务。我们的方法结合了书写简便性和 EEG 技术的易用性,并利用先进的神经解码算法和预训练的大规模语言模型(LLM),将 EEG 模式准确地转换为文本。该系统展示了 GenAI 如何提升典型的基于拼写式的神经语言解码任务的表现,解决了先前方法的局限性,为有沟通障碍的人提供了一种可扩展且用户友好的解决方案,从而增强了包容性的沟通选项。
https://arxiv.org/abs/2501.17489
Dyslexia affects reading and writing skills across many languages. This work describes a new application of YOLO-based object detection to isolate and label handwriting patterns (Normal, Reversal, Corrected) within synthetic images that resemble real words. Individual letters are first collected, preprocessed into 32x32 samples, then assembled into larger synthetic 'words' to simulate realistic handwriting. Our YOLOv11 framework simultaneously localizes each letter and classifies it into one of three categories, reflecting key dyslexia traits. Empirically, we achieve near-perfect performance, with precision, recall, and F1 metrics typically exceeding 0.999. This surpasses earlier single-letter approaches that rely on conventional CNNs or transfer-learning classifiers (for example, MobileNet-based methods in Robaa et al. arXiv:2410.19821). Unlike simpler pipelines that consider each letter in isolation, our solution processes complete word images, resulting in more authentic representations of handwriting. Although relying on synthetic data raises concerns about domain gaps, these experiments highlight the promise of YOLO-based detection for faster and more interpretable dyslexia screening. Future work will expand to real-world handwriting, other languages, and deeper explainability methods to build confidence among educators, clinicians, and families.
阅读障碍会影响多语言的读写技能。这项工作描述了一种基于YOLO(You Only Look Once)目标检测的新应用,该应用旨在从类似于真实单词的合成图像中分离和标记书写模式(正常、反转、修正)。首先收集单个字母,预处理为32x32样本,然后组装成更大的合成“单词”,以模拟真实的书写方式。我们的YOLOv11框架同时定位每个字母并将其分类到三个类别之一,反映关键的阅读障碍特征。从经验上看,我们达到了接近完美的性能,精度、召回率和F1指标通常超过0.999。这超过了依赖传统CNN(卷积神经网络)或迁移学习分类器(例如Robaa等人提出的基于MobileNet的方法 arXiv:2410.19821)的早期单字母方法。与只考虑每个字母的简单流程不同,我们的解决方案处理完整的单词图像,从而生成更真实的书写表示形式。尽管依赖于合成数据会引发领域差距的问题,但这些实验突显了基于YOLO检测在阅读障碍筛查中实现更快和更具解释性的潜力。未来的工作将扩展到现实世界中的手写、其他语言以及更深的可解释性方法,以增强教育者、临床医生和家庭的信心。
https://arxiv.org/abs/2501.15263
Despite recent significant advancements in Handwritten Document Recognition (HDR), the efficient and accurate recognition of text against complex backgrounds, diverse handwriting styles, and varying document layouts remains a practical challenge. Moreover, this issue is seldom addressed in academic research, particularly in scenarios with minimal annotated data available. In this paper, we introduce the DocTTT framework to address these challenges. The key innovation of our approach is that it uses test-time training to adapt the model to each specific input during testing. We propose a novel Meta-Auxiliary learning approach that combines Meta-learning and self-supervised Masked Autoencoder (MAE). During testing, we adapt the visual representation parameters using a self-supervised MAE loss. During training, we learn the model parameters using a meta-learning framework, so that the model parameters are learned to adapt to a new input effectively. Experimental results show that our proposed method significantly outperforms existing state-of-the-art approaches on benchmark datasets.
尽管手写文档识别(HDR)领域近年来取得了显著进展,但在复杂背景、多样的书写风格和不同的文档布局下高效准确地识别文本仍然是一项实际挑战。此外,在学术研究中,尤其是在缺乏充足标注数据的情况下解决这些问题的尝试很少。在本文中,我们介绍了DocTTT框架来应对这些挑战。我们的方法的关键创新在于它使用测试时训练(test-time training)来自适应调整模型以针对每个特定输入进行优化。 具体而言,我们提出了一种新颖的元辅助学习方法,该方法结合了元学习和自监督掩码自动编码器(MAE)。在测试阶段,通过自监督的MAE损失函数来调整视觉表示参数。而在训练过程中,则利用一个元学习框架来学习模型参数,使模型能够在面对新输入时有效地进行适应。 实验结果表明,在基准数据集上,我们提出的方法显著优于现有的最先进的方法。
https://arxiv.org/abs/2501.12898
In the realm of digital forensics and document authentication, writer identification plays a crucial role in determining the authors of documents based on handwriting styles. The primary challenge in writer-id is the "open-set scenario", where the goal is accurately recognizing writers unseen during the model training. To overcome this challenge, representation learning is the key. This method can capture unique handwriting features, enabling it to recognize styles not previously encountered during training. Building on this concept, this paper introduces the Contrastive Masked Auto-Encoders (CMAE) for Character-level Open-Set Writer Identification. We merge Masked Auto-Encoders (MAE) with Contrastive Learning (CL) to simultaneously and respectively capture sequential information and distinguish diverse handwriting styles. Demonstrating its effectiveness, our model achieves state-of-the-art (SOTA) results on the CASIA online handwriting dataset, reaching an impressive precision rate of 89.7%. Our study advances universal writer-id with a sophisticated representation learning approach, contributing substantially to the ever-evolving landscape of digital handwriting analysis, and catering to the demands of an increasingly interconnected world.
在数字取证和文档认证领域,作者识别通过分析书写风格来确定文档的作者身份,扮演着至关重要的角色。作者识别(writer-id)的主要挑战在于“开放集场景”,即目标是准确地识别出那些未在模型训练期间见过的作者。为应对这一挑战,表示学习方法至关重要,该方法能够捕捉到独特的手写特征,从而能够在未曾遇到过的书写风格中进行识别。 在此基础上,本文介绍了字符级开放集作者识别中的对比掩码自动编码器(Contrastive Masked Auto-Encoders, CMAE)。我们结合了掩码自动编码器(Masked Auto-Encoders, MAE)与对比学习(Contrastive Learning, CL),以同时且分别地捕捉序列信息和区分多样化的书写风格。通过在CASIA在线手写数据集上的实验,我们的模型取得了最先进的精度率89.7%的成绩,证明了其有效性。 本研究通过一种复杂的表示学习方法推进了通用作者识别技术的发展,并为不断演变的数字笔迹分析领域做出了重要贡献,同时也满足了一个日益互联的世界的需求。
https://arxiv.org/abs/2501.11895
This paper introduces a cost-effective robotic handwriting system designed to replicate human-like handwriting with high precision. Combining a Raspberry Pi Pico microcontroller, 3D-printed components, and a machine learning-based handwriting generation model implemented via this http URL, the system converts user-supplied text into realistic stroke trajectories. By leveraging lightweight 3D-printed materials and efficient mechanical designs, the system achieves a total hardware cost of approximately $56, significantly undercutting commercial alternatives. Experimental evaluations demonstrate handwriting precision within ±0.3 millimeters and a writing speed of approximately 200 mm/min, positioning the system as a viable solution for educational, research, and assistive applications. This study seeks to lower the barriers to personalized handwriting technologies, making them accessible to a broader audience.
这篇论文介绍了一种成本效益高的机器人书写系统,旨在以高精度复制类似人类的笔迹。该系统结合了Raspberry Pi Pico微控制器、3D打印部件以及通过 this http URL 实现的基于机器学习的手写生成模型,能够将用户提供的文本转换成逼真的笔画轨迹。利用轻量级3D打印材料和高效的机械设计,该系统实现了约56美元的总硬件成本,显著低于商用替代品的价格。实验评估表明,系统的书写精度在±0.3毫米范围内,并且书写速度约为每分钟200毫米,使其成为教育、研究及辅助应用的一个可行解决方案。本研究旨在降低个性化手写技术的门槛,让更广泛的受众能够使用这些技术。
https://arxiv.org/abs/2501.06783
Extracting medication names from handwritten doctor prescriptions is challenging due to the wide variability in handwriting styles and prescription formats. This paper presents a robust method for extracting medicine names using a combination of Mask R-CNN and Transformer-based Optical Character Recognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A novel dataset, featuring diverse handwritten prescriptions from various regions of Pakistan, was utilized to fine-tune the model on different handwriting styles. The Mask R-CNN model segments the prescription images to focus on the medicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and Positional Embeddings, transcribes the isolated text. The transcribed text is then matched against a pre-existing database for accurate identification. The proposed approach achieved a character error rate (CER) of 1.4% on standard benchmarks, highlighting its potential as a reliable and efficient tool for automating medicine name extraction.
从手写医生处方中提取药品名称具有挑战性,因为书写风格和处方格式的多样性。本文介绍了一种结合使用Mask R-CNN和基于Transformer的光学字符识别(TrOCR)的方法来提取药品名称,该方法采用了多头注意力机制和位置嵌入技术。研究中利用了一个新数据集,其中包含了来自巴基斯坦不同地区的多样化手写处方,用以调整模型以适应不同的书写风格。Mask R-CNN模型用于分割处方图像,专注于药物部分,而由多头注意力机制和位置嵌入增强的TrOCR模型则负责转录孤立文本。随后将转录后的文本与现有的数据库进行匹配,以实现准确识别。所提出的方法在标准基准测试中达到了1.4%的字符错误率(CER),显示出其作为自动提取药品名称可靠而高效的工具的巨大潜力。
https://arxiv.org/abs/2412.18199
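The final step in the prescription abstract above, matching transcribed text against a pre-existing database, can be approximated with fuzzy string matching. The sketch below uses Python's standard `difflib` as a stand-in; the abstract does not say which matching method the authors use, and the function name `match_medicine`, the example drug list, and the 0.8 cutoff are assumptions.

```python
import difflib


def match_medicine(transcribed, known_names, cutoff=0.8):
    """Return the closest database name for an OCR transcription, or None."""
    # Compare case-insensitively, but return the database spelling.
    lookup = {name.lower(): name for name in known_names}
    hits = difflib.get_close_matches(
        transcribed.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


# Hypothetical database contents, for illustration only.
database = ["Paracetamol", "Ibuprofen", "Amoxicillin"]
corrected = match_medicine("Paracetmol", database)
unknown = match_medicine("xyz", database)
```

A lookup of this kind lets small transcription errors be absorbed, while the cutoff keeps unrelated strings from being forced onto a drug name.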
The generation of images of realistic looking, readable handwritten text is a challenging task which is referred to as handwritten text generation (HTG). Given a string and examples from a writer, the goal is to synthesize an image depicting the correctly spelled word in handwriting with the calligraphic style of the desired writer. An important application of HTG is the generation of training images in order to adapt downstream models for new data sets. With their success in natural image generation, diffusion models (DMs) have become the state-of-the-art approach in HTG. In this work, we present an extension of a latent DM for HTG to enable generation of writing styles not seen during training by learning style conditioning with a masked auto encoder. Our proposed content encoder allows for different ways of conditioning the DM on textual and calligraphic features. Additionally, we employ classifier-free guidance and explore the influence on the quality of the generated training images. For adapting the model to a new unlabeled data set, we propose a semi-supervised training scheme. We evaluate our approach on the IAM-database and use the RIMES-database to examine the generation of data not seen during training achieving improvements in this particularly promising application of DMs for HTG.
生成看起来逼真且可读的手写文本图像是一个具有挑战性的任务,被称为手写文本生成(HTG)。给定一段字符串和某个书写者的样本,目标是合成一张图片,该图片以所期望的书写者风格展示正确拼写的单词。HTG的一个重要应用是在新数据集上训练下游模型时生成训练图像。由于在自然图像生成方面的成功,扩散模型(DMs)已成为HTG领域的领先方法。在此工作中,我们提出了一种扩展的潜在DM用于HTG,通过学习带有掩码自编码器的风格条件来实现生成未见过的书写风格。我们的内容编码器允许以不同的方式对DM进行文本和书法特征上的条件设置。此外,我们采用无分类器指导,并探讨其对生成训练图像质量的影响。为了将模型适应新的未标记数据集,我们提出了一种半监督训练方案。我们在IAM数据库上评估了我们的方法,并使用RIMES数据库来检查生成未见过的数据,从而在DMs用于HTG的这一特别有前景的应用中取得了改进。
https://arxiv.org/abs/2412.15853
Currently, the prevalence of online handwriting has spurred a critical need for effective retrieval systems to accurately search relevant handwriting instances from specific writers, known as online writer retrieval. Despite the growing demand, this field suffers from a scarcity of well-established methodologies and public large-scale datasets. This paper tackles these challenges with a focus on Chinese handwritten phrases. First, we propose DOLPHIN, a novel retrieval model designed to enhance handwriting representations through synergistic temporal-frequency analysis. For frequency feature learning, we propose the HFGA block, which performs gated cross-attention between the vanilla temporal handwriting sequence and its high-frequency sub-bands to amplify salient writing details. For temporal feature learning, we propose the CAIR block, tailored to promote channel interaction and reduce channel redundancy. Second, to address data deficit, we introduce OLIWER, a large-scale online writer retrieval dataset encompassing over 670,000 Chinese handwritten phrases from 1,731 individuals. Through extensive evaluations, we demonstrate the superior performance of DOLPHIN over existing methods. In addition, we explore cross-domain writer retrieval and reveal the pivotal role of increasing feature alignment in bridging the distributional gap between different handwriting data. Our findings emphasize the significance of point sampling frequency and pressure features in improving handwriting representation quality and retrieval performance. Code and dataset are available at this https URL.
当前,在线手写内容的普及引发了对于能够准确搜索特定作者相关手写实例的有效检索系统的关键需求,这被称为在线作家检索。尽管需求不断增长,该领域仍缺乏成熟的方法论和大规模公共数据集。本文针对这些挑战,重点研究中文手写短语的问题。首先,我们提出了DOLPHIN模型,这是一种新型的检索模型,旨在通过协同的时间-频率分析来增强手写表示。为了学习频率特征,我们提出了HFGA模块,该模块在普通时间手写序列与其高频子带之间执行门控交叉注意力机制,以放大显著的书写细节。对于时间特征的学习,我们提出了CAIR模块,专门设计用于促进通道交互并减少通道冗余。其次,为了解决数据不足的问题,我们引入了OLIWER,这是一个大型在线作家检索数据集,包含来自1731个个体的超过670,000个中文手写短语。通过广泛的评估,我们展示了DOLPHIN模型在现有方法上的优越性能。此外,我们也探索了跨域作家检索,并揭示了提高特征对齐在缩小不同手写数据分布差距中的关键作用。我们的发现强调了点采样频率和压力特征在提升手写表示质量和检索性能方面的意义。代码和数据集可在以下链接获取:[此https URL]。
https://arxiv.org/abs/2412.11668
The problem of converting images of text into plain text is a widely researched topic in both academia and industry. Arabic Handwritten Text Recognition (AHTR) poses additional challenges due to diverse handwriting styles and limited labeled data. In this paper, we present a complete OCR pipeline that starts with line segmentation using Differentiable Binarization and Adaptive Scale Fusion techniques to ensure accurate detection of text lines. Following segmentation, a CNN-BiLSTM-CTC architecture is applied to recognize characters. Our system, trained on the Arabic Multi-Fonts Dataset (AMFDS), achieves a Character Recognition Rate (CRR) of 99.20% and a Word Recognition Rate (WRR) of 93.75% on single-word samples containing 7 to 10 characters, along with a CRR of 83.76% for sentences. These results demonstrate the system's strong performance in handling Arabic scripts, establishing a new benchmark for AHTR systems.
https://arxiv.org/abs/2412.01601
Hand preference and degree of handedness (DoH) are two different aspects of human behavior that are often conflated. DoH is an inherent capability of a person's brain, shaped by both nature and nurture. In this study, we used dominant and non-dominant handwriting traits to assess DoH for the first time, on 43 subjects in three categories: Unidextrous, Partially Unidextrous, and Ambidextrous. Features extracted from segments of the handwriting signals, called strokes, were used for DoH quantification. The Davies-Bouldin Index, a Multilayer Perceptron, and a Convolutional Neural Network (CNN) were used for automated grading of DoH. The outcomes of these methods were compared with the widely used DoH assessment questionnaire from the Edinburgh Inventory (EI). The CNN-based automated grading outperformed the other computational methods, with an average classification accuracy of 95.06% under stratified 10-fold cross-validation. A leave-one-subject-out strategy on this CNN yielded a DoH score for each test individual, which was converted into a 4-point score. Around 90% of the scores obtained from all the implemented computational methods were found to be in accordance with the EI scores under a 95% confidence interval. Automated grading of degree of handedness using handwriting signals can provide more resolution to the Edinburgh Inventory scores. This could be used in multiple applications concerned with neuroscience, rehabilitation, physiology, psychometry, behavioral sciences, and forensics.
https://arxiv.org/abs/2412.01587
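The leave-one-subject-out strategy named in the handedness abstract above holds out every sample from one subject per fold, so the model is always tested on a person it has never seen. A minimal sketch of the split logic follows; the function name and the toy subject labels are hypothetical, and the paper's actual training code is not shown here.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy example: five stroke samples belonging to three subjects.
stroke_subjects = ["s1", "s1", "s2", "s3", "s3"]
folds = list(leave_one_subject_out(stroke_subjects))
```

Unlike stratified 10-fold cross-validation, where strokes from the same person can land in both the training and test sets, this split keeps each subject's samples together, which is why it suits producing a per-individual score.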
The Virginia Tech University Libraries (VTUL) Digital Library Platform (DLP) hosts digital collections that offer our users access to a wide variety of documents of historical and cultural importance. These collections are not only of academic importance but also provide our users with a glance at local historical events. Our DLP contains collections comprising digital objects featuring complex layouts, faded imagery, and hard-to-read handwritten text, which makes providing online access to these materials challenging. To address these issues, we integrate AI into our DLP workflow and convert the text in the digital objects into a machine-readable format. To enhance the user experience with our historical collections, we use custom AI agents for handwriting recognition, text extraction, and large language models (LLMs) for summarization. This poster highlights three collections focusing on handwritten letters, newspapers, and digitized topographic maps. We discuss the challenges with each collection and detail our approaches to address them. Our proposed methods aim to enhance the user experience by making the contents in these collections easier to search and navigate.
弗吉尼亚理工大学图书馆(VTUL)的数字图书馆平台(DLP)托管了多种数字化藏品,为用户提供了访问具有历史和文化重要性的各种文档的机会。这些收藏不仅具有学术价值,还让用户能够了解本地的历史事件。我们的DLP包含了一系列复杂的数字对象,包括布局复杂、图像褪色以及难以辨认的手写文本等内容,这使得在线提供这些材料变得颇具挑战性。为解决这些问题,我们将在DLP工作流程中集成了人工智能,并将数字对象中的文字转换成机器可读的格式。为了增强用户对历史藏品的体验,我们使用了定制的人工智能代理进行手写识别、文本提取以及大型语言模型(LLMs)进行总结。此海报重点介绍了三个收藏项目:手写信件、报纸和数字化地形图。我们将讨论每个收藏项目的挑战,并详细说明我们的解决方法。所提出的方法旨在通过使这些藏品中的内容更易于搜索和导航,从而增强用户体验。
https://arxiv.org/abs/2411.17600
The generation of handwritten music sheets is a crucial step toward enhancing Optical Music Recognition (OMR) systems, which rely on large and diverse datasets for optimal performance. However, handwritten music sheets, often found in archives, present challenges for digitisation due to their fragility, varied handwriting styles, and image quality. This paper addresses the data scarcity problem by applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheets. We provide a comprehensive evaluation of three GAN models - DCGAN, ProGAN, and CycleWGAN - comparing their ability to generate diverse and high-quality handwritten music images. The proposed CycleWGAN model, which enhances style transfer and training stability, significantly outperforms DCGAN and ProGAN in both qualitative and quantitative evaluations. CycleWGAN achieves superior performance, with an FID score of 41.87, an IS of 2.29, and a KID of 0.05, making it a promising solution for improving OMR systems.
https://arxiv.org/abs/2411.16405