Text plays a crucial role in the transmission of human civilization, and teaching machines to generate online handwritten text in various styles presents an interesting and significant challenge. However, most prior work has concentrated on generating individual Chinese fonts, leaving complete text line generation largely unexplored. In this paper, we identify that text lines can naturally be divided into two components: layout and glyphs. Based on this division, we design a text line layout generator coupled with a diffusion-based stylized font synthesizer to address this challenge hierarchically. More concretely, the layout generator performs in-context-like learning based on the text content and the provided style references to generate positions for each glyph autoregressively. Meanwhile, the font synthesizer, which consists of a character embedding dictionary, a multi-scale calligraphy style encoder, and a 1D U-Net based diffusion denoiser, generates each glyph at its position while imitating the calligraphy style extracted from the given style references. Qualitative and quantitative experiments on the CASIA-OLHWDB dataset demonstrate that our method is capable of generating structurally correct and indistinguishable imitation samples.
https://arxiv.org/abs/2410.02309
Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.
https://arxiv.org/abs/2410.02179
Learning from Demonstration (LfD) is a useful paradigm for training policies that solve tasks involving complex motions. In practice, the successful application of LfD requires overcoming error accumulation during policy execution, i.e., the problem of drift due to errors compounding over time and the consequent out-of-distribution behaviours. Existing works seek to address this problem through scaling data collection, correcting policy errors with a human-in-the-loop, temporally ensembling policy predictions, or learning the parameters of a dynamical system model. In this work, we propose and validate an alternative approach to overcoming this issue. Inspired by reservoir computing, we develop a novel neural network layer that includes a fixed nonlinear dynamical system with tunable dynamical properties. We validate the efficacy of our neural network layer on the task of reproducing human handwriting motions using the LASA Human Handwriting Dataset. Through empirical experiments we demonstrate that incorporating our layer into existing neural network architectures addresses the issue of compounding errors in LfD. Furthermore, we perform a comparative evaluation against existing approaches, including a temporal ensemble of policy predictions and an Echo State Network (ESN) implementation. We find that our approach yields greater policy precision and robustness on the handwriting task while also generalising to multiple dynamics regimes and maintaining competitive latency scores.
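The reservoir-computing idea this work draws on can be illustrated with a minimal echo-state-style sketch in NumPy. The function names and the uniform weight initialisation below are illustrative assumptions, not the paper's actual layer: the recurrent weights are random and fixed, and only their spectral radius, a tunable dynamical property, is adjusted.

```python
import numpy as np

def make_reservoir(n_in, n_res, spectral_radius=0.9, seed=0):
    """Build a fixed (untrained) reservoir: random input and recurrent
    weights, with the recurrent matrix rescaled so its largest eigenvalue
    magnitude equals `spectral_radius` (the tunable dynamical property)."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-1, 1, size=(n_res, n_in))
    w_res = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
    w_res *= spectral_radius / max(abs(np.linalg.eigvals(w_res)))
    return w_in, w_res

def run_reservoir(w_in, w_res, inputs):
    """Drive the fixed nonlinear dynamical system with an input sequence
    and collect its states; a trainable read-out would then map states to
    motion commands in an LfD policy."""
    state = np.zeros(w_res.shape[0])
    states = []
    for u in inputs:
        state = np.tanh(w_res @ state + w_in @ u)
        states.append(state.copy())
    return np.stack(states)
```

For a length-T input sequence the collected states have shape `(T, n_res)`; keeping the spectral radius below 1 is the usual heuristic for stable, fading-memory dynamics.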
https://arxiv.org/abs/2409.18768
Handwritten Text Generation (HTG) conditioned on text and style is a challenging task due to the variability of inter-user characteristics and the unlimited combinations of characters that form new words unseen during training. Diffusion Models have recently shown promising results in HTG but still remain under-explored. We present DiffusionPen (DiffPen), a 5-shot style handwritten text generation approach based on Latent Diffusion Models. By utilizing a hybrid style extractor that combines metric learning and classification, our approach manages to capture both textual and stylistic characteristics of seen and unseen words and styles, generating realistic handwritten samples. Moreover, we explore several variation strategies of the data with multi-style mixtures and noisy embeddings, enhancing the robustness and diversity of the generated data. Extensive experiments using the IAM offline handwriting database show that our method outperforms existing methods qualitatively and quantitatively, and its additional generated data can improve the performance of Handwriting Text Recognition (HTR) systems. The code is available at: this https URL.
https://arxiv.org/abs/2409.06065
The primary challenge for handwriting recognition systems lies in managing long-range contextual dependencies, an issue that traditional models often struggle with. To mitigate this, attention mechanisms have recently been employed to enhance context-aware labelling, thereby achieving state-of-the-art performance. In the field of pattern recognition and image analysis, however, the use of contextual information in labelling problems has a long history and goes back at least to the early 1970s. Among the various approaches developed in those years, Relaxation Labelling (RL) processes have played a prominent role and were the method of choice in the field for more than a decade. Contrary to recent transformer-based architectures, RL processes offer a principled approach to the use of contextual constraints, having a solid theoretical foundation grounded in variational inequality and game theory, as well as effective algorithms with convergence guarantees. In this paper, we propose a novel approach to handwriting recognition that integrates the strengths of two distinct methodologies. In particular, we propose integrating (trainable) RL processes with various well-established neural architectures, and we introduce a sparsification technique that accelerates the convergence of the algorithm and enhances the overall system's performance. Experiments over several benchmark datasets show that RL processes can improve the generalisation ability, even surpassing in some cases transformer-based architectures.
https://arxiv.org/abs/2409.05699
Existing handwritten text generation methods often require more than ten handwriting samples as style references. However, in practical applications, users tend to prefer a handwriting generation model that operates with just a single reference sample for its convenience and efficiency. This approach, known as "one-shot generation", significantly simplifies the process but poses a significant challenge due to the difficulty of accurately capturing a writer's style from a single sample, especially when extracting fine details from the characters' edges amidst sparse foreground and undesired background noise. To address this problem, we propose a One-shot Diffusion Mimicker (One-DM) to generate handwritten text that can mimic any calligraphic style with only one reference sample. Inspired by the fact that high-frequency information of the individual sample often contains distinct style patterns (e.g., character slant and letter joining), we develop a novel style-enhanced module to improve the style extraction by incorporating high-frequency components from a single sample. We then fuse the style features with the text content as a merged condition for guiding the diffusion model to produce high-quality handwritten text images. Extensive experiments demonstrate that our method can successfully generate handwriting scripts with just one sample reference in multiple languages, even outperforming previous methods using over ten samples. Our source code is available at this https URL.
https://arxiv.org/abs/2409.04004
The evaluation of generative models for natural image tasks has been extensively studied. Similar protocols and metrics are used in cases with unique particularities, such as Handwriting Generation, even if they might not be completely appropriate. In this work, we introduce three measures tailored for HTG evaluation, $ \text{HTG}_{\text{HTR}} $, $ \text{HTG}_{\text{style}} $, and $ \text{HTG}_{\text{OOV}} $, and argue that they are better suited to evaluating the quality of generated handwritten images. The metrics rely on the recognition error/accuracy of Handwriting Text Recognition and Writer Identification models and emphasize writing style, textual content, and diversity as the main aspects that matter for handwritten images. We conduct comprehensive experiments on the IAM handwriting database, showcasing that widely used metrics such as FID fail to properly quantify the diversity and the practical utility of generated handwriting samples. Our findings show that our metrics are richer in information and underscore the necessity of standardized evaluation protocols in HTG. The proposed metrics provide a more robust and informative protocol for assessing HTG quality, contributing to improved performance in HTR. Code for the evaluation protocol is available at: this https URL.
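The $\text{HTG}_{\text{HTR}}$ measure builds on the recognition error of an HTR model; the underlying character error rate is simply a length-normalised edit distance. A minimal sketch (an illustrative implementation, not the authors' exact evaluation code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance between the HTR
    output and the reference transcription, normalised by reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)
```

For example, `cer("hello", "hallo")` is 0.2: one substitution over five reference characters.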
https://arxiv.org/abs/2409.02683
A person's Body Mass Index, or BMI, is the most widely used parameter for assessing their health. Because it is correlated with body fat, BMI is a crucial predictor of potential diseases that may arise at higher body fat levels. Conversely, a community's or an individual's nutritional status can be determined using BMI. Although deep learning models have been used in several studies to estimate BMI from face photos and other data, no previous research established a clear connection between deep learning techniques for handwriting analysis and BMI prediction. This article addresses this research gap with a deep learning approach to estimating BMI from handwritten characters by developing a convolutional neural network (CNN). A dataset containing lowercase English script samples from 48 people was captured for the BMI prediction task. The proposed CNN-based approach reports a commendable accuracy of 99.92%. A performance comparison with other popular CNN architectures reveals that AlexNet and InceptionV3 achieve the second- and third-best performance, with accuracies of 99.69% and 99.53%, respectively.
https://arxiv.org/abs/2409.02584
The imitation of cursive handwriting has mainly been limited to generating handwritten words or lines. Multiple synthetic outputs must be stitched together to create paragraphs or whole pages, whereby consistency and layout information are lost. To close this gap, we propose a method for imitating handwriting at the paragraph level that also works for unseen writing styles. To this end, we introduce a modified latent diffusion model that enriches the encoder-decoder mechanism with specialized loss functions that explicitly preserve style and content. We enhance the attention mechanism of the diffusion model with adaptive 2D positional encoding and extend the conditioning mechanism to work with two modalities simultaneously: a style image and the target text. This significantly improves the realism of the generated handwriting. Our approach sets a new benchmark in our comprehensive evaluation, outperforming all existing imitation methods at both the line and paragraph levels in terms of combined style and content preservation.
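For intuition on the positional-encoding ingredient, the standard fixed 2D sinusoidal encoding that adaptive schemes like the one above build on can be sketched as follows. This is only the common baseline, not the paper's adaptive variant; half the channels encode the row index and half the column index.

```python
import numpy as np

def positional_encoding_2d(height, width, d_model):
    """Fixed 2D sinusoidal positional encoding: the first d_model/2
    channels encode the row index, the rest the column index, each with
    the usual sin/cos frequency ladder. Returns (height, width, d_model)."""
    assert d_model % 4 == 0, "d_model must be divisible by 4"
    d_half = d_model // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(0, d_half, 2) / d_half)
    pe = np.zeros((height, width, d_model))
    pos_y = np.arange(height)[:, None] * freqs  # (H, d_half/2)
    pos_x = np.arange(width)[:, None] * freqs   # (W, d_half/2)
    pe[:, :, 0:d_half:2] = np.sin(pos_y)[:, None, :]
    pe[:, :, 1:d_half:2] = np.cos(pos_y)[:, None, :]
    pe[:, :, d_half::2] = np.sin(pos_x)[None, :, :]
    pe[:, :, d_half + 1::2] = np.cos(pos_x)[None, :, :]
    return pe
```

An adaptive variant would make the frequencies or the mixing of the two axes learnable rather than fixed.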
https://arxiv.org/abs/2409.00786
Developmental dysgraphia is a neurological disorder that hinders children's writing skills. In recent years, researchers have increasingly explored machine learning methods to support the diagnosis of dysgraphia based on offline and online handwriting. In most previous studies, the two types of handwriting have been analysed separately, which does not necessarily lead to promising results and leaves the relationship between online and offline data unexplored. To address this limitation, we propose a novel multimodal machine learning approach utilizing both online and offline handwriting data. We created a new dataset by transforming an existing online handwriting dataset, generating corresponding offline handwriting images, and considered only word-type data (simple words, pseudowords, and difficult words) in our multimodal analysis. We trained SVM and XGBoost classifiers separately on online and offline features, and also implemented multimodal feature fusion and a soft-voting ensemble. Furthermore, we propose a novel ensemble method with conditional feature fusion, which intelligently combines predictions from the online and offline classifiers, selectively incorporating feature fusion when confidence scores fall below a threshold. Our approach achieves an accuracy of 88.8%, outperforming single-modality SVMs by 12-14%, existing methods by 8-9%, and traditional multimodal approaches (soft-voting ensemble and feature fusion) by 3% and 5%, respectively. Our methodology contributes to the development of accurate and efficient dysgraphia diagnosis tools, requiring only a single instance of multimodal word/pseudoword data to determine handwriting impairment. This work highlights the potential of multimodal learning in enhancing dysgraphia diagnosis, paving the way for accessible and practical diagnostic tools.
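The conditional feature fusion rule can be sketched as below. The function name, the soft-vote average, and the default threshold are illustrative assumptions rather than the paper's exact formulation: unimodal predictions are soft-voted when both classifiers are confident, otherwise the fused-feature classifier decides.

```python
import numpy as np

def conditional_fusion_predict(p_online, p_offline, p_fused, threshold=0.8):
    """Ensemble with conditional feature fusion (illustrative logic):
    soft-vote the unimodal classifiers when both are confident; otherwise
    fall back to a classifier trained on fused online+offline features.

    Each argument is a class-probability vector for one sample."""
    conf_online = np.max(p_online)
    conf_offline = np.max(p_offline)
    if conf_online >= threshold and conf_offline >= threshold:
        return int(np.argmax((p_online + p_offline) / 2))  # soft vote
    return int(np.argmax(p_fused))  # low confidence -> feature fusion
```

The design keeps the cheaper unimodal path for easy samples and reserves the fused model for ambiguous ones.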
https://arxiv.org/abs/2408.13754
Defining script types and establishing classification criteria for medieval handwriting is a central aspect of palaeographical analysis. However, existing typologies often encounter methodological challenges, such as descriptive limitations and subjective criteria. We propose an interpretable deep learning-based approach to morphological script type analysis, which enables systematic and objective analysis and contributes to bridging the gap between qualitative observations and quantitative measurements. More precisely, we adapt a deep instance segmentation method to learn comparable character prototypes, representative of letter morphology, and provide qualitative and quantitative tools for their comparison and analysis. We demonstrate our approach by applying it to the Textualis Formata script type and its two subtypes formalized by A. Derolez: Northern and Southern Textualis.
https://arxiv.org/abs/2408.11150
The Bengali language is the 5th most spoken native language and the 7th most spoken language in the world, and Bengali handwritten character recognition has attracted researchers for decades. Work on other languages such as English, Arabic, Turkish, and Chinese has contributed significantly to the development of handwriting recognition systems; still, little research has been done on Bengali character recognition because of the similarity between characters, their curvature, and other complexities, although many researchers have applied traditional machine learning and deep learning models to Bengali handwritten recognition. This study employed a convolutional neural network (CNN) with ensemble transfer learning and a multichannel attention network. We generated features from two CNN branches, Inception Net and ResNet, and produced an ensemble feature fusion by concatenating them. We then applied an attention module to produce contextual information from the ensemble features, and finally a classification module to refine the features and perform classification. We evaluated the proposed model using the CAMTERdb 3.1.2 dataset and achieved 92% accuracy on the raw dataset and 98.00% on the preprocessed dataset. We believe our contribution will be a significant development in the Bengali handwritten character recognition domain.
https://arxiv.org/abs/2408.10955
Freeform handwriting authentication verifies a person's identity from their writing style and habits in messy handwriting data. This technique has gained widespread attention in recent years as a valuable tool for various fields, e.g., fraud prevention and cultural heritage protection. However, it still remains a challenging task in reality for three reasons: (i) severe damage, (ii) complex high-dimensional features, and (iii) lack of supervision. To address these issues, we propose SherlockNet, an energy-oriented two-branch contrastive self-supervised learning framework for robust and fast freeform handwriting authentication. It consists of four stages: (i) pre-processing: converting manuscripts into energy distributions using a novel plug-and-play energy-oriented operator to eliminate the influence of noise; (ii) generalized pre-training: learning general representations through two-branch momentum-based adaptive contrastive learning with the energy distributions, which handles the high-dimensional features and spatial dependencies of handwriting; (iii) personalized fine-tuning: calibrating the learned knowledge using a small amount of labeled data from downstream tasks; and (iv) practical application: identifying individual handwriting from scrambled, missing, or forged data efficiently and conveniently. With practicality in mind, we construct EN-HA, a novel dataset that simulates data forgery and severe damage in real applications. Finally, we conduct extensive experiments on six benchmark datasets including our EN-HA, and the results prove the robustness and efficiency of SherlockNet.
https://arxiv.org/abs/2408.09676
Teaching Computer Science (CS) by having students write programs by hand on paper has key pedagogical advantages: It allows focused learning and requires careful thinking compared to the use of Integrated Development Environments (IDEs) with intelligent support tools or "just trying things out". The familiar environment of pens and paper also lessens the cognitive load of students with no prior experience with computers, for whom the mere basic usage of computers can be intimidating. Finally, this teaching approach opens learning opportunities to students with limited access to computers. However, a key obstacle is the current lack of teaching methods and support software for working with and running handwritten programs. Optical character recognition (OCR) of handwritten code is challenging: Minor OCR errors, perhaps due to varied handwriting styles, easily make code not run, and recognizing indentation is crucial for languages like Python but is difficult to do due to inconsistent horizontal spacing in handwriting. Our approach integrates two innovative methods. The first combines OCR with an indentation recognition module and a language model designed for post-OCR error correction without introducing hallucinations. This method, to our knowledge, surpasses all existing systems in handwritten code recognition. It reduces error from 30% in the state of the art to 5% with minimal hallucination of logical fixes to student programs. The second method leverages a multimodal language model to recognize handwritten programs in an end-to-end fashion. We hope this contribution can stimulate further pedagogical research and contribute to the goal of making CS education universally accessible. We release a dataset of handwritten programs and code to support future research at this https URL
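A purely geometric baseline for the indentation problem, quantising each handwritten line's horizontal start position to a discrete level, can be sketched as below. This is illustrative only; a learned indentation recognition module like the one described above would be far more robust than this crude unit estimate.

```python
def indent_levels(line_start_x, unit=None):
    """Map the horizontal start position (in pixels) of each handwritten
    line to a discrete indentation level. The smallest start is taken as
    level 0 and the remaining offsets are quantised to multiples of an
    (estimated or given) indent unit."""
    margin = min(line_start_x)
    offsets = [x - margin for x in line_start_x]
    if unit is None:
        nonzero = [o for o in offsets if o > 0]
        unit = min(nonzero) if nonzero else 1.0  # crude unit estimate
    return [round(o / unit) for o in offsets]
```

For example, start positions `[10, 52, 51, 10, 95]` yield levels `[0, 1, 1, 0, 2]`, which is enough to reconstruct a Python-style block structure for a clean page, though real handwriting needs tolerance for drift across the page.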
https://arxiv.org/abs/2408.07220
Considering digital ink as plane curves provides a valuable framework for various applications, including signature verification, note-taking, and mathematical handwriting recognition. These plane curves can be obtained as parameterized pairs of approximating truncated series (x(s), y(s)) determined by sampled points. Earlier work has found that representing these truncated series (polynomials) in a Legendre or Legendre-Sobolev basis has a number of desirable properties. These include compact data representation, meaningful clustering of like symbols in the vector space of polynomial coefficients, linear separability of classes in this space, and highly efficient calculation of variation between curves. In this work, we take a first step toward examining the use of Chebyshev-Sobolev series for symbol recognition. The early indication is that this representation may be superior to the Legendre-Sobolev representation for some purposes.
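A plain least-squares Chebyshev fit already illustrates the representation; note it omits the Sobolev inner product the work actually studies, and `curve_coeffs` with its uniform parameterisation over [-1, 1] is an illustrative choice (arc-length parameterisation would be closer to practice).

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def curve_coeffs(points, degree=8):
    """Represent a digital-ink stroke as a pair of truncated Chebyshev
    series (x(s), y(s)) parameterised over s in [-1, 1].

    `points` is an (n, 2) array of sampled pen positions; returns the two
    coefficient vectors, a compact fixed-size descriptor of the curve."""
    points = np.asarray(points, dtype=float)
    s = np.linspace(-1.0, 1.0, len(points))
    cx = C.chebfit(s, points[:, 0], degree)
    cy = C.chebfit(s, points[:, 1], degree)
    return cx, cy

def curve_eval(cx, cy, s):
    """Evaluate the truncated series back to plane-curve points."""
    return np.column_stack([C.chebval(s, cx), C.chebval(s, cy)])
```

The coefficient vectors are what gets stored, clustered, or classified: two length-(degree+1) arrays replace an arbitrarily long point sequence.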
https://arxiv.org/abs/2408.02135
Handwriting Verification is a critical task in document forensics. Deep learning based approaches often face skepticism from forensic document examiners due to their lack of explainability and their reliance on extensive training data and handcrafted features. This paper explores using Vision Language Models (VLMs), such as OpenAI's GPT-4o and Google's PaliGemma, to address these challenges. By leveraging their Visual Question Answering capabilities and 0-shot Chain-of-Thought (CoT) reasoning, our goal is to provide clear, human-understandable explanations for model decisions. Our experiments on the CEDAR handwriting dataset demonstrate that VLMs offer enhanced interpretability, reduce the need for large training datasets, and adapt better to diverse handwriting styles. However, results show that the CNN-based ResNet-18 architecture outperforms the 0-shot CoT prompt-engineering approach with GPT-4o (accuracy: 70%) and supervised fine-tuned PaliGemma (accuracy: 71%), achieving an accuracy of 84% on the CEDAR AND dataset. These findings highlight the potential of VLMs in generating human-interpretable decisions while underscoring the need for further advancements to match the performance of specialized deep learning models.
https://arxiv.org/abs/2407.21788
Schizophrenia is a globally prevalent psychiatric disorder that severely impairs daily life. Schizophrenia is caused by dopamine imbalances in the fronto-striatal pathways of the brain, which influences fine motor control in the cerebellum. This leads to abnormalities in handwriting. The goal of this study was to develop an accurate, objective, and accessible computational method to be able to distinguish schizophrenic handwriting samples from non-schizophrenic handwriting samples. To achieve this, data from Crespo et al. (2019) was used, which contains images of handwriting samples from schizophrenic and non-schizophrenic patients. The data was preprocessed and augmented to produce a more robust model that can recognize different types of handwriting. The data was used to train several different convolutional neural networks, and the model with the base architecture of InceptionV3 performed the best, differentiating between the two types of image with a 92% accuracy rate. To make this model accessible, a secure website was developed for medical professionals to use for their patients. Such a result suggests that handwriting analysis through computational models holds promise as a non-invasive and objective method for clinicians to diagnose and monitor schizophrenia.
https://arxiv.org/abs/2408.06347
In this study, we introduce StylusAI, a novel architecture leveraging diffusion models in the domain of handwriting style generation. StylusAI is specifically designed to adapt and integrate the stylistic nuances of one language's handwriting into another, particularly focusing on blending English handwriting styles into the context of the German writing system. This approach enables the generation of German text in English handwriting styles and German handwriting styles into English, enriching machine-generated handwriting diversity while ensuring that the generated text remains legible across both languages. To support the development and evaluation of StylusAI, we present the "Deutscher Handschriften-Datensatz" (DHSD), a comprehensive dataset encompassing 37 distinct handwriting styles within the German language. This dataset provides a fundamental resource for training and benchmarking in the realm of handwritten text generation. Our results demonstrate that StylusAI not only introduces a new method for style adaptation in handwritten text generation but also surpasses existing models in generating handwriting samples that improve both text quality and stylistic fidelity, evidenced by its performance on the IAM database and our newly proposed DHSD. Thus, StylusAI represents a significant advancement in the field of handwriting style generation, offering promising avenues for future research and applications in cross-linguistic style adaptation for languages with similar scripts.
https://arxiv.org/abs/2407.15608
Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces Qalam, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train Qalam on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, Qalam demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore Qalam's potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency.
https://arxiv.org/abs/2407.13559
Information extraction from handwritten documents involves traditionally three distinct steps: Document Layout Analysis, Handwritten Text Recognition, and Named Entity Recognition. Recent approaches have attempted to integrate these steps into a single process using fully end-to-end architectures. Despite this, these integrated approaches have not yet matched the performance of language models, when applied to information extraction in plain text. In this paper, we introduce DANIEL (Document Attention Network for Information Extraction and Labelling), a fully end-to-end architecture integrating a language model and designed for comprehensive handwritten document understanding. DANIEL performs layout recognition, handwriting recognition, and named entity recognition on full-page documents. Moreover, it can simultaneously learn across multiple languages, layouts, and tasks. For named entity recognition, the ontology to be applied can be specified via the input prompt. The architecture employs a convolutional encoder capable of processing images of any size without resizing, paired with an autoregressive decoder based on a transformer-based language model. DANIEL achieves competitive results on four datasets, including a new state-of-the-art performance on RIMES 2009 and M-POPP for Handwriting Text Recognition, and IAM NER for Named Entity Recognition. Furthermore, DANIEL is much faster than existing approaches. We provide the source code and the weights of the trained models at this https URL.
https://arxiv.org/abs/2407.09103