This study presents a hybrid model for classifying handwritten digits in the MNIST dataset, combining convolutional neural networks (CNNs) with a multi-well Hopfield network. The approach employs a CNN to extract high-dimensional features from input images, which are then clustered into class-specific prototypes using k-means clustering. These prototypes serve as attractors in a multi-well energy landscape, where a Hopfield network performs classification by minimizing an energy function that balances feature similarity and class membership. The model's design enables robust handling of intraclass variability, such as diverse handwriting styles, while providing an interpretable framework through its energy-based decision process. Through systematic optimization of the CNN architecture and the number of wells, the model achieves a high test accuracy of 99.2% on 10,000 MNIST images, demonstrating its effectiveness for image classification tasks. The findings highlight the critical role of deep feature extraction and sufficient prototype coverage in achieving high performance, with potential for broader applications in pattern recognition.
https://arxiv.org/abs/2507.08766
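Below is a minimal sketch of the classification rule the abstract describes, assuming CNN features have already been extracted: k-means builds class-specific prototypes (wells), and a sample is assigned to the class whose nearest well has the lowest energy. The squared-distance energy is a simplification; the paper's energy function also balances a class-membership term.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_prototypes(feats, labels, wells_per_class=5, n_classes=10, seed=0):
    """Cluster each class's CNN features into `wells_per_class` prototypes."""
    protos = []
    for c in range(n_classes):
        km = KMeans(n_clusters=wells_per_class, n_init=10, random_state=seed)
        km.fit(feats[labels == c])
        protos.append(km.cluster_centers_)   # (wells, dim)
    return np.stack(protos)                  # (classes, wells, dim)

def classify(x, protos):
    """Predict argmin_c min_k E_{c,k}(x) with E_{c,k}(x) = ||x - m_{c,k}||^2."""
    energies = ((protos - x) ** 2).sum(axis=-1)   # (classes, wells)
    return int(energies.min(axis=1).argmin())

# Toy usage with random vectors standing in for CNN embeddings.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 64)).astype(np.float32)
labels = rng.integers(0, 10, size=2000)
print(classify(feats[0], fit_prototypes(feats, labels)))
```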
This paper introduces SSSUMO, a semi-supervised deep learning approach for submovement decomposition that achieves state-of-the-art accuracy and speed. While submovement analysis offers valuable insights into motor control, existing methods struggle with reconstruction accuracy, computational cost, and validation due to the difficulty of obtaining hand-labeled data. We address these challenges using a semi-supervised learning framework. This framework learns from synthetic data, initially generated from minimum-jerk principles and then iteratively refined through adaptation to unlabeled human movement data. Our fully convolutional architecture with differentiable reconstruction significantly surpasses existing methods on both synthetic and diverse human motion datasets, demonstrating robustness even in high-noise conditions. Crucially, the model operates in real time (less than a millisecond per input second), a substantial improvement over optimization-based techniques. This enhanced performance facilitates new applications in human-computer interaction, rehabilitation medicine, and motor control studies. We demonstrate the model's effectiveness across diverse human-performed tasks such as steering, rotation, pointing, object moving, handwriting, and mouse-controlled gaming, showing notable improvements particularly on challenging datasets where traditional methods largely fail. Training and benchmarking source code, along with pre-trained model weights, are made publicly available at this https URL.
https://arxiv.org/abs/2507.08028
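The synthetic-data seed of the pipeline is the minimum-jerk profile, whose closed-form velocity is standard; how SSSUMO overlaps and labels submovements is not specified here, so the superposition below is only an illustrative assumption.

```python
import numpy as np

def min_jerk_velocity(t, t0, duration, amplitude):
    """Velocity of one minimum-jerk submovement starting at t0 (zero outside)."""
    tau = np.clip((t - t0) / duration, 0.0, 1.0)
    return amplitude / duration * (30 * tau**2 - 60 * tau**3 + 30 * tau**4)

# Superimpose a few overlapping submovements into a 100 Hz velocity trace.
t = np.linspace(0.0, 2.0, 200)
v = sum(min_jerk_velocity(t, t0, d, a)
        for t0, d, a in [(0.10, 0.40, 1.0), (0.35, 0.50, 0.6), (0.90, 0.30, -0.8)])
```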
Offline Handwritten Text Recognition (HTR) systems play a crucial role in applications such as historical document digitization, automatic form processing, and biometric authentication. However, their performance is often hindered by the limited availability of annotated training data, particularly for low-resource languages and complex scripts. This paper presents a comprehensive survey of offline handwritten data augmentation and generation techniques designed to improve the accuracy and robustness of HTR systems. We systematically examine traditional augmentation methods alongside recent advances in deep learning, including Generative Adversarial Networks (GANs), diffusion models, and transformer-based approaches. Furthermore, we explore the challenges associated with generating diverse and realistic handwriting samples, particularly in preserving script authenticity and addressing data scarcity. This survey follows the PRISMA methodology, ensuring a structured and rigorous selection process. Our analysis began with 1,302 primary studies, which were filtered down to 848 after removing duplicates, drawing from key academic sources such as IEEE Digital Library, Springer Link, Science Direct, and ACM Digital Library. By evaluating existing datasets, assessment metrics, and state-of-the-art methodologies, this survey identifies key research gaps and proposes future directions to advance the field of handwritten text generation across diverse linguistic and stylistic landscapes.
https://arxiv.org/abs/2507.06275
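As a concrete instance of the traditional augmentation family the survey covers, here is a sketch of elastic distortion (Simard et al.), a staple for handwriting data; the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_distort(img, alpha=34.0, sigma=4.0, seed=0):
    """Warp a grayscale (H, W) image with a smoothed random displacement field."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(img, [ys + dy, xs + dx], order=1, mode="reflect")

augmented = elastic_distort(np.zeros((64, 256)))  # e.g. a text-line crop
```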
Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context. Across 4 video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT) and improves all evaluated metrics (PSNR, SSIM, LPIPS) by an average of 36.7% at the same or higher compression ratios. When a video generation model is trained using RefTok's latents on the BAIR Robot Pushing task, the generations not only outperform MAGVIT-B but also the larger MAGVIT-L, which has 4x more parameters, across all generation metrics by an average of 27.9%.
https://arxiv.org/abs/2507.02862
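A rough PyTorch sketch of the reference-conditioning idea: tokens of the frame being decoded cross-attend to tokens of the unquantized reference frame, letting appearance detail be copied from context. The single attention block and its dimensions are assumptions, not RefTok's actual architecture.

```python
import torch
import torch.nn as nn

class RefConditionedBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                nn.Linear(dim * 4, dim))

    def forward(self, frame_tokens, ref_tokens):
        # Queries come from the frame being coded; keys/values come from the
        # unquantized reference frame.
        x, _ = self.attn(frame_tokens, ref_tokens, ref_tokens)
        x = frame_tokens + x
        return x + self.ff(x)

blk = RefConditionedBlock()
out = blk(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```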
Handwritten signatures, as an important means of identity recognition, are widely used in fields such as financial transactions, commercial contracts, and personal affairs due to their legal effect and uniqueness. In forensic science appraisals, the analysis of offline handwritten signatures requires the appraiser to provide a certain number of signature samples, which are usually derived from various historical contracts or archival materials. However, the provided handwriting samples are often mixed with a large amount of interfering information, which poses severe challenges for handwriting identification work. This study proposes a signature handwriting denoising model based on an improved U-net structure, aiming to enhance the robustness of the signature recognition system. By introducing the discrete wavelet transform and PCA, the model's ability to suppress noise is enhanced. The experimental results show that this model is significantly superior to traditional methods in denoising effect, can effectively improve the clarity and readability of signed images, and provides more reliable technical support for signature analysis and recognition.
https://arxiv.org/abs/2507.00365
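For the wavelet component, a standard baseline looks like the sketch below: soft-thresholding of DWT detail coefficients. This stands in for the paper's trained U-net and is only a preprocessing-style assumption.

```python
import numpy as np
import pywt

def dwt_denoise(img, wavelet="db2", level=2, thresh=0.08):
    """Soft-threshold the detail bands of a [0, 1] grayscale image."""
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    out = [coeffs[0]]                                  # keep approximation band
    for cH, cV, cD in coeffs[1:]:
        out.append(tuple(pywt.threshold(c, thresh, mode="soft")
                         for c in (cH, cV, cD)))
    return pywt.waverec2(out, wavelet)

denoised = dwt_denoise(np.random.rand(128, 128))
```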
We posit that handwriting recognition benefits from complementary cues carried by the rasterized complex glyph and the pen's trajectory, yet most systems exploit only one modality. We introduce an end-to-end network that performs early fusion of offline images and online stroke data within a shared latent space. A patch encoder converts the grayscale crop into fixed-length visual tokens, while a lightweight transformer embeds the $(x, y, \text{pen})$ sequence. Learnable latent queries attend jointly to both token streams, yielding context-enhanced stroke embeddings that are pooled and decoded under a cross-entropy loss objective. Because integration occurs before any high-level classification, temporal cues reinforce each other during representation learning, producing stronger writer independence. Comprehensive experiments on IAMOn-DB and VNOn-DB demonstrate that our approach achieves state-of-the-art accuracy, exceeding previous bests by up to 1%. Our study also shows the adaptation of this pipeline to gesturification on the ISI-Air dataset. Our code can be found here.
https://arxiv.org/abs/2506.20255
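The fusion step can be pictured as below: learnable latent queries attend over the concatenation of image patch tokens and stroke tokens in the shared latent space. Dimensions and mean-pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    def __init__(self, dim=256, n_queries=32, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, stroke_tokens):
        ctx = torch.cat([img_tokens, stroke_tokens], dim=1)  # both modalities
        q = self.queries.expand(ctx.size(0), -1, -1)
        fused, _ = self.attn(q, ctx, ctx)   # (B, n_queries, dim)
        return fused.mean(dim=1)            # pooled embedding for the decoder

fusion = LatentFusion()
emb = fusion(torch.randn(4, 196, 256), torch.randn(4, 120, 256))
```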
We developed a proof-of-concept method for the automatic analysis of the structure and content of incunabula pages. A custom dataset comprising 500 annotated pages from five different incunabula was created using resources from the Jagiellonian Digital Library. Each page was manually labeled with five predefined classes: Text, Title, Picture, Table, and Handwriting. Additionally, the publicly available DocLayNet dataset was utilized as supplementary training data. To perform object detection, YOLO11n and YOLO11s models were employed and trained using two strategies: a combined dataset (DocLayNet and the custom dataset) and the custom dataset alone. The highest performance (F1 = 0.94) was achieved by the YOLO11n model trained exclusively on the custom data. Optical character recognition was then conducted on regions classified as Text, using both Tesseract and Kraken OCR, with Tesseract demonstrating superior results. Subsequently, image classification was applied to the Picture class using a ResNet18 model, achieving an accuracy of 98.7% across five subclasses: Decorative_letter, Illustration, Other, Stamp, and Wrong_detection. Furthermore, the CLIP model was utilized to generate semantic descriptions of illustrations. The results confirm the potential of machine learning in the analysis of early printed books, while emphasizing the need for further advancements in OCR performance and visual content interpretation.
https://arxiv.org/abs/2506.18069
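The detection-then-OCR stage of the pipeline could be wired up roughly as follows; the weight file, class index, and OCR language are placeholders for the paper's fine-tuned models and settings.

```python
from ultralytics import YOLO
from PIL import Image
import pytesseract

model = YOLO("incunabula_yolo11n.pt")   # hypothetical fine-tuned weights
page = Image.open("page_001.jpg")
result = model(page)[0]

TEXT_CLASS = 0                          # assumed index of the Text class
for box in result.boxes:
    if int(box.cls) == TEXT_CLASS:
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crop = page.crop((x1, y1, x2, y2))
        print(pytesseract.image_to_string(crop, lang="lat"))
```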
Handwritten text recognition aims to convert visual input into machine-readable text, and it remains challenging due to the evolving and context-dependent nature of handwriting. Character sets change over time, and character frequency distributions shift across historical periods or regions, often causing models trained on broad, heterogeneous corpora to underperform on specific subsets. To tackle this, we propose a novel loss function that incorporates the Wasserstein distance between the character frequency distribution of the predicted text and a target distribution empirically derived from training data. By penalizing divergence from expected distributions, our approach enhances both accuracy and robustness under temporal and contextual intra-dataset shifts. Furthermore, we demonstrate that character distribution alignment can also improve existing models at inference time without requiring retraining by integrating it as a scoring function in a guided decoding scheme. Experimental results across multiple datasets and architectures confirm the effectiveness of our method in boosting generalization and performance. We open source our code at this https URL.
https://arxiv.org/abs/2506.09846
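A sketch of the loss idea: compare the soft character-frequency distribution of the prediction against a target distribution with a 1-Wasserstein penalty. Treating the alphabet as a 1-D ordered support, so that W1 reduces to a CDF difference, is an assumption made here for differentiability.

```python
import torch

def wasserstein_freq_loss(pred_probs, target_freq):
    """pred_probs: per-step softmax outputs (T, V); target_freq: (V,)."""
    pred_freq = pred_probs.mean(dim=0)          # soft character frequencies
    return torch.cumsum(pred_freq - target_freq, dim=0).abs().sum()

T, V = 40, 80                                   # decoding steps, alphabet size
pred = torch.softmax(torch.randn(T, V, requires_grad=True), dim=-1)
target = torch.softmax(torch.randn(V), dim=-1)  # e.g. empirical train counts
loss = wasserstein_freq_loss(pred, target)
loss.backward()
```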
This article presents a large-scale effort to create a structured dataset of internal migration in Finland between 1800 and 1920 using digitized church moving records. These records, maintained by Evangelical-Lutheran parishes, document the migration of individuals and families and offer a valuable source for studying historical demographic patterns. The dataset includes over six million entries extracted from approximately 200,000 images of handwritten migration records. The data extraction process was automated using a deep learning pipeline that included layout analysis, table detection, cell classification, and handwriting recognition. The complete pipeline was applied to all images, resulting in a structured dataset suitable for research. The dataset can be used to study internal migration, urbanization, family migration, and the spread of disease in preindustrial Finland. A case study from the Elimäki parish shows how local migration histories can be reconstructed. The work demonstrates how large volumes of handwritten archival material can be transformed into structured data to support historical and demographic research.
https://arxiv.org/abs/2506.07960
This paper investigates the task of writer retrieval, which identifies documents authored by the same individual within a dataset based on handwriting similarities. While existing datasets and methodologies primarily focus on page-level retrieval, we explore the impact of text quantity on writer retrieval performance by evaluating line- and word-level retrieval. We examine three state-of-the-art writer retrieval systems, including both handcrafted and deep learning-based approaches, and analyze their performance using varying amounts of text. Our experiments on the CVL and IAM datasets demonstrate that while performance decreases by 20-30% when only one line of text is used as query and gallery, retrieval accuracy remains above 90% of full-page performance when at least four lines are included. We further show that text-dependent retrieval can maintain strong performance in low-text scenarios. Our findings also highlight the limitations of handcrafted features in low-text scenarios, with deep learning-based methods like NetVLAD outperforming traditional VLAD encoding.
https://arxiv.org/abs/2506.07566
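The evaluation protocol amounts to nearest-neighbour ranking over descriptors; a leave-one-out top-1 sketch is below, with random vectors standing in for VLAD/NetVLAD embeddings.

```python
import numpy as np

def top1_accuracy(embs, writer_ids):
    x = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = x @ x.T                        # cosine similarity, all pairs
    np.fill_diagonal(sims, -np.inf)       # leave-one-out: drop self-matches
    nearest = sims.argmax(axis=1)
    return float((writer_ids[nearest] == writer_ids).mean())

rng = np.random.default_rng(0)
print(top1_accuracy(rng.normal(size=(300, 128)), rng.integers(0, 30, 300)))
```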
Although vision-language and large language models (VLMs and LLMs) offer promising opportunities for AI-driven educational assessment, their effectiveness in real-world classroom settings, particularly in underrepresented educational contexts, remains underexplored. In this study, we evaluated the performance of a state-of-the-art VLM and several LLMs on 646 handwritten exam responses from grade 4 students in six Indonesian schools, covering two subjects: Mathematics and English. These sheets contain more than 14K student answers spanning multiple-choice, short-answer, and essay questions. Assessment tasks include grading these responses and generating personalized feedback. Our findings show that the VLM often struggles to accurately recognize student handwriting, leading to error propagation in downstream LLM grading. Nevertheless, LLM-generated feedback retains some utility, even when derived from imperfect input, although limitations in personalization and contextual relevance persist.
https://arxiv.org/abs/2506.04822
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: this https URL
https://arxiv.org/abs/2505.23566
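For the Symbol Counting task, a counting target could plausibly be derived from the LaTeX ground truth as sketched below; the actual tokenization and prompt format used by Uni-MuMER are assumptions here.

```python
import re
from collections import Counter

def symbol_counts(latex: str) -> Counter:
    """Count LaTeX commands (e.g. \\frac) and single visible symbols."""
    tokens = re.findall(r"\\[A-Za-z]+|[^\s{}]", latex)
    return Counter(tokens)

print(symbol_counts(r"\frac{x^2+1}{\sqrt{y}}"))
# Counter({'\\frac': 1, 'x': 1, '^': 1, '2': 1, '+': 1, '1': 1, '\\sqrt': 1, 'y': 1})
```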
Recent advancements in handwritten text recognition (HTR) have enabled the effective conversion of handwritten text to digital formats. However, achieving robust recognition across diverse writing styles remains challenging. Traditional HTR methods lack writer-specific personalization at test time due to limitations in model architecture and training strategies. Existing attempts to bridge this gap, through gradient-based meta-learning, still require labeled examples and suffer from parameter-inefficient fine-tuning, leading to substantial computational and memory overhead. To overcome these challenges, we propose an efficient framework that formulates personalization as prompt tuning, incorporating an auxiliary image reconstruction task with a self-supervised loss to guide prompt adaptation with unlabeled test-time examples. To ensure self-supervised loss effectively minimizes text recognition error, we leverage meta-learning to learn the optimal initialization of the prompts. As a result, our method allows the model to efficiently capture unique writing styles by updating less than 1% of its parameters and eliminating the need for time-intensive annotation processes. We validate our approach on the RIMES and IAM Handwriting Database benchmarks, where it consistently outperforms previous state-of-the-art methods while using 20x fewer parameters. We believe this represents a significant advancement in personalized handwritten text recognition, paving the way for more reliable and practical deployment in resource-constrained scenarios.
https://arxiv.org/abs/2505.20513
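The adaptation loop can be sketched as below: only a small prompt tensor is optimized at test time, driven by a self-supervised reconstruction loss on unlabeled samples, while the recognizer stays frozen. The tiny stand-in modules and loss pairing are assumptions; the meta-learned prompt initialization is taken as given.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyHTR(nn.Module):
    """Stand-in recognizer that conditions its features on a prompt vector."""
    def __init__(self, d_img=28 * 28, d=64):
        super().__init__()
        self.enc = nn.Linear(d_img + d, d)

    def forward(self, imgs, prompt):
        x = torch.cat([imgs.flatten(1), prompt.expand(imgs.size(0), -1)], dim=1)
        return torch.relu(self.enc(x))

model, recon = TinyHTR(), nn.Linear(64, 28 * 28)
for p in list(model.parameters()) + list(recon.parameters()):
    p.requires_grad_(False)                      # backbone stays frozen

prompt = torch.zeros(1, 64, requires_grad=True)  # meta-learned init assumed
opt = torch.optim.Adam([prompt], lr=1e-2)
imgs = torch.rand(8, 1, 28, 28)                  # unlabeled test-time samples
for _ in range(10):                              # prompt-only adaptation
    loss = F.mse_loss(recon(model(imgs, prompt)), imgs.flatten(1))
    opt.zero_grad(); loss.backward(); opt.step()
```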
Humans can quickly generalize handwriting styles from a single example by intuitively separating content from style. Machines, however, struggle with this task, especially in low-data settings, often missing subtle spatial and stylistic cues. Motivated by this gap, we introduce WriteViT, a one-shot handwritten text synthesis framework that incorporates Vision Transformers (ViT), a family of models that have shown strong performance across various computer vision tasks. WriteViT integrates a ViT-based Writer Identifier for extracting style embeddings, a multi-scale generator built with Transformer encoder-decoder blocks enhanced by conditional positional encoding (CPE), and a lightweight ViT-based recognizer. While previous methods typically rely on CNNs or CRNNs, our design leverages transformers in key components to better capture both fine-grained stroke details and higher-level style information. Although handwritten text synthesis has been widely explored, its application to Vietnamese -- a language rich in diacritics and complex typography -- remains limited. Experiments on Vietnamese and English datasets demonstrate that WriteViT produces high-quality, style-consistent handwriting while maintaining strong recognition performance in low-resource scenarios. These results highlight the promise of transformer-based designs for multilingual handwriting generation and efficient style adaptation.
https://arxiv.org/abs/2505.13235
Handwritten fonts have a distinct expressive character, but they are often difficult to read due to unclear or inconsistent handwriting. FontFusionGAN (FFGAN) is a novel method for improving handwritten fonts by combining them with printed fonts. Our method uses a generative adversarial network (GAN) to generate fonts that mix the desirable features of handwritten and printed fonts. By training the GAN on a dataset of handwritten and printed fonts, it can generate legible and visually appealing font images. We apply our method to a dataset of handwritten fonts and demonstrate that it significantly enhances the readability of the original fonts while preserving their unique aesthetic. Our method has the potential to improve the readability of handwritten fonts, which would be helpful for a variety of applications including document creation, letter writing, and assisting individuals with reading and writing difficulties. In addition to addressing the difficulties of font creation for languages with complex character sets, our method is applicable to other text-image-related tasks, such as font attribute control and multilingual font style transfer.
https://arxiv.org/abs/2505.12834
We present MarkMatch, a retrieval system for detecting whether two paper ballot marks were filled by the same hand. Unlike the previous SOTA method BubbleSig, which used binary classification on isolated mark pairs, MarkMatch ranks stylistic similarity between a query mark and a mark in the database using contrastive learning. Our model is trained with a dense batch similarity matrix and a dual loss objective. Each sample is contrasted against many negatives within each batch, enabling the model to learn subtle handwriting differences and improve generalization under handwriting variation and visual noise, while diagonal supervision reinforces high confidence on true matches. The model achieves an F1 score of 0.943, surpassing BubbleSig's best performance. MarkMatch also integrates the Segment Anything Model for flexible mark extraction via box- or point-based prompts. The system offers election auditors a practical tool for visual, non-biometric investigation of suspicious ballots.
https://arxiv.org/abs/2505.07032
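One way to read the dense-batch setup is an InfoNCE-style symmetric objective over the N x N similarity matrix, where the diagonal carries the true query-gallery matches; this pairing is an interpretation of the dual loss, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_contrastive_loss(q, g, temperature=0.07):
    q, g = F.normalize(q, dim=1), F.normalize(g, dim=1)
    sims = q @ g.T / temperature          # (N, N): every pair in the batch
    targets = torch.arange(q.size(0))     # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(sims, targets) +
                  F.cross_entropy(sims.T, targets))

loss = dual_contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```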
This course design aims to develop and research a system for handwritten matrix recognition and step-by-step visual display of the calculation process, addressing the abstract formulas and complex calculation steps that students find difficult to understand when learning mathematics. By integrating artificial intelligence with visualization animation technology, the system enhances precise recognition of handwritten matrix content through the introduction of a Mamba backbone network, completes digit extraction and matrix reconstruction using the YOLO model, and combines the CoordAttention coordinate attention mechanism to improve the accurate grasp of character spatial positions. The calculation process is demonstrated frame by frame through the Manim animation engine, vividly showcasing each mathematical calculation step and helping students intuitively understand the intrinsic logic of mathematical operations. By dynamically generating animation processes for different computational tasks, the system exhibits high modularity and flexibility, capable of generating various mathematical operation examples in real time according to student needs. By innovating human-computer interaction methods, it brings mathematical calculation processes to life, helping students bridge the gap between knowledge and understanding on a deeper level, ultimately achieving a learning experience where "every step is understood." The system's scalability and interactivity make it an intuitive, user-friendly, and efficient auxiliary tool in education.
https://arxiv.org/abs/2505.03800
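A toy Manim scene in the spirit of the frame-by-frame display: each animation reveals one step of a 2x2 matrix addition. The recognition stage (Mamba/YOLO) is assumed to have already produced the operands; render with `manim -pql scene.py MatrixAdditionScene`.

```python
from manim import Scene, MathTex, Write, FadeIn, DOWN

class MatrixAdditionScene(Scene):
    def construct(self):
        problem = MathTex(
            r"\begin{pmatrix}1&2\\3&4\end{pmatrix}"
            r"+\begin{pmatrix}5&6\\7&8\end{pmatrix}"
        )
        step = MathTex(r"=\begin{pmatrix}1+5&2+6\\3+7&4+8\end{pmatrix}")
        answer = MathTex(r"=\begin{pmatrix}6&8\\10&12\end{pmatrix}")
        step.next_to(problem, DOWN)
        answer.next_to(step, DOWN)
        self.play(Write(problem))   # the recognized input
        self.play(FadeIn(step))     # element-wise sums, shown as one step
        self.play(FadeIn(answer))   # final result
```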
Parkinson's Disease (PD) is a progressive neurological disorder that primarily affects motor functions and can lead to mild cognitive impairment (MCI) and dementia in its advanced stages. With approximately 10 million people diagnosed globally (1 to 1.8 per 1,000 individuals, according to reports by the Japan Times and the Parkinson's Foundation), early and accurate diagnosis of PD is crucial for improving patient outcomes. While numerous studies have utilized machine learning (ML) and deep learning (DL) techniques for PD recognition, existing surveys are limited in scope, often focusing on single data modalities and failing to capture the potential of multimodal approaches. To address these gaps, this study presents a comprehensive review of PD recognition systems across diverse data modalities, including Magnetic Resonance Imaging (MRI), gait-based pose analysis, gait sensory data, handwriting analysis, speech test data, Electroencephalography (EEG), and multimodal fusion techniques. Based on over 347 articles from leading scientific databases, this review examines key aspects such as data collection methods, settings, feature representations, and system performance, with a focus on recognition accuracy and robustness. This survey aims to serve as a comprehensive resource for researchers, providing actionable guidance for the development of next-generation PD recognition systems. By leveraging diverse data modalities and cutting-edge machine learning paradigms, this work contributes to advancing the state of PD diagnostics and improving patient care through innovative, multimodal approaches.
https://arxiv.org/abs/2505.00525
Language students can increase their effectiveness in learning written Japanese by mastering the visual structure and written technique of Japanese kanji. Yet, existing kanji handwriting recognition systems do not assess written technique sufficiently to discourage students from developing bad learning habits. In this paper, we describe our work on Hashigo, a kanji sketch interactive system which achieves human instructor-level critique and feedback on both the visual structure and written technique of students' sketched kanji. This type of automated critique and feedback allows students to target and correct specific deficiencies in their sketches that, if left untreated, are detrimental to effective long-term kanji learning.
https://arxiv.org/abs/2504.13940
Handwritten Text Recognition (HTR) is essential for document analysis and digitization. However, handwritten data often contains user-identifiable information, such as unique handwriting styles and personal lexicon choices, which can compromise privacy and erode trust in AI services. Legislation like the "right to be forgotten" underscores the necessity for methods that can expunge sensitive information from trained models. Machine unlearning addresses this by selectively removing specific data from models without necessitating complete retraining. Yet, it frequently encounters a privacy-accuracy tradeoff, where safeguarding privacy leads to diminished model performance. In this paper, we introduce a novel two-stage unlearning strategy for a multi-head transformer-based HTR model, integrating pruning and random labeling. Our proposed method utilizes a writer classification head both as an indicator and a trigger for unlearning, while maintaining the efficacy of the recognition head. To our knowledge, this represents the first comprehensive exploration of machine unlearning within HTR tasks. We further employ Membership Inference Attacks (MIA) to evaluate the effectiveness of unlearning user-identifiable information. Extensive experiments demonstrate that our approach effectively preserves privacy while maintaining model accuracy, paving the way for new research directions in the document analysis community. Our code will be publicly available upon acceptance.
https://arxiv.org/abs/2504.08616
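The random-labeling stage might look like the outline below: the writer-classification head is pushed toward uniformly random writer labels on the forget set while the recognition pathway stays frozen. The loop is an assumed sketch, not the paper's released code; the pruning stage is omitted.

```python
import torch
import torch.nn as nn

feat_dim, n_writers = 256, 50
writer_head = nn.Linear(feat_dim, n_writers)
opt = torch.optim.Adam(writer_head.parameters(), lr=1e-4)

forget_feats = torch.randn(64, feat_dim)  # backbone features of the forget set
for _ in range(20):
    rand_labels = torch.randint(0, n_writers, (forget_feats.size(0),))
    loss = nn.functional.cross_entropy(writer_head(forget_feats), rand_labels)
    opt.zero_grad(); loss.backward(); opt.step()
# After unlearning, Membership Inference Attacks probe whether forget-set
# writers are still distinguishable from unseen ones.
```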