Handwriting remains an essential skill, particularly in education. Therefore, providing visual feedback on handwritten documents is an important but understudied area. We outline the many challenges involved in going from an image of handwritten input to correctly placed, informative error feedback. We empirically compare modular and end-to-end systems and find that neither approach currently achieves acceptable overall quality. We identify the major challenges and outline an agenda for future research.
https://arxiv.org/abs/2601.09586
Paleography is the study of ancient and historical handwriting; its key objectives include the dating of manuscripts and understanding the evolution of writing. Estimating when a document was written and tracing the development of scripts and writing styles can be aided by identifying the individual scribes who contributed to a medieval manuscript. Although digital technologies have made significant progress in this field, the general problem remains unsolved and continues to pose open challenges. ... We previously proposed an approach focused on identifying specific letters or abbreviations that characterize each writer. In that study, we considered the letter "a", as it was widely present on all pages of text and, according to the suggestions of expert paleographers, highly distinctive. We used template matching to detect the occurrences of the character "a" on each page and a convolutional neural network (CNN) to attribute each instance to the correct scribe. Building on the promising results of that system, and aware of the limitations of template matching, which requires an appropriate threshold to work, we experimented within the same framework with the YOLO object detection model to identify the scribes who contributed to the writing of different medieval books. We adopted the fifth version of YOLO (YOLOv5), which completely replaces the template matching and CNN used in the previous work. The experimental results demonstrate that YOLO extracts a greater number of the target letters, leading to a more accurate second-stage classification. Furthermore, the YOLO confidence score provides a foundation for developing a system that applies a rejection threshold, enabling reliable writer identification even in unseen manuscripts.
https://arxiv.org/abs/2601.04834
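The rejection idea in the abstract above can be sketched as a simple decision rule (names and the threshold value are illustrative, not the paper's implementation): keep only detections whose YOLO confidence clears a threshold, then attribute the page by majority vote over the second-stage writer predictions.

```python
# Hypothetical sketch of the two-stage decision rule: detections of the
# target letter below a confidence threshold are rejected, and the page
# is attributed to the scribe predicted most often among the survivors.
from collections import Counter

def attribute_page(detections, conf_threshold=0.5):
    """detections: list of (writer_label, confidence) pairs, one per
    detected occurrence of the target letter on the page."""
    kept = [w for w, c in detections if c >= conf_threshold]
    if not kept:
        return None  # reject the page: no reliable evidence
    return Counter(kept).most_common(1)[0][0]

page = [("scribe_A", 0.92), ("scribe_B", 0.41), ("scribe_A", 0.77), ("scribe_A", 0.63)]
print(attribute_page(page))  # the low-confidence scribe_B detection is discarded
```

Returning `None` when nothing clears the threshold is what enables the "reject unseen manuscripts" behavior the abstract mentions.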
Handwritten digit images lie in a high-dimensional pixel space but exhibit strong geometric and statistical structure. This paper investigates the latent organization of handwritten digits in the MNIST dataset using three complementary dimensionality reduction techniques: Principal Component Analysis (PCA), Factor Analysis (FA), and Uniform Manifold Approximation and Projection (UMAP). Rather than focusing on classification accuracy, we study how each method characterizes intrinsic dimensionality, shared variation, and nonlinear geometry. PCA reveals dominant global variance directions and enables high-fidelity reconstructions using a small number of components. FA decomposes digits into interpretable latent handwriting primitives corresponding to strokes, loops, and symmetry. UMAP uncovers nonlinear manifolds that reflect smooth stylistic transitions between digit classes. Together, these results demonstrate that handwritten digits occupy a structured low-dimensional manifold and that different statistical frameworks expose complementary aspects of this structure.
https://arxiv.org/abs/2601.06168
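The PCA reconstruction result described above can be reproduced in miniature with a numpy-only sketch (random data stands in for MNIST; the mechanics of projecting onto the top-k components are the same):

```python
import numpy as np

# Minimal PCA-by-SVD sketch: reconstruct centered data from the top-k
# principal components and check that more components give lower
# reconstruction error, as in the high-fidelity reconstructions above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # stand-in for flattened digit images
Xc = X - X.mean(axis=0)                 # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

def reconstruct(k):
    Z = Xc @ Vt[:k].T                   # project onto top-k components
    return Z @ Vt[:k]                   # map back to pixel space

err = [np.linalg.norm(Xc - reconstruct(k)) for k in (5, 20, 60)]
print(err)  # monotonically decreasing with k
```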
Handwritten STEM exams capture open-ended reasoning and diagrams, but manual grading is slow and difficult to scale. We present an end-to-end workflow for grading scanned handwritten engineering quizzes with multimodal large language models (LLMs) that preserves the standard exam process (A4 paper, unconstrained student handwriting). The lecturer provides only a handwritten reference solution (100%) and a short set of grading rules; the reference is converted into a text-only summary that conditions grading without exposing the reference scan. Reliability is achieved through a multi-stage design with a format/presence check to prevent grading blank answers, an ensemble of independent graders, supervisor aggregation, and rigid templates with deterministic validation to produce auditable, machine-parseable reports. We evaluate the frozen pipeline in a clean-room protocol on a held-out real course quiz in Slovenian, including hand-drawn circuit schematics. With state-of-the-art backends (GPT-5.2 and Gemini-3 Pro), the full pipeline achieves $\approx$8-point mean absolute difference to lecturer grades with low bias and an estimated manual-review trigger rate of $\approx$17% at $D_{\max}=40$. Ablations show that trivial prompting and removing the reference solution substantially degrade accuracy and introduce systematic over-grading, confirming that structured prompting and reference grounding are essential.
https://arxiv.org/abs/2601.00730
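The ensemble-plus-supervisor stage lends itself to a small sketch. The report format, score range, and median aggregation below are assumptions for illustration; the paper's templates and $D_{\max}$ logic are more elaborate:

```python
import re
import statistics

# Hedged sketch of the aggregation stage: each independent grader emits a
# rigid machine-parseable line, the parser validates it deterministically,
# and a supervisor step aggregates while flagging large disagreement for
# manual review (the spread threshold plays the role of D_max).
SCORE_RE = re.compile(r"^SCORE:\s*(\d{1,3})$")

def parse_report(report):
    m = SCORE_RE.match(report.strip())
    if not m or not 0 <= int(m.group(1)) <= 100:
        raise ValueError(f"malformed grader report: {report!r}")
    return int(m.group(1))

def supervise(reports, d_max=40):
    scores = [parse_report(r) for r in reports]
    needs_review = max(scores) - min(scores) > d_max
    return statistics.median(scores), needs_review

print(supervise(["SCORE: 70", "SCORE: 65", "SCORE: 80"]))  # (70, False)
```

Rejecting any report that fails the regex is what makes the output auditable: a grader that free-texts its answer never silently contributes a score.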
Handwritten Text Recognition (HTR) is a well-established research area. In contrast, Handwritten Text Generation (HTG) is an emerging field with significant potential. This task is challenging due to the variation in individual handwriting styles. A large and diverse dataset is required to generate realistic handwritten text. However, such datasets are difficult to collect and are not readily available. Bengali is the fifth most spoken language in the world. While several studies exist for languages such as English and Arabic, Bengali handwritten text generation has received little attention. To address this gap, we propose a method for generating Bengali handwritten words. We developed and used a self-collected dataset of Bengali handwriting samples. The dataset includes contributions from approximately five hundred individuals across different ages and genders. All images were pre-processed to ensure consistency and quality. Our approach demonstrates the ability to produce diverse handwritten outputs from input plain text. We believe this work contributes to the advancement of Bengali handwriting generation and can support further research in this area.
https://arxiv.org/abs/2512.21694
Handwritten text recognition (HTR) and machine translation continue to pose significant challenges, particularly for low-resource languages like Marathi, which lack large digitized corpora and exhibit high variability in handwriting styles. The conventional approach to address this involves a two-stage pipeline: an OCR system extracts text from handwritten images, which is then translated into the target language using a machine translation model. In this work, we explore and compare the performance of traditional OCR-MT pipelines with Vision Large Language Models that aim to unify these stages and directly translate handwritten text images in a single, end-to-end step. Our motivation is grounded in the urgent need for scalable, accurate translation systems to digitize legal records such as FIRs, charge sheets, and witness statements in India's district and high courts. We evaluate both approaches on a curated dataset of handwritten Marathi legal documents, with the goal of enabling efficient legal document processing, even in low-resource environments. Our findings offer actionable insights toward building robust, edge-deployable solutions that enhance access to legal information for non-native speakers and legal professionals alike.
https://arxiv.org/abs/2512.18004
Handwritten Text Recognition remains challenging due to limited data, high variance in writing styles, and scripts with complex diacritics. Existing approaches, though they partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.
https://arxiv.org/abs/2512.05021
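The CTC weakness mentioned above stems from its frame-independent decoding; the standard greedy decode rule (collapse repeated labels, then drop blanks) makes this concrete, since no step can consult language context:

```python
# Greedy (best-path) CTC decoding sketch: collapse repeated labels, then
# remove blanks. This frame-independent rule is what the auxiliary
# textual-context module is meant to compensate for.
BLANK = 0

def ctc_greedy_decode(frame_labels):
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

# frames: h h _ e l l _ l o  ->  "hello" as label ids
print(ctc_greedy_decode([8, 8, 0, 5, 12, 12, 0, 12, 15]))  # [8, 5, 12, 12, 15]
```

Note how the blank between the two 12s is what allows a doubled letter to survive collapsing.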
Online handwriting generation (OHG) enhances handwriting recognition models by synthesizing diverse, human-like samples. However, existing OHG methods struggle to generate unseen characters, particularly in glyph-based languages like Chinese, limiting their real-world applicability. In this paper, we introduce our method for OHG, where the writer's style and the characters generated during testing are unseen during training. To tackle this challenge, we propose a Dual-branch Network with Adaptation (DNA), which comprises an adaptive style branch and an adaptive content branch. The style branch learns stroke attributes such as writing direction, spacing, placement, and flow to generate realistic handwriting. Meanwhile, the content branch is designed to generalize effectively to unseen characters by decomposing character content into structural information and texture details, extracted via local and global encoders, respectively. Extensive experiments demonstrate that our DNA model is well-suited for the unseen OHG setting, achieving state-of-the-art performance.
https://arxiv.org/abs/2511.22064
Learning safe and stable robot motions from demonstrations remains a challenge, especially in complex, nonlinear tasks involving dynamic, obstacle-rich environments. In this paper, we propose Safe and Stable Neural Network Dynamical Systems (S$^2$-NNDS), a learning-from-demonstration framework that simultaneously learns expressive neural dynamical systems alongside neural Lyapunov stability and barrier safety certificates. Unlike traditional approaches with restrictive polynomial parameterizations, S$^2$-NNDS leverages neural networks to capture complex robot motions, providing probabilistic guarantees through split conformal prediction in the learned certificates. Experimental results on various 2D and 3D datasets -- including LASA handwriting and demonstrations recorded kinesthetically from the Franka Emika Panda robot -- validate the effectiveness of S$^2$-NNDS in learning robust, safe, and stable motions from potentially unsafe demonstrations.
https://arxiv.org/abs/2511.20593
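Concretely, for a learned system $\dot{x} = f_\theta(x)$ with target $x^*$, the neural certificates are trained to satisfy the standard Lyapunov and barrier conditions, shown here in one common convention (the paper's exact formulation and its conformal-prediction relaxation may differ):

```latex
V_\phi(x^*) = 0, \qquad V_\phi(x) > 0 \ \ \forall x \neq x^*, \qquad
\nabla V_\phi(x)^\top f_\theta(x) < 0 \ \ \forall x \neq x^* \quad \text{(stability)}

B_\psi(x) \le 0 \ \ \forall x \in \mathcal{X}_0, \qquad
B_\psi(x) > 0 \ \ \forall x \in \mathcal{X}_u, \qquad
\nabla B_\psi(x)^\top f_\theta(x) \le 0 \ \ \text{on } \{x : B_\psi(x) = 0\} \quad \text{(safety)}
```

Here $\mathcal{X}_0$ is the set of initial states and $\mathcal{X}_u$ the unsafe (obstacle) set; the barrier condition prevents trajectories from crossing the zero level set into $\mathcal{X}_u$.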
Styled handwriting generation aims to synthesize handwritten text that looks both realistic and aligned with a specific writer's style. While recent approaches involving GAN, transformer and diffusion-based models have made progress, they often struggle to capture the full spectrum of writer-specific attributes, particularly global stylistic patterns that span long-range spatial dependencies. As a result, capturing subtle writer-specific traits such as consistent slant, curvature or stroke pressure, while keeping the generated text accurate is still an open problem. In this work, we present a unified framework designed to address these limitations. We introduce a Vision Transformer-based style encoder that learns global stylistic patterns from multiple reference images, allowing the model to better represent long-range structural characteristics of handwriting. We then integrate these style cues with the target text using a cross-attention mechanism, enabling the system to produce handwritten images that more faithfully reflect the intended style. To make the process more interpretable, we utilize Salient Stroke Attention Analysis (SSAA), which reveals the stroke-level features the model focuses on during style transfer. Together, these components lead to handwriting synthesis that is not only more stylistically coherent, but also easier to understand and analyze.
https://arxiv.org/abs/2511.18307
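The fusion step described above is ordinary cross-attention; a minimal single-head numpy sketch (dimensions illustrative) shows how target-text queries pool information from reference-style features:

```python
import numpy as np

# Minimal single-head cross-attention sketch: text-token queries attend
# over style features extracted from reference images, producing one
# style-conditioned vector per text token -- the same mechanism the
# framework uses to fuse style cues with the target text.
rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(7, d))    # queries: target text tokens
style = rng.normal(size=(12, d))  # keys/values: reference style features

def cross_attention(q, kv):
    scores = q @ kv.T / np.sqrt(q.shape[1])          # scaled dot products
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over style tokens
    return weights @ kv

out = cross_attention(text, style)
print(out.shape)  # (7, 16)
```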
Dynamical system (DS)-based learning from demonstration (LfD) is a powerful tool for generating motion plans in the operation ("task") space of robotic systems. However, the realization of the generated motion plans is often compromised by a "task-execution mismatch", where unmodeled dynamics, persistent disturbances, and system latency cause the robot's actual task-space state to diverge from the desired motion trajectory. We propose a novel task-level robust control architecture, L1-augmented Dynamical Systems (L1-DS), that explicitly handles the task-execution mismatch in tracking a nominal motion plan generated by any DS-based LfD scheme. Our framework augments any DS-based LfD model with a nominal stabilizing controller and an L1 adaptive controller. Furthermore, we introduce a windowed Dynamic Time Warping (DTW)-based target selector, which enables the nominal stabilizing controller to handle temporal misalignment for improved phase-consistent tracking. We demonstrate the efficacy of our architecture on the LASA and IROS handwriting datasets.
https://arxiv.org/abs/2511.09790
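The windowed target selector can be sketched with a nearest-neighbor stand-in for DTW (the paper's matching is more sophisticated): restricting the search to a forward window around the last matched index keeps tracking phase-consistent and prevents the target from jumping backward along the reference:

```python
# Simplified sketch of a windowed target selector: search the reference
# trajectory only inside a window ahead of the last matched index and
# return the index of the nearest reference point to the current state.
def select_target(reference, state, last_idx, window=5):
    lo, hi = last_idx, min(last_idx + window, len(reference) - 1)
    best = min(range(lo, hi + 1),
               key=lambda i: sum((r - s) ** 2
                                 for r, s in zip(reference[i], state)))
    return best

ref = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5), (3.0, 1.5), (4.0, 3.0)]
print(select_target(ref, state=(2.1, 0.6), last_idx=1))  # 2
```

Even if the robot is physically closest to an earlier reference point, the window forces the selected target forward, which is exactly the temporal-misalignment handling the abstract describes.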
In several computer vision applications, such as vehicle license plate recognition, captcha recognition, and printed or handwritten character recognition from images, text polarity detection and binarization are important preprocessing tasks. To analyze such an image, it must first be converted to a simple binary image, and this binarization process requires knowledge of the polarity of the text in the image. Text polarity is defined as the contrast of the text with respect to the background: either the text is darker than the background (dark text on a bright background) or vice versa. The binarization process uses this polarity information to convert the original colour or grayscale image into a binary image. In the literature, there is an intuitive approach based on a power-law transformation of the original images. In this approach, the authors illustrated an interesting phenomenon in the histogram statistics of the transformed images. Considering text and background as two classes, they observed that the maximum between-class variance increases (decreases) for dark (bright) text on a bright (dark) background, and they presented the corresponding empirical results. In this paper, we present a theoretical analysis of this phenomenon.
https://arxiv.org/abs/2511.07916
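The quantity at the center of this analysis is the maximum between-class variance (Otsu's criterion). A numpy sketch computes it for a synthetic dark-text-on-bright-background image before and after a power-law transform; the decision rule built on comparing the two values is the one from the referenced work:

```python
import numpy as np

# Maximum between-class variance (Otsu's criterion) over all histogram
# thresholds: sigma_b^2(t) = w0 * w1 * (mu0 - mu1)^2, maximized over t.
def max_between_class_variance(pixels, bins=64):
    hist, edges = np.histogram(pixels, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best = 0.0
    for t in range(1, bins):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:t] * centers[:t]).sum() / w0
        mu1 = (p[t:] * centers[t:]).sum() / w1
        best = max(best, w0 * w1 * (mu0 - mu1) ** 2)
    return best

rng = np.random.default_rng(0)
dark_text = rng.normal(0.20, 0.05, 500)    # dark strokes (minority class)
bright_bg = rng.normal(0.85, 0.05, 4500)   # bright background
img = np.clip(np.concatenate([dark_text, bright_bg]), 0, 1)
v_orig = max_between_class_variance(img)
v_gamma = max_between_class_variance(img ** 2.0)  # power-law transform
print(v_orig, v_gamma)
```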
This research explores the fusion of graphology and artificial intelligence to quantify psychological stress levels in students by analyzing their handwritten examination scripts. By leveraging Optical Character Recognition and transformer-based sentiment analysis models, we present a data-driven approach that transcends traditional grading systems, offering deeper insights into cognitive and emotional states during examinations. The system integrates high-resolution image processing, TrOCR, and sentiment entropy fusion using RoBERTa-based models to generate a numerical Stress Index. Our method achieves robustness through a five-model voting mechanism and unsupervised anomaly detection, making it an innovative framework in academic forensics.
https://arxiv.org/abs/2511.11633
Alzheimer's disease (AD) is a prevalent neurodegenerative disorder for which early detection is critical. Handwriting, often disrupted in prodromal AD, provides a non-invasive and cost-effective window into subtle motor and cognitive decline. Existing handwriting-based AD studies, mostly relying on online trajectories and hand-crafted features, have not systematically examined how task type influences diagnostic performance and cross-task generalization. Meanwhile, large-scale vision-language models have demonstrated remarkable zero- and few-shot anomaly detection in natural images and strong adaptability across medical modalities such as chest X-ray and brain MRI. However, handwriting-based disease detection remains largely unexplored within this paradigm. To close this gap, we introduce a lightweight Cross-Layer Fusion Adapter (CLFA) framework that repurposes CLIP for handwriting-based AD screening. CLFA implants multi-level fusion adapters within the visual encoder to progressively align representations toward handwriting-specific medical cues, enabling prompt-free and efficient zero-shot inference. Using this framework, we systematically investigate cross-task generalization (training on a specific handwriting task and evaluating on unseen ones) to reveal which task types and writing patterns most effectively discriminate AD. Extensive analyses further highlight characteristic stroke patterns and task-level factors that contribute to early AD identification, offering both diagnostic insights and a benchmark for handwriting-based cognitive assessment.
https://arxiv.org/abs/2511.05841
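A generic bottleneck adapter gives the flavor of the fusion adapters (CLFA's actual cross-layer fusion design may differ): a small trainable block inserted residually into a frozen encoder layer, so only the adapter weights adapt to the handwriting domain:

```python
import numpy as np

# Generic bottleneck-adapter sketch: down-project, nonlinearity,
# up-project, added residually to the frozen layer's features. Output
# shape matches the input, so the block slots between encoder layers.
rng = np.random.default_rng(0)
d, r = 32, 4                                  # feature dim, bottleneck dim
W_down = rng.normal(scale=0.1, size=(d, r))   # trainable
W_up = rng.normal(scale=0.1, size=(r, d))     # trainable

def adapter(h):
    z = np.maximum(h @ W_down, 0.0)           # down-project + ReLU
    return h + z @ W_up                       # residual connection

h = rng.normal(size=(10, d))                  # token features from a layer
print(adapter(h).shape)  # (10, 32)
```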
Generative robotic motion planning requires not only the synthesis of smooth and collision-free trajectories but also feasibility across diverse tasks and dynamic constraints. Prior planning methods, both traditional and generative, often struggle to incorporate high-level semantics with low-level constraints, especially the nexus between task configurations and motion controllability. In this work, we present XFlowMP, a task-conditioned generative motion planner that models robot trajectory evolution as entropic flows bridging stochastic noise and expert demonstrations via Schrödinger bridges, given the query task configuration. Specifically, our method leverages Schrödinger bridges as a conditional flow matching coupled with a score function to learn motion fields with high-order dynamics while encoding start-goal configurations, enabling the generation of collision-free and dynamically feasible motions. In evaluations, XFlowMP achieves up to 53.79% lower maximum mean discrepancy, 36.36% smoother motions, and 39.88% lower energy consumption compared to the next-best baseline on the RobotPointMass benchmark, while also reducing short-horizon planning time by 11.72%. On long-horizon motions in the LASA Handwriting dataset, our method produces trajectories with 1.26% lower maximum mean discrepancy, 3.96% smoother motions, and 31.97% lower energy. We further demonstrate the practicality of our method on the Kinova Gen3 manipulator, executing the planned motions and confirming its robustness in real-world settings.
https://arxiv.org/abs/2512.00022
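The bridge-based training described above rests on a conditional flow-matching objective, whose standard form is (symbols generic; the paper's exact parameterization, score coupling, and conditioning may differ):

```latex
\mathcal{L}_{\mathrm{CFM}}(\theta)
= \mathbb{E}_{t \sim \mathcal{U}[0,1],\; (x_0, x_1)}
\left[ \big\| v_\theta(x_t, t \mid c) - u_t(x_t \mid x_0, x_1) \big\|^2 \right]
```

where $x_0$ is drawn from the noise distribution, $x_1$ from the expert demonstrations, $x_t$ lies on the bridge between them, $u_t$ is the conditional vector field of that bridge, and $c$ encodes the start-goal task configuration conditioning the learned field $v_\theta$.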
This paper investigates various factors that influence the performance of end-to-end deep learning approaches for historical writer identification (HWI), a task that remains challenging due to the diversity of handwriting styles, document degradation, and the limited number of labelled samples per writer. These conditions often make accurate recognition difficult, even for human experts. Traditional HWI methods typically rely on handcrafted image processing and clustering techniques, which tend to perform well on small and carefully curated datasets. In contrast, end-to-end pipelines aim to automate the process by learning features directly from document images. However, our experiments show that many of these models struggle to generalise in more realistic, document-level settings, especially under zero-shot scenarios where writers in the test set are not present in the training data. We explore different combinations of pre-processing methods, backbone architectures, and post-processing strategies, including text segmentation, patch sampling, and feature aggregation. The results suggest that most configurations perform poorly due to weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise. Still, we identify one end-to-end setup that achieves results comparable to the top-performing system, despite using a simpler design. These findings point to key challenges in building robust end-to-end systems and offer insight into design choices that improve performance in historical document writer identification.
https://arxiv.org/abs/2510.18671
Automated signature verification is a critical biometric technique used in banking, identity authentication, and legal documentation. Despite the notable progress achieved by deep learning methods, most approaches in offline signature verification still struggle to generalize across datasets, as variations in handwriting styles and acquisition protocols often degrade performance. This study investigates feature learning strategies for signature forgery detection, focusing on improving cross-dataset generalization -- that is, model robustness when trained on one dataset and tested on another. Using three public benchmarks -- CEDAR, ICDAR, and GPDS Synthetic -- two experimental pipelines were developed: one based on raw signature images and another employing a preprocessing method referred to as shell preprocessing. Several behavioral patterns were identified and analyzed; however, no definitive superiority between the two approaches was established. The results show that the raw-image model achieved higher performance across benchmarks, while the shell-based model demonstrated promising potential for future refinement toward robust, cross-domain signature verification.
https://arxiv.org/abs/2510.17724
This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, 67% of which are handwritten and 1,566 are partially occluded. Each instance includes transcription, semantic category, text type, and occlusion status, with annotations available as rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring broader challenges in historical document analysis where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText's suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at this https URL.
https://arxiv.org/abs/2510.15557
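The 4-point polygon annotations can be consumed directly; a minimal sketch (annotation layout assumed for illustration, not taken from the dataset's schema) derives an axis-aligned crop box from a rotated word polygon:

```python
import numpy as np

# Sketch of using 4-point polygon annotations: compute the axis-aligned
# bounding box of a rotated word polygon, e.g. to crop a word image for
# a recognition model that expects rectangular inputs.
def polygon_to_bbox(polygon):
    pts = np.asarray(polygon)                 # shape (4, 2): x, y corners
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return int(x0), int(y0), int(x1), int(y1)

word_poly = [(110, 42), (188, 35), (192, 60), (114, 67)]  # rotated word
print(polygon_to_bbox(word_poly))  # (110, 35, 192, 67)
```

Keeping the original polygon alongside the derived box preserves the spatial precision the rotated annotations were designed for.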
Recognizing and processing Classical Chinese (Han-Nom) texts play a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. However, existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations common in ancient sources. In this work, we propose a fine-tuning approach for PaddleOCRv5 to improve character recognition on Han-Nom texts. We retrain the text recognition module using a curated subset of ancient Vietnamese Chinese manuscripts, supported by a full training pipeline covering preprocessing, LMDB conversion, evaluation, and visualization. Experimental results show a significant improvement over the base model, with exact accuracy increasing from 37.5 percent to 50.0 percent, particularly under noisy image conditions. Furthermore, we develop an interactive demo that visually compares pre- and post-fine-tuning recognition results, facilitating downstream applications such as Han-Vietnamese semantic alignment, machine translation, and historical linguistics research. The demo is available at this https URL.
https://arxiv.org/abs/2510.04003
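The reported metric is exact (string-level) accuracy; a small sketch makes the 37.5 percent to 50.0 percent comparison concrete (the example strings are illustrative, not from the dataset):

```python
# Exact-accuracy sketch: a predicted transcription counts only if it
# matches the ground-truth string exactly, character for character.
def exact_accuracy(preds, refs):
    assert len(preds) == len(refs)
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

preds = ["漢字", "南無", "文字", "古籍"]
refs  = ["漢字", "南无", "文字", "古籍"]
print(exact_accuracy(preds, refs))  # 0.75: one variant-character mismatch
```

This strictness is why exact accuracy is a demanding metric for Han-Nom text, where visually similar variant glyphs count as errors.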
Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency. Code will be made publicly available.
https://arxiv.org/abs/2509.23624
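The dual regularization of InkVAE plausibly amounts to a weighted sum of objectives; a schematic form (the weights $\beta$, $\lambda_{\mathrm{ocr}}$, $\lambda_{\mathrm{sty}}$ and the loss names are assumptions, not the paper's notation):

```latex
\mathcal{L}_{\mathrm{InkVAE}}
= \mathcal{L}_{\mathrm{rec}}
+ \beta \, \mathcal{L}_{\mathrm{KL}}
+ \lambda_{\mathrm{ocr}} \, \mathcal{L}_{\mathrm{OCR}}
+ \lambda_{\mathrm{sty}} \, \mathcal{L}_{\mathrm{style}}
```

where $\mathcal{L}_{\mathrm{rec}}$ and $\mathcal{L}_{\mathrm{KL}}$ are the usual VAE terms, $\mathcal{L}_{\mathrm{OCR}}$ enforces glyph-level accuracy, and $\mathcal{L}_{\mathrm{style}}$ is the style-classification loss; it is the two extra terms that structure the latent space so content and writer style disentangle.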