The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that encodes the ink both as a time-ordered sequence of strokes in text form and as an image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. Wide applicability is shown through results with two different VLM families, on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.
https://arxiv.org/abs/2402.15307
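To make the idea of writing a time-ordered stroke sequence as text more concrete, here is a minimal sketch. The abstract above does not specify the token format, so the `<stroke>` marker, the `x,y` point tokens, the `grid` parameter, and the function name `ink_to_text` are all hypothetical choices made for illustration, not the authors' representation.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]


def ink_to_text(strokes: List[Stroke], grid: int = 100) -> str:
    """Serialize time-ordered strokes as plain text (hypothetical format)."""
    points = [p for stroke in strokes for p in stroke]
    if not points:
        return ""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # Scale by the larger side so the aspect ratio of the ink is preserved.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    tokens = []
    for stroke in strokes:
        tokens.append("<stroke>")  # assumed marker for the start of a stroke
        for x, y in stroke:
            qx = round((x - min_x) / span * (grid - 1))
            qy = round((y - min_y) / span * (grid - 1))
            tokens.append(f"{qx},{qy}")
    return " ".join(tokens)


ink = [[(0.0, 0.0), (10.0, 0.0)], [(5.0, 5.0)]]
text = ink_to_text(ink, grid=11)
```

A string like this could be placed in a VLM prompt alongside a rendered image of the same ink; how the two are actually combined is described in the paper, not here.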
Federated learning (FL) is a machine learning paradigm where the clients possess decentralized training data and the central server handles aggregation and scheduling. Typically, FL algorithms involve clients training their local models using stochastic gradient descent (SGD), which carries drawbacks such as slow convergence and being prone to getting stuck in suboptimal solutions. In this work, we propose a message passing based Bayesian federated learning (BFL) framework to avoid these drawbacks. Specifically, we formulate the problem of deep neural network (DNN) learning and compression as a sparse Bayesian inference problem, in which a group sparse prior is employed to achieve structured model compression. Then, we propose an efficient BFL algorithm called EMTDAMP, where expectation maximization (EM) and turbo deep approximate message passing (TDAMP) are combined to achieve distributed learning and compression. The central server aggregates local posterior distributions to update global posterior distributions and updates hyperparameters based on EM to accelerate convergence. The clients perform TDAMP to achieve efficient approximate message passing over the DNN with a joint prior distribution. We detail the application of EMTDAMP to Boston housing price prediction and handwriting recognition, and present extensive numerical results to demonstrate the advantages of EMTDAMP.
https://arxiv.org/abs/2402.07366
Digital note-taking is gaining popularity, offering a durable, editable, and easily indexable way of storing notes in the vectorized form, known as digital ink. However, a substantial gap remains between this way of note-taking and traditional pen-and-paper note-taking, a practice still favored by a vast majority. Our work, InkSight, aims to bridge the gap by empowering physical note-takers to effortlessly convert their work (offline handwriting) to digital ink (online handwriting), a process we refer to as Derendering. Prior research on the topic has focused on the geometric properties of images, resulting in limited generalization beyond their training domains. Our approach combines reading and writing priors, allowing training a model in the absence of large amounts of paired samples, which are difficult to obtain. To our knowledge, this is the first work that effectively derenders handwritten text in arbitrary photos with diverse visual characteristics and backgrounds. Furthermore, it generalizes beyond its training domain into simple sketches. Our human evaluation reveals that 87% of the samples produced by our model on the challenging HierText dataset are considered as a valid tracing of the input image and 67% look like a pen trajectory traced by a human.
https://arxiv.org/abs/2402.05804
Developing an automatic signature verification system is challenging and demands a large number of training samples. This is why synthetic handwriting generation is an emerging topic in document image analysis. Some handwriting synthesizers use the motor equivalence model, a well-established hypothesis from neuroscience that describes how a human being accomplishes movement. Specifically, a motor equivalence model divides human actions into two steps: 1) the effector-independent step at the cognitive level and 2) the effector-dependent step at the motor level. In fact, recent work reports the successful application of a handwriting synthesizer based on this theory to Western scripts. This paper aims to adapt this scheme for the generation of synthetic signatures in two Indic scripts, Bengali (Bangla) and Devanagari (Hindi). For this purpose, we use two different online and offline databases for both Bengali and Devanagari signatures. This paper reports an effective synthesizer for static and dynamic signatures written in Devanagari or Bengali scripts. We obtain promising results with artificially generated signatures in terms of appearance and performance when we compare the results with those for real signatures.
https://arxiv.org/abs/2401.17026
Signature synthesis is a computational technique that generates artificial specimens which can support decision making in automatic signature verification. A lot of work has been dedicated to this subject, which centres on synthesizing dynamic and static two-dimensional handwriting on canvas. This paper proposes a framework to generate synthetic 3D on-air signatures exploiting the lognormality principle, which mimics the complex neuromotor control processes at play as the fingertip moves. Addressing the usual cases involving the development of artificial individuals and duplicated samples, this paper contributes to the synthesis of: (1) the trajectory and velocity of entirely new 3D signatures; (2) kinematic information when only the 3D trajectory of the signature is known; and (3) duplicate samples of real 3D signatures. Validation was conducted by generating synthetic 3D signature databases mimicking real ones and showing that automatic signature verifications of genuine and skilled forgeries report performances similar to those of real and synthetic databases. We also observed that training 3D automatic signature verifiers with duplicates can reduce errors. We further demonstrated that our proposal is also valid for synthesizing 3D air writing and gestures. Finally, a perception test confirmed the human likeness of the generated specimens. The databases generated are publicly available, only for research purposes, at .
https://arxiv.org/abs/2401.16329
The increasing prevalence of Autism Spectrum Disorder (ASD) and Attention-Deficit/Hyperactivity Disorder (ADHD) among students highlights the need to improve evaluation and diagnostic techniques, as well as effective tools to mitigate the negative consequences associated with these disorders. With the widespread use of touchscreen mobile devices, there is an opportunity to gather comprehensive data beyond visual cues. These devices enable the collection and visualization of information on velocity profiles and the time taken to complete drawing and handwriting tasks. These data can be leveraged to develop new neuropsychological tests based on the velocity profile that assist in distinguishing between challenging cases of ASD and ADHD that are difficult to differentiate in clinical practice. In this paper, we present a proof of concept that compares and combines the results obtained from standardized tasks in the NEPSY-II assessment with a proposed observational scale based on the visual analysis of the velocity profile collected using digital tablets.
https://arxiv.org/abs/2401.15685
Human movement studies and analyses have been fundamental in many scientific domains, ranging from neuroscience to education, pattern recognition to robotics, health care to sports, and beyond. Previous speech motor models were proposed to understand how speech movement is produced and how the resulting speech varies when some parameters are changed. However, the inverse approach, in which the muscular response parameters and the subject's age are derived from real continuous speech, is not possible with such models. Instead, in the handwriting field, the kinematic theory of rapid human movements and its associated Sigma-lognormal model have been applied successfully to obtain the muscular response parameters. This work presents a speech kinematics based model that can be used to study, analyze, and reconstruct complex speech kinematics in a simplified manner. A method based on the kinematic theory of rapid human movements and its associated Sigma-lognormal model is applied to describe and to parameterize the asymptotic impulse response of the neuromuscular networks involved in speech as a response to a neuromotor command. The method used to carry out transformations from formants to a movement observation is also presented. Experiments carried out with the (English) VTR TIMIT database and the (German) Saarbrucken Voice Database, including people of different ages, with and without laryngeal pathologies, corroborate the link between the extracted parameters and aging, on the one hand, and the proportion between the first and second formants required in applying the kinematic theory of rapid human movements, on the other. The results should drive innovative developments in the modeling and understanding of speech kinematics.
https://arxiv.org/abs/2401.17320
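The kinematic theory mentioned in the abstract above describes the speed of a single rapid movement with a lognormal profile. The sketch below evaluates that standard lognormal form for one stroke. The parameter names (`D` for amplitude, `t0` for onset time, `mu` and `sigma` for log time delay and log response time) follow common usage, and the numeric values are illustrative assumptions rather than values from the paper. The full Sigma-lognormal model sums several such components vectorially, which this sketch does not attempt.

```python
import math


def lognormal_speed(t: float, D: float, t0: float, mu: float, sigma: float) -> float:
    """Speed of one lognormal stroke at time t (zero before the onset t0)."""
    if t <= t0:
        return 0.0
    dt = t - t0
    norm = D / (sigma * math.sqrt(2.0 * math.pi) * dt)
    return norm * math.exp(-((math.log(dt) - mu) ** 2) / (2.0 * sigma ** 2))


# Illustrative parameters (assumed, not taken from the paper).
D, t0, mu, sigma = 2.0, 0.1, -1.5, 0.3

# The profile peaks at t0 + exp(mu - sigma^2), the mode of the lognormal.
t_peak = t0 + math.exp(mu - sigma ** 2)

# Integrating the speed over time recovers the amplitude D (simple Riemann sum).
step = 0.001
distance = sum(lognormal_speed(t0 + i * step, D, t0, mu, sigma) * step for i in range(1, 3000))
```

Fitting these parameters to an observed velocity signal is the inverse problem the abstract refers to; it requires an optimizer and is outside the scope of this sketch.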
New methods for generating synthetic handwriting images for biometric applications have recently been developed. The temporal evolution of handwriting from childhood to adulthood is usually left unexplored in these works. This paper proposes a novel methodology for including temporal evolution in a handwriting synthesizer by means of simplifying the text trajectory plan and handwriting dynamics. This is achieved through a tailored version of the kinematic theory of rapid human movements and the neuromotor inspired handwriting synthesizer. The realism of the proposed method has been evaluated by comparing the temporal evolution of real and synthetic samples both quantitatively and subjectively. The quantitative test is based on a visual perception algorithm that compares the letter variability and the number of strokes in the real and synthetic handwriting produced at different ages. In the subjective test, 30 people are asked to evaluate the perceived realism of the evolution of the synthetic handwriting.
https://arxiv.org/abs/2401.15472
Forensic handwriting examination is a branch of forensic science that aims to examine handwritten documents in order to properly define or hypothesize the manuscript's author. This analysis involves comparing two or more (digitized) documents through a comprehensive comparison of intrinsic local and global features. If a correlation exists and specific best practices are satisfied, then it will be possible to affirm that the documents under analysis were written by the same individual. The need to create sophisticated tools capable of extracting and comparing significant features has led to the development of cutting-edge software with almost entirely automated processes, improving the forensic examination of handwriting and achieving increasingly objective evaluations. This is made possible by algorithmic solutions based on purely mathematical concepts. Machine Learning and Deep Learning models trained with specific datasets could turn out to be the key elements to best solve the task at hand. In this paper, we propose a new and challenging dataset consisting of two subsets: the first consists of 21 documents either written with the classic "pen and paper" approach (and later digitized) or directly acquired on common devices such as tablets; the second consists of 362 handwritten manuscripts by 124 different people, acquired following a specific pipeline. Our study pioneers a comparison between traditionally handwritten documents and those produced with digital tools (e.g., tablets). Preliminary results on the proposed datasets show that 90% classification accuracy can be achieved on the first subset (documents written with pen and paper and later digitized, and documents written on tablets) and 96% on the second subset. The datasets are available at this https URL.
https://arxiv.org/abs/2401.04448
Handwritten signature verification poses a formidable challenge in biometrics and document authenticity. The objective is to ascertain the authenticity of a provided handwritten signature, distinguishing between genuine and forged ones. This issue has many applications in sectors such as finance, legal documentation, and security. Currently, the field of computer vision and machine learning has made significant progress in the domain of handwritten signature verification. The outcomes, however, may be enhanced depending on the acquired findings, the structure of the datasets, and the models used. Four stages make up our suggested strategy. First, we collected a large dataset of 12600 images from 420 distinct individuals, and each individual has 30 signatures of a certain kind (all authors' signatures are genuine). In the subsequent stage, the best features from each image were extracted using a deep learning model named MobileNetV2. During the feature selection step, three selectors, neighborhood component analysis (NCA), Chi2, and mutual information (MI), were used to select 200, 300, 400, and 500 features, giving a total of 12 feature vectors. Finally, 12 results have been obtained by applying machine learning techniques such as SVM with kernels (rbf, poly, and linear), KNN, DT, Linear Discriminant Analysis, and Naive Bayes. Without employing feature selection techniques, our suggested offline signature verification achieved a classification accuracy of 91.3%, whereas using the NCA feature selection approach with just 300 features it achieved a classification accuracy of 97.7%. High classification accuracy was achieved using the designed and suggested model, which also has the benefit of being a self-organized framework. Consequently, using the optimum minimally chosen features, the proposed method could identify the best model performance and result validation prediction vectors.
https://arxiv.org/abs/2401.09467
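The count of 12 feature vectors in the abstract above follows from pairing each of the three selectors with each of the four feature-set sizes. The sketch below enumerates those pairs and shows a generic "keep the k best-scoring features" step. The helper names `feature_configs` and `select_top_k` and the example scores are hypothetical. How each selector actually scores features (NCA, Chi2, MI) is not reproduced here.

```python
from itertools import product
from typing import List, Sequence, Tuple

SELECTORS = ("NCA", "Chi2", "MI")  # selectors named in the abstract
SIZES = (200, 300, 400, 500)       # feature-set sizes named in the abstract


def feature_configs() -> List[Tuple[str, int]]:
    """Every (selector, size) pair: 3 selectors x 4 sizes = 12 feature vectors."""
    return list(product(SELECTORS, SIZES))


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring features, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


configs = feature_configs()
kept = select_top_k([0.1, 0.9, 0.5, 0.7], k=2)  # toy scores, for illustration only
```

In a full pipeline, the selected column indices would be applied to the MobileNetV2 feature matrix before training each classifier.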
In this work, we propose MetaScript, a novel Chinese content generation system designed to address the diminishing presence of personal handwriting styles in the digital representation of Chinese characters. Our approach harnesses the power of few-shot learning to generate Chinese characters that not only retain the individual's unique handwriting style but also maintain the efficiency of digital typing. Trained on a diverse dataset of handwritten styles, MetaScript is adept at producing high-quality stylistic imitations from minimal style references and standard fonts. Our work demonstrates a practical solution to the challenges of digital typography in preserving the personal touch in written communication, particularly in the context of Chinese script. Notably, our system has demonstrated superior performance in various evaluations, including recognition accuracy, inception score, and Frechet inception distance. At the same time, the training conditions of our model are easy to meet and facilitate generalization to real applications.
https://arxiv.org/abs/2312.16251
Handwriting recognition is a key technology for accessing the content of old manuscripts, helping to preserve cultural heritage. Deep learning shows an impressive performance in solving this task. However, to achieve its full potential, it requires a large amount of labeled data, which is difficult to obtain for ancient languages and scripts. Often, a trade-off has to be made between ground truth quantity and quality, as is the case for the recently introduced Bullinger database. It contains an impressive amount of over a hundred thousand labeled text line images of mostly premodern German and Latin texts that were obtained by automatically aligning existing page-level transcriptions with text line images. However, the alignment process introduces systematic errors, such as wrongly hyphenated words. In this paper, we investigate the impact of such errors on training and evaluation and suggest means to detect and correct typical alignment errors.
https://arxiv.org/abs/2312.09037
In recent years, there has been a boom in the use of deep learning for handwriting analysis and recognition. One main application for handwriting analysis is early detection and diagnosis in the health field. Unfortunately, most real case problems still suffer from a scarcity of data, which makes the use of deep learning-based models difficult. To alleviate this problem, some works resort to synthetic data generation. Lately, more works are directed towards guided synthetic data generation, which uses domain and data knowledge to generate realistic data that can be useful to train deep learning models. In this work, we combine the domain knowledge about Alzheimer's disease for handwriting and use it for a more guided data generation. Concretely, we have explored the use of in-air movements for synthetic data generation.
https://arxiv.org/abs/2312.05086
The challenge of teaching robots to perform dexterous manipulation, dynamic locomotion, or whole-body manipulation from a small number of demonstrations is an important research field that has attracted interest from across the robotics community. In this work, we propose a novel approach by joining the theories of Koopman Operators and Dynamic Movement Primitives to Learning from Demonstration. Our approach, named \gls{admd}, projects nonlinear dynamical systems into linear latent spaces such that a solution reproduces the desired complex motion. Use of an autoencoder in our approach enables generalizability and scalability, while the constraint to a linear system attains interpretability. Our results are comparable to the Extended Dynamic Mode Decomposition on the LASA Handwriting dataset but with training on only a small fraction of the letters.
https://arxiv.org/abs/2312.03328
As text generative models can give increasingly long answers, we tackle the problem of synthesizing long text in digital ink. We show that the commonly used models for this task fail to generalize to long-form data and how this problem can be solved by augmenting the training data, changing the model architecture and the inference procedure. These methods use a contrastive learning technique and are tailored specifically for the handwriting domain. They can be applied to any encoder-decoder model that works with digital ink. We demonstrate that our method reduces the character error rate on long-form English data by half compared to a baseline RNN and by 16% compared to the previous approach that aims at addressing the same problem. We show that all three parts of the method improve recognizability of generated inks. In addition, we evaluate synthesized data in a human study and find that people perceive most of the generated data as real.
https://arxiv.org/abs/2311.17786
Recent advancements in Optical Character Recognition (OCR) have been driven by transformer-based models. OCR systems are critical in numerous high-stakes domains, yet their vulnerability to adversarial attack remains largely uncharted territory, raising concerns about security and compliance with emerging AI regulations. In this work we present a novel framework to assess the resilience of Transformer-based OCR (TrOCR) models. We develop and assess algorithms for both targeted and untargeted attacks. For the untargeted case, we measure the Character Error Rate (CER), while for the targeted case we use the success ratio. We find that TrOCR is highly vulnerable to untargeted attacks and somewhat less vulnerable to targeted attacks. On a benchmark handwriting data set, untargeted attacks can cause a CER of more than 1 without being noticeable to the eye. With a similar perturbation size, targeted attacks can lead to success rates of around $25\%$ -- here we attacked single tokens, requiring TrOCR to output the tenth most likely token from a large vocabulary.
https://arxiv.org/abs/2311.17128
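The claim in the abstract above that an untargeted attack can push the CER above 1 may look odd at first. The character error rate is commonly computed as the edit distance between the reference and the recognized text divided by the reference length, so a long, unrelated output can cost more edits than the reference has characters. The sketch below uses that common definition; the paper's exact evaluation code is not shown in the abstract, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


clean = cer("cat", "cat")        # perfect recognition
attacked = cer("ab", "xyzxyz")   # output longer than, and unrelated to, the reference
```

Here `attacked` exceeds 1 because six edits are needed to turn a two-character reference into the six-character output.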
This paper introduces a novel approach to post-Optical Character Recognition Correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research methodologies. This gap is due to the lack of large text corpora that provide OCR errors for further training of language-based POC models, which are demanding in terms of corpus size. Our study primarily focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves. Such an engine generates highly realistic handwritten text in any amount, which we utilize to create a substantial dataset by transforming Russian text corpora sourced from the internet. We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training. The correction model is trained on a 90-symbol input context, utilizing a pre-trained T5 architecture with a seq2seq correction task. We evaluate our approach on the HWR200 and School_notebooks_RU datasets, as they provide significant challenges in the HTR domain. Furthermore, POC can be used to highlight errors for teachers evaluating student performance. This can be done simply by comparing sentences before and after correction and displaying the differences in text. Our primary contribution lies in the innovative use of Bézier curves for Cyrillic text generation and subsequent error correction using a specialized POC model. We validate our approach by presenting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without post-OCR correction, using real open corpora of handwritten Cyrillic text. These results, coupled with our methodology, are designed to be reproducible, paving the way for further advancements in the field of OCR and handwritten text analysis. Paper contributions can be found in this https URL
https://arxiv.org/abs/2311.15896
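The generation engine in the abstract above is built on Bézier curves. As a rough illustration of the underlying primitive, the sketch below evaluates a Bézier curve with de Casteljau's algorithm and samples points along it. The control points and helper names are assumptions for illustration. How the authors assemble curves into Cyrillic glyphs is not described in the abstract and is not reproduced here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def lerp(p: Point, q: Point, t: float) -> Point:
    """Linear interpolation between two points."""
    return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)


def bezier_point(control: Sequence[Point], t: float) -> Point:
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    pts = list(control)
    while len(pts) > 1:
        pts = [lerp(pts[i], pts[i + 1], t) for i in range(len(pts) - 1)]
    return pts[0]


def sample_bezier(control: Sequence[Point], n: int = 20) -> List[Point]:
    """Sample n evenly spaced parameter values along the curve (n >= 2)."""
    return [bezier_point(control, i / (n - 1)) for i in range(n)]


# A cubic arch with four illustrative control points.
arch = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
mid = bezier_point(arch, 0.5)
trace = sample_bezier(arch, n=5)
```

Perturbing the control points slightly between samples is one plausible way to vary a synthetic pen stroke, though that is a guess at the general idea rather than a description of the paper's engine.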
The use of artificial intelligence technology in education is growing rapidly, with increasing attention being paid to handwritten mathematical expression recognition (HMER) by researchers. However, many existing methods for HMER may fail to accurately read formulas with complex structures, as the attention results can be inaccurate due to illegible handwriting or large variations in writing styles. Our proposed Intelligent-Detection Network (IDN) for HMER differs from traditional encoder-decoder methods by utilizing object detection techniques. Specifically, we have developed an enhanced YOLOv7 network that can accurately detect both digit and symbol objects. The detection results are then integrated into the bidirectional gated recurrent unit (BiGRU) and the baseline symbol relationship tree (BSRT) to determine the relationships between symbols and numbers. The experiments demonstrate that the proposed method outperforms encoder-decoder networks in recognizing complex handwritten mathematical expressions. This is due to the precise detection of symbols and numbers. Our research has the potential to make valuable contributions to the field of HMER. It could be applied in various practical scenarios, such as assignment grading in schools and information entry of paper documents.
https://arxiv.org/abs/2311.15273
Background and objectives: Dynamic handwriting analysis, due to its non-invasive and readily accessible nature, has recently emerged as a vital adjunctive method for the early diagnosis of Parkinson's disease. In this study, we design a compact and efficient network architecture to analyse the distinctive handwriting patterns of patients' dynamic handwriting signals, thereby providing an objective identification for the Parkinson's disease diagnosis. Methods: The proposed network is based on a hybrid deep learning approach that fully leverages the advantages of both long short-term memory (LSTM) and convolutional neural networks (CNNs). Specifically, the LSTM block is adopted to extract the time-varying features, while the CNN-based block is implemented using one-dimensional convolution for low computational cost. Moreover, the hybrid model architecture is continuously refined under ablation studies for superior performance. Finally, we evaluate the proposed method with its generalization under a five-fold cross-validation, which validates its efficiency and robustness. Results: The proposed network demonstrates its versatility by achieving impressive classification accuracies on both our new DraWritePD dataset ($96.2\%$) and the well-established PaHaW dataset ($90.7\%$). Moreover, the network architecture also stands out for its excellent lightweight design, occupying a mere $0.084$M of parameters, with a total of only $0.59$M floating-point operations. It also exhibits near real-time CPU inference performance, with inference times ranging from $0.106$ to $0.220$s. Conclusions: We present a series of experiments with extensive analysis, which systematically demonstrate the effectiveness and efficiency of the proposed hybrid neural network in extracting distinctive handwriting patterns for precise diagnosis of Parkinson's disease.
https://arxiv.org/abs/2311.11756
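The five-fold cross-validation mentioned in the abstract above can be sketched as a plain index split: the samples are divided into five contiguous folds, and each fold serves once as the test set while the rest form the training set. The abstract does not say whether the split is stratified or grouped by subject, so this unshuffled version and the name `k_fold_indices` are assumptions for illustration only.

```python
from typing import List, Tuple


def k_fold_indices(n: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k contiguous folds; return (train, test) pairs."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    base, extra = divmod(n, k)
    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(12, k=5)
```

With handwriting data from patients, a subject-wise split (all samples from one person in the same fold) is often preferable to avoid leakage; that would require grouping indices by subject before splitting.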
Writing assistance is an application closely related to human life and is also a fundamental Natural Language Processing (NLP) research field. Its aim is to improve the correctness and quality of input texts, with character checking being crucial in detecting and correcting wrong characters. From the perspective of the real world where handwriting occupies the vast majority, characters that humans get wrong include faked characters (i.e., untrue characters created due to writing errors) and misspelled characters (i.e., true characters used incorrectly due to spelling errors). However, existing datasets and related studies only focus on misspelled characters mainly caused by phonological or visual confusion, thereby ignoring faked characters which are more common and difficult. To break through this dilemma, we present Visual-C$^3$, a human-annotated Visual Chinese Character Checking dataset with faked and misspelled Chinese characters. To the best of our knowledge, Visual-C$^3$ is the first real-world visual and the largest human-crafted dataset for the Chinese character checking scenario. Additionally, we also propose and evaluate novel baseline methods on Visual-C$^3$. Extensive empirical results and analyses show that Visual-C$^3$ is high-quality yet challenging. The Visual-C$^3$ dataset and the baseline methods will be publicly available to facilitate further research in the community.
https://arxiv.org/abs/2311.11268