A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features

2025-06-25 08:58:47
Ayush Lodh, Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal

Abstract

We posit that handwriting recognition benefits from complementary cues carried by the rasterized complex glyph and the pen's trajectory, yet most systems exploit only one modality. We introduce an end-to-end network that performs early fusion of offline images and online stroke data within a shared latent space. A patch encoder converts the grayscale crop into fixed-length visual tokens, while a lightweight transformer embeds the $(x, y, \text{pen})$ sequence. Learnable latent queries attend jointly to both token streams, yielding context-enhanced stroke embeddings that are pooled and decoded under a cross-entropy loss objective. Because integration occurs before any high-level classification, temporal cues reinforce each other during representation learning, producing stronger writer independence. Comprehensive experiments on IAMOn-DB and VNOn-DB demonstrate that our approach achieves state-of-the-art accuracy, exceeding previous bests by up to 1\%. Our study also shows adaptation of this pipeline with gesturification on the ISI-Air dataset. Our code can be found here.
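
The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes single-head dot-product attention without separate key/value projections, random matrices standing in for learned weights, a plain linear projection in place of the lightweight stroke transformer, and mean pooling. All names (`patchify`, `cross_attend`, `fuse`) and sizes are hypothetical.

```python
import math
import random

def linear(rng, d_in, d_out):
    """Random d_in x d_out matrix standing in for a learned projection (assumption)."""
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.gauss(0.0, scale) for _ in range(d_out)] for _ in range(d_in)]

def project(tokens, w):
    """Multiply each token (length d_in) by the matrix w (d_in x d_out)."""
    return [[sum(t[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
            for t in tokens]

def patchify(image, p):
    """Split a grayscale image (list of rows) into flattened p x p patches."""
    h, w = len(image), len(image[0])
    return [[image[r + dr][c + dc] for dr in range(p) for dc in range(p)]
            for r in range(0, h, p) for c in range(0, w, p)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def cross_attend(queries, tokens):
    """Each latent query returns a softmax-weighted average over all tokens."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(d) for t in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[k] * tokens[k][j] for k in range(len(tokens)))
                        for j in range(d)])
    return outputs, weights

def fuse(image, strokes, d=8, n_latents=4, n_classes=5, patch=2, seed=0):
    rng = random.Random(seed)
    visual = project(patchify(image, patch), linear(rng, patch * patch, d))
    trajectory = project(strokes, linear(rng, 3, d))  # (x, y, pen) -> d dims
    tokens = visual + trajectory  # early fusion: one joint token set
    latents = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_latents)]
    fused, weights = cross_attend(latents, tokens)
    pooled = [sum(f[j] for f in fused) / n_latents for j in range(d)]  # mean pool
    logits = project([pooled], linear(rng, d, n_classes))[0]
    return softmax(logits), weights

# Toy inputs: a 4 x 4 grayscale crop and three (x, y, pen) samples.
image = [[0.0, 0.5, 1.0, 0.5],
         [0.2, 0.8, 0.9, 0.1],
         [0.0, 0.3, 0.7, 0.4],
         [0.1, 0.6, 0.2, 0.0]]
strokes = [(0.0, 0.0, 1), (0.1, 0.2, 1), (0.3, 0.4, 0)]

probs, weights = fuse(image, strokes)
target = 2  # hypothetical ground-truth class index
loss = -math.log(probs[target])  # cross-entropy for a single example
```

Each row of `weights` spans the four visual tokens and the three stroke tokens together, so a single latent query can draw on both modalities before any class prediction is made. That is the sense in which the fusion is "early".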

URL

https://arxiv.org/abs/2506.20255

PDF

https://arxiv.org/pdf/2506.20255.pdf

