Paper Reading AI Learner

Urdu Digital Text Word Optical Character Recognition Using Permuted Auto Regressive Sequence Modeling

2024-08-27 14:58:13
Ahmed Mustafa, Ijlal Baig, Hasan Sajid

Abstract

This research paper introduces a word-level Optical Character Recognition (OCR) model designed specifically for digital Urdu text. Built on a transformer-based architecture with attention mechanisms, the model was trained on a dataset of approximately 160,000 Urdu text images and achieves a character error rate (CER) of 0.178, reflecting high accuracy in recognizing Urdu characters. The model's strength lies in its architecture, which incorporates the Permuted Autoregressive Sequence (PARSeq) model: by leveraging bidirectional context, it supports context-aware inference and iterative refinement of its predictions. Its ability to handle a diverse range of Urdu text styles, fonts, and variations further broadens its applicability to real-world scenarios. Despite these promising results, the model has limitations: it struggles with blurred images, non-horizontal orientations, and overlays of patterns, lines, or other text, which can lead to suboptimal performance, and trailing punctuation marks can introduce noise into the recognition process. Addressing these challenges is the focus of future work, which aims to refine the model further, explore data augmentation techniques, optimize hyperparameters, and integrate contextual improvements for more accurate and efficient Urdu text recognition.
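To make the permuted autoregressive sequence modeling idea concrete, the sketch below builds the kind of attention mask that permutation language modeling uses during training: for a sampled factorization order, each target position may only condition on the positions decoded before it in that order, which is what lets a single decoder exploit both left and right context. This is a simplified, hypothetical illustration of the general PARSeq mechanism, not the paper's implementation (the actual model additionally maintains separate content and query masks and samples several permutations per batch).

```python
import numpy as np

def permuted_ar_mask(T: int, perm) -> np.ndarray:
    """Attention mask for one factorization order of a T-token target.

    mask[i, j] == 1 means that, when predicting the token at position i,
    the decoder may condition on the ground-truth token at position j.
    With perm = [0, 1, ..., T-1] this reduces to the usual left-to-right
    causal mask; other permutations expose right-side context as well.
    """
    mask = np.zeros((T, T), dtype=int)
    for t, target in enumerate(perm):
        for s in range(t):                 # positions decoded earlier in this order
            mask[target, perm[s]] = 1
    return mask

T = 5
rng = np.random.default_rng(0)
print(permuted_ar_mask(T, range(T)))            # standard left-to-right AR decoding
print(permuted_ar_mask(T, rng.permutation(T)))  # one randomly sampled order
```

The reported CER of 0.178 is the character-level edit distance between the prediction and the ground truth divided by the length of the ground truth, averaged over the test set. A minimal, self-contained sketch for a single word pair (the Urdu strings below are illustrative, not taken from the paper's data):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / max(m, 1)

print(cer("اردو", "اردو"))  # 0.0  (exact match)
print(cer("اردو", "آردو"))  # 0.25 (one substituted character out of four)
```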

URL

https://arxiv.org/abs/2408.15119

PDF

https://arxiv.org/pdf/2408.15119.pdf

