Paper Reading AI Learner

A Transformer-based Approach for Arabic Offline Handwritten Text Recognition

2023-07-27 17:51:52
Saleh Momeni, Bagher BabaAli

Abstract

Handwriting recognition is a challenging and critical problem in the fields of pattern recognition and machine learning, with applications spanning a wide range of domains. In this paper, we focus on the specific issue of recognizing offline Arabic handwritten text. Existing approaches typically utilize a combination of convolutional neural networks for image feature extraction and recurrent neural networks for temporal modeling, with connectionist temporal classification used for text generation. However, these methods suffer from a lack of parallelization due to the sequential nature of recurrent neural networks. Furthermore, these models cannot account for linguistic rules, necessitating the use of an external language model in the post-processing stage to boost accuracy. To overcome these issues, we introduce two alternative architectures, namely the Transformer Transducer and the standard sequence-to-sequence Transformer, and compare their performance in terms of accuracy and speed. Our approach can model language dependencies and relies only on the attention mechanism, thereby making it more parallelizable and less complex. We employ pre-trained Transformers for both image understanding and language modeling. Our evaluation on the Arabic KHATT dataset demonstrates that our proposed method outperforms the current state-of-the-art approaches for recognizing offline Arabic handwritten text.
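The abstract's parallelization claim comes down to the attention mechanism itself: every output position attends to all input positions through a single batch of matrix multiplications, with no recurrent state carried step by step. A minimal NumPy sketch of scaled dot-product attention (the shapes and names here are illustrative, not the paper's code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v).
    All query positions are computed at once via matrix
    multiplication -- there is no sequential recurrence,
    which is what makes Transformers more parallelizable
    than RNN-based recognizers.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (seq_q, d_v)

# Toy example: 3 decoder positions attending over 4 image-feature frames.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8)
```

In the full models described above, this operation appears both as self-attention (modeling language dependencies among output characters) and as cross-attention from the text decoder to the image encoder's features.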


URL

https://arxiv.org/abs/2307.15045

PDF

https://arxiv.org/pdf/2307.15045.pdf

