Paper Reading AI Learner

IDPL-PFOD2: A New Large-Scale Dataset for Printed Farsi Optical Character Recognition

2023-12-02 16:56:57
Fatemeh Asadi-zeydabadi, Ali Afkari-Fahandari, Amin Faraji, Elham Shabaninia, Hossein Nezamabadi-pour

Abstract

Optical Character Recognition (OCR) is a technique that converts document images into searchable and editable text, making it a valuable tool for processing scanned documents. Although Farsi is a prominent official language in Asia, efforts to develop efficient methods for recognizing printed Farsi text have been relatively limited. This is primarily attributed to the language's distinctive features, such as its cursive script, the resemblance between certain alphabet characters, and the presence of numerous diacritics and dot placements. At the same time, given the substantial training-sample requirements of deep learning architectures, developing such datasets holds paramount significance. In light of these concerns, this paper presents a novel large-scale dataset, IDPL-PFOD2, tailored for printed Farsi text recognition. The dataset comprises 2,003,541 images featuring a wide variety of fonts, styles, and sizes, and extends the previously introduced IDPL-PFOD dataset with a substantial increase in both volume and diversity. Furthermore, the dataset's effectiveness is assessed using both a CRNN-based and a Vision Transformer architecture. The CRNN-based model achieves a baseline accuracy of 78.49% and a normalized edit distance of 97.72%, while the Vision Transformer architecture attains an accuracy of 81.32% and a normalized edit distance of 98.74%.
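
The abstract reports two baseline metrics: accuracy and normalized edit distance. The exact formula behind the latter is not spelled out here, so the Python sketch below assumes one common definition: 1 minus the Levenshtein distance between the predicted and ground-truth strings, divided by the length of the longer string, expressed as a percentage.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, target: str) -> float:
    """Similarity in [0, 100]; 100.0 means an exact match (assumed definition)."""
    if not pred and not target:
        return 100.0
    dist = levenshtein(pred, target)
    return 100.0 * (1.0 - dist / max(len(pred), len(target)))

# Example with Farsi strings, since IDPL-PFOD2 labels are Farsi text:
print(normalized_edit_distance("کتاب", "کتاب"))  # 100.0 (exact match)
print(normalized_edit_distance("کتاب", "کتب"))   # 75.0 (one deletion out of four characters)

Under this assumed definition, the reported values of 97.72% and 98.74% would mean the predictions differ from the ground truth by only a few characters per hundred on average.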

URL

https://arxiv.org/abs/2312.01177

PDF

https://arxiv.org/pdf/2312.01177.pdf