Abstract
Optical Character Recognition (OCR) converts document images into searchable, editable text, making it a valuable tool for processing scanned documents. Although Farsi is a prominent and official language in Asia, efforts to develop efficient methods for recognizing printed Farsi text have been relatively limited. This is primarily attributed to the language's distinctive features, such as its cursive form, the resemblance between certain letters of the alphabet, and the presence of numerous diacritics and dots whose placement distinguishes characters. In addition, because deep learning architectures require large numbers of training samples to perform well, the development of suitable datasets is of great importance. In light of these concerns, this paper presents a novel large-scale dataset, IDPL-PFOD2, tailored for Farsi printed text recognition. The dataset comprises 2,003,541 images featuring a wide variety of fonts, styles, and sizes. It extends the previously introduced IDPL-PFOD dataset, offering a substantial increase in both volume and diversity. The dataset's effectiveness is assessed using both CRNN-based and Vision Transformer architectures. The CRNN-based model achieves a baseline accuracy of 78.49% and a normalized edit distance of 97.72%, while the Vision Transformer architecture attains an accuracy of 81.32% and a normalized edit distance of 98.74%.
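The abstract does not define its two metrics. As a minimal sketch, the code below assumes that "accuracy" means exact match of the whole predicted string and that the "normalized edit distance" figure is a similarity score, one minus the Levenshtein distance divided by the longer string's length, averaged over samples. This is a convention used in some text recognition benchmarks, not necessarily the paper's definition. The function names `levenshtein` and `evaluate` are hypothetical.

```python
from typing import Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def evaluate(predictions: Sequence[str],
             references: Sequence[str]) -> Tuple[float, float]:
    """Return (exact-match accuracy, mean edit similarity), both in percent.

    Assumption: edit similarity = 1 - distance / max(len(pred), len(ref)).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one sample is required")

    exact_matches = 0
    similarity_sum = 0.0
    for pred, ref in zip(predictions, references):
        exact_matches += int(pred == ref)
        longest = max(len(pred), len(ref))
        if longest == 0:
            similarity_sum += 1.0  # two empty strings are identical
        else:
            similarity_sum += 1.0 - levenshtein(pred, ref) / longest

    count = len(references)
    return 100.0 * exact_matches / count, 100.0 * similarity_sum / count
```

Under these assumptions, a prediction with a single wrong character counts as a complete miss for accuracy but loses only a small fraction of edit similarity, which is consistent with the reported edit-distance figures being much higher than the accuracy figures.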
URL
https://arxiv.org/abs/2312.01177