Abstract
In this paper, we create benchmarks for OCR (Optical Character Recognition) error correction on Japanese vouchers and assess the effectiveness of error correction methods. Automated processing depends on correctly recognizing scanned voucher text, such as the company name on an invoice. However, perfect recognition is difficult because of noise such as stamps, so correcting erroneous OCR results is crucial. No publicly available OCR error correction benchmark exists for Japanese, and correction methods have not been adequately studied. In this study, we measure the text recognition accuracy of existing services on Japanese vouchers and develop a post-OCR correction benchmark. We then propose simple baselines for error correction using language models and verify whether they can effectively correct these errors. In the experiments, the proposed error correction algorithm significantly improved overall recognition accuracy.
URL
https://arxiv.org/abs/2409.19948