Abstract
Precise character segmentation is essential for higher Optical Character Recognition (OCR) accuracy. In cursive script, overlapped characters are a serious issue for character segmentation, because conventional linear segmentation strategies deprive characters of their discriminative parts. Hence, non-linear segmentation is needed to avoid losing character parts and to enhance character/script recognition accuracy. This paper presents an improved approach for non-linear segmentation of overlapped characters in handwritten Roman script. The proposed technique consists of a sequence of heuristic rules, based on geometrical features of characters, that locate possible non-linear character boundaries in a cursive script word. To enhance efficiency, the heuristic approach is integrated with a trained ensemble neural network validation strategy that verifies the character boundaries. Accordingly, correct boundaries are retained and incorrect ones are removed based on the ensemble neural networks' vote. Finally, characters are segmented non-linearly at the verified segmentation points. For fair comparison, experiments are conducted on the CEDAR benchmark database. The experimental results are much better than those of conventional linear character segmentation techniques reported in the state of the art. The ensemble neural network plays a vital role in enhancing character segmentation accuracy compared with individual neural networks.
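The validation step described above, where an ensemble vote decides which candidate boundaries to keep, can be illustrated with a minimal sketch. The abstract does not specify the voting rule, the features, or the network architectures, so everything here is an assumption: the strict-majority rule, the two-element feature vectors, and the names `majority_vote` and `filter_boundaries` are hypothetical, and simple threshold functions stand in for the trained neural networks.

```python
from typing import Callable, Dict, List, Sequence

# A validator is any callable mapping a candidate boundary's feature vector
# to True (accept) or False (reject). In the paper these would be trained
# neural networks; here they are toy threshold rules (an assumption).
Validator = Callable[[Sequence[float]], bool]


def majority_vote(validators: List[Validator], features: Sequence[float]) -> bool:
    """Return True when a strict majority of validators accept the candidate."""
    votes = sum(1 for validate in validators if validate(features))
    return votes * 2 > len(validators)


def filter_boundaries(candidates: List[Dict], validators: List[Validator]) -> List[Dict]:
    """Keep only the candidate boundaries that the ensemble votes valid."""
    return [c for c in candidates if majority_vote(validators, c["features"])]


# Hypothetical stand-ins for three trained networks.
validators: List[Validator] = [
    lambda f: f[0] < 0.3,
    lambda f: f[1] > 0.5,
    lambda f: f[0] + f[1] < 1.2,
]

# Hypothetical candidates produced by the heuristic stage: a column position
# "x" plus an illustrative two-element feature vector.
candidates = [
    {"x": 12, "features": [0.1, 0.9]},
    {"x": 20, "features": [0.8, 0.2]},
    {"x": 31, "features": [0.2, 0.4]},
]

kept = filter_boundaries(candidates, validators)
print([c["x"] for c in kept])
```

With these toy inputs, the first candidate receives three votes, the second one vote, and the third two votes, so the first and third are retained. A real system would replace the threshold rules with the trained networks and might weight the votes differently.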
URL
https://arxiv.org/abs/1904.12592