Abstract
Offline handwriting recognition (HWR) has improved significantly in recent years with the advent of deep learning architectures. Nevertheless, it remains a challenging problem, and practical applications often rely on post-processing techniques that restrict the predicted words via lexicons or language models. Although such restrictions improve performance, they make these systems less usable in contexts where out-of-vocabulary words are anticipated, e.g., for detecting misspelled words in school assessments. To that end, we introduce the task of comparing a handwriting image to text. To solve the problem, we propose an unrestricted binary classifier consisting of an HWR feature extractor and a multimodal classification head that convolves the feature extractor output with the vector representation of the input text. Our model's classification head is trained entirely on synthetic data created using a state-of-the-art generative adversarial network. We demonstrate that, while maintaining high recall, the classifier can be calibrated to achieve an average precision increase of 19.5% compared to addressing the task by directly using state-of-the-art HWR models. Such performance gains can lead to significant productivity increases in applications utilizing human-in-the-loop automation.
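The abstract does not spell out how the classifier is calibrated. As a rough illustration only, and not the authors' procedure, one common way to trade precision against a recall target is to sweep the decision threshold over held-out scores and keep the threshold with the best precision among those that still meet the recall target. The function names (`precision_recall`, `calibrate_threshold`) and the example scores below are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when predicting 'match' for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # Convention (an assumption): an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def calibrate_threshold(
    scores: Sequence[float], labels: Sequence[int], min_recall: float
) -> Tuple[Optional[float], float]:
    """Among thresholds with recall >= min_recall, return the one with the
    highest precision, together with that precision.

    Candidate thresholds are the distinct observed scores. Returns
    (None, -1.0) if no candidate reaches the recall target.
    """
    best_threshold: Optional[float] = None
    best_precision = -1.0
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if r >= min_recall and p > best_precision:
            best_threshold, best_precision = t, p
    return best_threshold, best_precision


# Made-up validation data: classifier scores for (image, text) pairs,
# with label 1 when the handwriting image really matches the text.
example_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.10]
example_labels = [1, 1, 0, 1, 0, 1, 0]
threshold, precision = calibrate_threshold(example_scores, example_labels, 0.75)
```

In a human-in-the-loop setting, pairs scoring below the chosen threshold would be routed to a reviewer, so a higher precision at fixed recall means fewer incorrect matches are accepted automatically.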
URL
https://arxiv.org/abs/2309.10158