
Offline Detection of Misspelled Handwritten Words by Convolving Recognition Model Features with Text Labels


Abstract

Offline handwriting recognition (HWR) has improved significantly with the advent of deep learning architectures in recent years. Nevertheless, it remains a challenging problem, and practical applications often rely on post-processing techniques that restrict the predicted words via lexicons or language models. Despite their enhanced performance, such systems are less usable in contexts where out-of-vocabulary words are anticipated, e.g., for detecting misspelled words in school assessments. To that end, we introduce the task of comparing a handwriting image to text. To solve the problem, we propose an unrestricted binary classifier, consisting of an HWR feature extractor and a multimodal classification head which convolves the feature extractor output with the vector representation of the input text. Our model's classification head is trained entirely on synthetic data created using a state-of-the-art generative adversarial network. We demonstrate that, while maintaining high recall, the classifier can be calibrated to achieve an average precision increase of 19.5% compared to addressing the task by directly using state-of-the-art HWR models. Gains of this size can lead to significant productivity increases in applications utilizing human-in-the-loop automation.
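
The calibration step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `calibrate_threshold`, the `min_recall` parameter, and the toy scores and labels are all assumptions made for illustration. Given classifier scores and binary labels, it picks the decision threshold with the best precision among those that still meet a recall target.

```python
def calibrate_threshold(scores, labels, min_recall=0.95):
    """Pick the threshold with the best precision subject to recall >= min_recall.

    Hypothetical helper, not taken from the paper.
    scores: classifier outputs, higher means "more likely positive".
    labels: 1 for positive examples, 0 for negative examples.
    Returns (threshold, precision, recall).
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")

    best = None  # (precision, threshold, recall)
    for t in sorted(set(scores)):
        # Predict positive whenever the score reaches the candidate threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        recall = tp / total_pos
        if recall < min_recall:
            continue  # this threshold misses too many positives
        precision = tp / (tp + fp)
        if best is None or precision > best[0]:
            best = (precision, t, recall)

    precision, threshold, recall = best
    return threshold, precision, recall


# Toy validation data (made up for this example).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(calibrate_threshold(scores, labels, min_recall=1.0))
```

Scanning every distinct score as a candidate threshold is quadratic in the number of examples, which is fine for a small validation set; a sorted cumulative count would be the usual choice for larger ones.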

URL

https://arxiv.org/abs/2309.10158

PDF

https://arxiv.org/pdf/2309.10158.pdf

