Abstract
For the 2021 bachelor project of Professor Lippert's research group, handwritten entries in historical patient records needed to be digitized using Optical Character Recognition (OCR) methods. Because the digitized data will be used in future work, a high degree of accuracy is required, and this matters even more in the medical field. Ensemble learning is a method that combines several machine learning models and is claimed to increase the accuracy of existing methods. For this reason, this work investigates ensemble learning in combination with OCR in order to add value to the digitization of the patient records. The results show that ensemble learning can increase OCR accuracy, identify which methods achieved this, and indicate that the size of the training data set did not play a role.
Abstract (translated)
2021年Lippert教授研究小组的学士项目中,需要使用光学字符识别(OCR)方法将历史患者的记录数字化。由于这些数据将来会被使用,因此对准确性有很高的要求,特别是在医疗领域更是如此。集成学习是一种结合多个机器学习模型的方法,并且据称能够提高现有方法的精度。为此,在本工作中探讨了集成学习与OCR相结合的可能性,以期为患者记录的数字化创造更多价值。研究发现,集成学习确实可以提高OCR的准确性,并确定了哪些方法能够实现这一点,同时发现训练数据集的大小在此过程中并不起决定性作用。
URL
https://arxiv.org/abs/2509.16221