Abstract
This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks: scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually rich documents. The evaluation shows that GPT-4V performs well at recognizing and understanding Latin-script content, but struggles with multilingual scenarios and complex tasks. Based on these observations, we examine whether specialized OCR models remain necessary and discuss strategies for fully harnessing pretrained general-purpose LMMs such as GPT-4V in downstream OCR tasks. The study offers a critical reference for future research on OCR with LMMs. The evaluation pipeline and results are available at this https URL.
URL
https://arxiv.org/abs/2310.16809