Abstract
A substantial volume of document data currently exists in unstructured formats such as Portable Document Format (PDF) files and images. Extracting information from these documents is difficult because of diverse table styles, complex forms, and multiple languages. Several open-source toolkits, such as Camelot, pdfplumber, and PaddlePaddle Structure V2 (PP-StructureV2), have been developed to facilitate table extraction from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can extract tables only from digital PDFs and cannot handle image-based PDFs or pictures. PP-StructureV2, on the other hand, can extract tables from image-based PDFs and from pictures, but it cannot differentiate between application scenarios: wired (bordered) versus wireless (borderless) tables, or digital versus image-based PDFs. To address these issues, we introduce the PDF table extraction (PdfTable) toolkit. This toolkit integrates numerous open-source models, including seven table recognition models, four optical character recognition (OCR) tools, and three layout analysis models. By refining the PDF table extraction process, PdfTable adapts to a range of application scenarios. We verify the efficacy of the PdfTable toolkit on a self-labeled wired table dataset and on the open-source wireless table recognition dataset PubTabNet. The PdfTable code will be available on GitHub: this https URL.
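The scenario distinction described in the abstract (digital versus image-based PDFs, wired versus wireless tables) can be illustrated with a minimal routing sketch. This is not PdfTable's actual API: the `PageInfo` class, the `choose_pipeline` function, and the returned labels are all hypothetical names introduced here for illustration, and the sketch assumes that the two properties of a page are already known.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PageInfo:
    """Hypothetical description of one page (not part of PdfTable)."""
    has_text_layer: bool    # True for a digital PDF, False for a scan or picture
    has_ruling_lines: bool  # True for a wired (bordered) table


def choose_pipeline(page: PageInfo) -> Tuple[str, str]:
    """Pick a text source and a table model family for a page.

    The returned labels are placeholders, not real model names.
    """
    # Digital PDFs already carry text; image-based inputs need OCR.
    text_source = "pdf_text_layer" if page.has_text_layer else "ocr"
    # Bordered and borderless tables are routed to different model families.
    table_model = "wired_table_model" if page.has_ruling_lines else "wireless_table_model"
    return text_source, table_model


if __name__ == "__main__":
    scanned_borderless = PageInfo(has_text_layer=False, has_ruling_lines=False)
    print(choose_pipeline(scanned_borderless))
```

Keeping the two decisions independent means each of the four combinations (digital or image-based, wired or wireless) gets its own pairing of text source and table model, which is the kind of adaptability the abstract attributes to the toolkit.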
URL
https://arxiv.org/abs/2409.05125