PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction

2024-09-08 15:08:51
Lei Sheng, Shuai-Shuai Xu

Abstract

Currently, a substantial volume of document data exists in unstructured form, including Portable Document Format (PDF) files and images. Extracting information from these documents is challenging because of diverse table styles, complex forms, and the presence of multiple languages. Several open-source toolkits, such as Camelot, pdfplumber, and PaddlePaddle Structure V2 (PP-StructureV2), have been developed to facilitate table extraction from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can extract tables only from digital PDFs and cannot handle image-based PDFs or pictures. PP-StructureV2, on the other hand, can extract tables from both image-based PDFs and pictures, but it cannot differentiate between application scenarios, such as wired (bordered) versus wireless (borderless) tables or digital versus image-based PDFs. To address these issues, we introduce the PDF table extraction (PdfTable) toolkit. This toolkit integrates numerous open-source models, including seven table recognition models, four optical character recognition (OCR) tools, and three layout analysis models. By refining the PDF table extraction process, PdfTable adapts to various application scenarios. We verify the efficacy of the PdfTable toolkit on a self-labeled wired table dataset and on the open-source wireless public table recognition dataset PubTabNet. The PdfTable code will be available on GitHub: this https URL.
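The scenario distinction described in the abstract (digital versus image-based input, wired versus wireless tables) can be pictured as a small routing step. The following is only a minimal sketch under stated assumptions, not PdfTable's actual API: `PageInfo`, `choose_route`, and the returned labels are hypothetical names introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PageInfo:
    """Hypothetical summary of one input page (not a PdfTable type)."""

    has_text_layer: bool    # True for digital PDFs, False for scans and pictures
    has_ruling_lines: bool  # True when cell borders are drawn (a "wired" table)


def choose_route(page: PageInfo) -> tuple[str, str]:
    """Return illustrative (text_source, table_model_family) labels for a page."""
    # Digital PDFs already carry text; image-based input needs OCR.
    text_source = "pdf_text_layer" if page.has_text_layer else "ocr"
    # Wired and wireless tables are sent to different model families.
    table_model = "wired_table_model" if page.has_ruling_lines else "wireless_table_model"
    return text_source, table_model


if __name__ == "__main__":
    scanned_borderless = PageInfo(has_text_layer=False, has_ruling_lines=False)
    print(choose_route(scanned_borderless))
```

How the two flags would be detected in practice (for example, by checking for an embedded text layer or by line detection) is left open here; the abstract does not specify it.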

URL

https://arxiv.org/abs/2409.05125

PDF

https://arxiv.org/pdf/2409.05125.pdf

