DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents

Abstract

Document semantic segmentation is a promising avenue that can facilitate document analysis tasks, including optical character recognition (OCR), form classification, and document editing. Although several synthetic datasets have been developed to distinguish handwriting from printed text, they fall short in class variety and document diversity. We demonstrate the limitations of training on existing datasets when evaluating on the National Archives Form Semantic Segmentation (NAFSS) dataset, which we introduce. To address these limitations, we propose the most comprehensive document semantic segmentation synthesis pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources to create the Document Element Layer INtegration Ensemble 8K (DELINE8K) dataset. Models trained on our customized dataset achieve superior performance on the NAFSS benchmark, making it a promising tool for further research. The DELINE8K dataset is available at this https URL.

URL

https://arxiv.org/abs/2404.19259

PDF

https://arxiv.org/pdf/2404.19259.pdf
