Page Stream Segmentation with Convolutional Neural Nets Combining Textual and Visual Features

Abstract

In recent years, (retro-)digitizing paper-based files has become a major undertaking for private and public archives as well as an important task in electronic mailroom applications. As a first step, the workflow involves scanning documents and applying Optical Character Recognition (OCR). A major requirement is preserving the document context of single-page scans, that is, which pages belong to the same document. To facilitate workflows involving very large volumes of paper scans, page stream segmentation (PSS) is the task of automatically separating a stream of scanned page images into multi-page documents. In a digitization project with a German federal archive, we developed a novel approach based on convolutional neural networks (CNNs) that combines image and text features to achieve optimal document separation results. Our evaluation shows that the PSS architecture achieves an accuracy of up to 93%, which can be regarded as a new state of the art for this task.
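
To make the task concrete, PSS can be framed as a per-page binary decision: does this page start a new document, or does it continue the previous one? The sketch below is illustrative only and is not the authors' implementation. It assumes some classifier (for example, a CNN over image and text features) has already produced one boolean flag per page. The names `segment_stream` and `page_accuracy` are hypothetical, and page-level accuracy is only one plausible reading of the metric reported in the abstract.

```python
from typing import List, Sequence, TypeVar

Page = TypeVar("Page")


def segment_stream(pages: Sequence[Page],
                   starts_new_doc: Sequence[bool]) -> List[List[Page]]:
    """Group a flat page stream into documents.

    starts_new_doc[i] is True when page i is predicted to be the first
    page of a document (assumed to come from an upstream classifier).
    """
    if len(pages) != len(starts_new_doc):
        raise ValueError("need exactly one flag per page")

    documents: List[List[Page]] = []
    for page, is_start in zip(pages, starts_new_doc):
        # The very first page always opens a document, whatever its flag.
        if is_start or not documents:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def page_accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Fraction of pages whose predicted flag matches the true flag
    (an assumed definition of accuracy, not taken from the paper)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


if __name__ == "__main__":
    stream = ["a1", "a2", "b1", "c1", "c2"]
    truth = [True, False, True, True, False]
    guess = [True, False, False, True, False]  # misses the start of "b"
    print(segment_stream(stream, guess))
    print(page_accuracy(guess, truth))
```

Under this framing, a single wrong flag merges two documents or splits one in two, which is why per-page classification quality directly determines the quality of the resulting document separation.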

URL

https://arxiv.org/abs/1710.03006

PDF

https://arxiv.org/pdf/1710.03006.pdf