Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images

Abstract

Text segmentation, the task of dividing a document into sections, is often a prerequisite for performing additional natural language processing tasks. Existing text segmentation methods have typically been developed and tested using clean, narrative-style text with segments containing distinct topics. Here we consider a challenging text segmentation task: dividing newspaper marriage announcement lists into units of one announcement each. In many cases the information is not structured into sentences, and adjacent segments are not topically distinct from each other. In addition, the text of the announcements, which is derived from images of historical newspapers via optical character recognition, contains many typographical errors. As a result, these announcements are not amenable to segmentation with existing techniques. We present a novel deep learning-based model for segmenting such text and show that it significantly outperforms an existing state-of-the-art method on our task.

URL

https://arxiv.org/abs/2312.12773

PDF

https://arxiv.org/pdf/2312.12773.pdf
