Abstract
The sentence is a fundamental unit in many NLP applications. Sentence segmentation is widely used as the first preprocessing step, in which an input text is split into consecutive sentences, with end-of-sentence (EOS) positions serving as the boundaries. This formulation relies on the strong assumption that the input text consists only of sentences, which we call sentential units (SUs). However, real-world texts often contain non-sentential units (NSUs), such as metadata, sentence fragments, and nonlinguistic markers, that it is unreasonable or undesirable to treat as part of an SU. To tackle this issue, we formulate a novel task, sentence identification, in which the goal is to identify the SUs in a given text while excluding the NSUs. For this task, we propose a simple yet effective method that combines beginning-of-sentence (BOS) and EOS labels and uses dynamic programming to determine the most probable SUs and NSUs. To evaluate the task, we design an automatic, language-independent procedure that converts the Universal Dependencies corpora into sentence identification benchmarks. Finally, our experiments on sentence identification show that the proposed method generally outperforms sentence segmentation baselines that use only EOS labels.
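The abstract does not spell out how the BOS and EOS labels are combined, so the following is only a minimal sketch of one plausible formulation, not the authors' implementation. It assumes a tagger has already produced, for every token, a probability that the token begins an SU (`p_bos`) and a probability that it ends one (`p_eos`). It then picks the set of non-overlapping spans that maximizes the summed log-odds of the chosen boundaries. The function name `identify_sentences`, the log-odds scoring, and the toy probabilities are all assumptions made for illustration.

```python
import math


def identify_sentences(p_bos, p_eos):
    """Return inclusive (start, end) token spans of the highest-scoring SUs.

    Tokens not covered by any returned span are treated as NSUs.
    Scoring (an assumption of this sketch): each chosen start/end token
    contributes its log-odds; the total is maximized over all valid sets
    of non-overlapping spans.
    """
    n = len(p_bos)
    assert n == len(p_eos)

    def logit(p, eps=1e-9):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        return math.log(p) - math.log(1.0 - p)

    gain_b = [logit(p) for p in p_bos]
    gain_e = [logit(p) for p in p_eos]

    neg = float("-inf")
    out = [0.0] + [neg] * n   # out[k]: best score for tokens[:k], no SU open
    inn = [neg] * (n + 1)     # inn[k]: best score for tokens[:k], SU still open
    back_out = [None] * (n + 1)
    back_inn = [None] * (n + 1)

    for k in range(n):
        out[k + 1], back_out[k + 1] = max(
            (out[k], "skip"),                            # token k is an NSU token
            (out[k] + gain_b[k] + gain_e[k], "single"),  # one-token SU
            (inn[k] + gain_e[k], "close"),               # token k ends the open SU
            key=lambda c: c[0],
        )
        inn[k + 1], back_inn[k + 1] = max(
            (out[k] + gain_b[k], "open"),                # token k starts an SU
            (inn[k], "continue"),                        # token k is inside an SU
            key=lambda c: c[0],
        )

    # Backtrack from the final "no SU open" state to recover the spans.
    spans, k, inside, end = [], n, False, None
    while k > 0:
        if not inside:
            move = back_out[k]
            if move == "single":
                spans.append((k - 1, k - 1))
            elif move == "close":
                end, inside = k - 1, True
        elif back_inn[k] == "open":
            spans.append((k - 1, end))
            inside = False
        k -= 1
    spans.reverse()
    return spans


# Toy input: "[1]" and "--" stand in for NSUs around two sentences.
tokens = ["[1]", "Hello", "world", ".", "--", "Bye", "."]
p_bos = [0.10, 0.90, 0.05, 0.05, 0.10, 0.90, 0.05]
p_eos = [0.10, 0.05, 0.05, 0.90, 0.10, 0.05, 0.90]
print(identify_sentences(p_bos, p_eos))  # [(1, 3), (5, 6)]
```

In this toy input, a splitter that cuts only at EOS positions would attach "[1]" and "--" to the following sentences, whereas using both boundary types leaves them outside every span.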
URL
https://arxiv.org/abs/2301.13352