Abstract
Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, given the complex and varied sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
URL
https://arxiv.org/abs/2305.01211