MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

2023-05-02 05:52:03
Tobias Brugger, Matthias Stürmer, Joel Niklaus

Abstract

Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, given the complex and varied sentence structures used there. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
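
To make concrete why legal text is hard for simple SBD approaches, here is a minimal sketch of a naive punctuation-based splitter. It is not one of the paper's models or baselines. The function name `naive_split` and the example sentence are hypothetical, chosen only to show how legal abbreviations such as "Art." and "para." cause false sentence boundaries.

```python
import re
from typing import List


def naive_split(text: str) -> List[str]:
    """Split wherever '.', '!' or '?' is followed by whitespace.

    This is a deliberately simplistic rule, used here only for
    illustration; it is not a method from the paper.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical example text (not taken from the dataset).
# It contains two real sentences.
text = (
    "The court relied on Art. 5 para. 2 of the statute. "
    "The appeal was dismissed."
)

sentences = naive_split(text)
for s in sentences:
    print(repr(s))
```

The splitter returns four fragments for two actual sentences, because every abbreviation period followed by a space is treated as a boundary. Learned sequence-labelling models such as those named in the abstract are one way to avoid hand-maintaining abbreviation lists for each language.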

URL

https://arxiv.org/abs/2305.01211

PDF

https://arxiv.org/pdf/2305.01211.pdf

