Abstract
Data analysis and machine learning are central to the legal domain, particularly for tasks such as clustering and text classification. In this study, we used natural language processing tools to augment datasets curated by experts, which significantly improved a machine learning workflow for classifying legal texts. We took the Sustainable Development Goals (SDGs) of the United Nations 2030 Agenda as a practical case study. A clustering-based data augmentation strategy substantially improved the accuracy and sensitivity of the classification models. For certain SDGs within the 2030 Agenda, we observed performance gains of over 15%, and in some cases the example base grew by a factor of 5. When unclassified legal texts are available, clustering-based data augmentation is highly effective: it expands the existing knowledge base without labor-intensive manual classification.
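The abstract does not spell out the clustering algorithm, the text features, or the acceptance rule, so the following is only a minimal sketch of what clustering-based augmentation could look like. It assumes bag-of-words vectors, a k-means loop with one cluster per label seeded from the expert-labeled examples, and a purity threshold for propagating labels. The names `augment` and `min_purity` and the toy texts are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def vectorize(text, vocab):
    """Map a text to an L2-normalised bag-of-words vector over vocab."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest(vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
    return dists.index(min(dists))


def augment(labeled, unlabeled, min_purity=0.8, iters=10):
    """Return (text, label) pairs for unlabeled texts that land in a
    cluster dominated by a single expert label."""
    texts = [t for t, _ in labeled] + list(unlabeled)
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vecs = [vectorize(t, vocab) for t in texts]
    labels = sorted({lab for _, lab in labeled})

    # Assumption: one cluster per label, seeded with that label's mean vector.
    # zip stops at len(labeled), so only labeled vectors are used here.
    centroids = [
        mean([v for v, (_, lab) in zip(vecs, labeled) if lab == target])
        for target in labels
    ]
    for _ in range(iters):
        assign = [nearest(v, centroids) for v in vecs]
        for k in range(len(centroids)):
            members = [v for v, a in zip(vecs, assign) if a == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = mean(members)

    assign = [nearest(v, centroids) for v in vecs]
    new_examples = []
    for k in range(len(centroids)):
        votes = Counter(lab for (_, lab), a in zip(labeled, assign) if a == k)
        if not votes:
            continue  # no expert label in this cluster: nothing to propagate
        label, count = votes.most_common(1)[0]
        if count / sum(votes.values()) < min_purity:
            continue  # mixed cluster: skip rather than risk a wrong label
        for text, a in zip(unlabeled, assign[len(labeled):]):
            if a == k:
                new_examples.append((text, label))
    return new_examples


# Hypothetical toy data: two SDG labels and two unclassified texts.
labeled = [
    ("clean water access and sanitation services", "SDG6"),
    ("water quality and sanitation standards", "SDG6"),
    ("renewable energy and solar power generation", "SDG7"),
    ("affordable energy and electricity grid access", "SDG7"),
]
unlabeled = [
    "regulation of drinking water sanitation",
    "subsidies for solar energy power plants",
]
new_examples = augment(labeled, unlabeled)
for text, label in new_examples:
    print(label, "<-", text)
```

On this toy input each unclassified text joins the cluster of the goal whose vocabulary it shares, and the pairs in `new_examples` could then be appended to the training set of a classifier. Raising `min_purity` makes the augmentation more conservative, at the cost of adding fewer examples.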
URL
https://arxiv.org/abs/2404.08683