Text clustering applied to data augmentation in legal contexts

2024-04-08 16:18:33
Lucas José Gonçalves Freitas, Thaís Rodrigues, Guilherme Rodrigues, Pamella Edokawa, Ariane Farias

Abstract

Data analysis and machine learning are of preeminent importance in the legal domain, especially in tasks like clustering and text classification. In this study, we harnessed the power of natural language processing tools to enhance datasets meticulously curated by experts. This process significantly improved the classification workflow for legal texts using machine learning techniques. We considered the Sustainable Development Goals (SDGs) data from the United Nations 2030 Agenda as a practical case study. A clustering-based data augmentation strategy led to remarkable enhancements in the accuracy and sensitivity metrics of classification models. For certain SDGs within the 2030 Agenda, we observed performance gains of over 15%. In some cases, the example base expanded by a noteworthy factor of 5. When dealing with unclassified legal texts, data augmentation strategies centered around clustering prove to be highly effective. They provide a valuable means to expand the existing knowledge base without the need for labor-intensive manual classification efforts.
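
The abstract does not say which clustering algorithm or text features the authors used, so the following is only a minimal sketch of the general cluster-then-label idea, not the paper's pipeline. It assumes bag-of-words vectors, cosine similarity, and single-linkage clustering as stand-ins; the function names, the similarity threshold, the purity rule, and the toy texts are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document (lowercased word tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold):
    """Single-linkage clustering: link documents whose similarity is at
    least `threshold`, then return connected components as index lists."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def augment(labeled, unlabeled, threshold=0.5, min_purity=1.0):
    """Return (text, label) pairs for unlabeled texts that share a cluster
    with expert-labeled texts agreeing on one label (hypothetical rule)."""
    texts = [t for t, _ in labeled] + list(unlabeled)
    labels = [l for _, l in labeled] + [None] * len(unlabeled)
    vectors = [vectorize(t) for t in texts]
    new_examples = []
    for members in cluster(vectors, threshold):
        known = Counter(labels[i] for i in members if labels[i] is not None)
        if not known:
            continue  # no expert label in this cluster, nothing to propagate
        label, count = known.most_common(1)[0]
        if count / sum(known.values()) < min_purity:
            continue  # labeled members disagree too much
        new_examples += [(texts[i], label) for i in members if labels[i] is None]
    return new_examples


# Toy data (invented for illustration, not from the paper).
labeled = [
    ("public health services and hospital funding", "SDG3"),
    ("school education and teacher training", "SDG4"),
]
unlabeled = [
    "funding for hospital health services",
    "teacher training in public school education",
    "tax dispute over imported vehicles",
]
new_examples = augment(labeled, unlabeled)
```

Texts that land in a cluster with no expert-labeled member (the tax dispute here) are left unlabeled, which is the conservative choice when the new examples will be fed back into a classifier's training set.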

URL

https://arxiv.org/abs/2404.08683

PDF

https://arxiv.org/pdf/2404.08683.pdf

