Advantages of Domain Knowledge Injection for Legal Document Summarization: A Case Study on Summarizing Indian Court Judgments in English and Hindi

2026-02-07 05:55:18
Debtanu Datta, Rajdeep Mukherjee, Adrijit Goswami, Saptarshi Ghosh

Abstract

Summarizing Indian legal court judgments is a complex task, not only because of the intricate language and unstructured nature of the legal texts, but also because a large section of the Indian population does not understand the complex English in which legal text is written, thus requiring summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text to generate summaries in both English and Hindi (the most widely spoken Indian language), by injecting domain knowledge into diverse summarization models. We propose a framework to enhance extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts. Further, we explore the injection of legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. Furthermore, these improvements are validated by domain experts, demonstrating the effectiveness of our approaches.

URL

https://arxiv.org/abs/2602.07382

PDF

https://arxiv.org/pdf/2602.07382.pdf

