Abstract
Summarizing Indian court judgments is a complex task, not only because of the intricate language and unstructured nature of legal texts, but also because a large section of the Indian population does not understand the complex English in which legal text is written, which creates a need for summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text, generating summaries in both English and Hindi (the most widely spoken Indian language), by injecting domain knowledge into diverse summarization models. We propose a framework that enhances extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts. Further, we explore the injection of legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. These improvements are further validated by domain experts, demonstrating the effectiveness of our approaches.
URL
https://arxiv.org/abs/2602.07382