Abstract
This paper examines how to make large language models reliable for high-stakes legal work by reducing hallucinations. It distinguishes three AI paradigms: (1) standalone generative models ("creative oracle"), (2) basic retrieval-augmented systems ("expert archivist"), and (3) an advanced, end-to-end optimized RAG system ("rigorous archivist"). The authors introduce two reliability metrics, False Citation Rate (FCR) and Fabricated Fact Rate (FFR), and evaluate 2,700 judicial-style answers from 12 LLMs across 75 legal tasks using expert, double-blind review. Results show that standalone models are unsuitable for professional use (FCR above 30%), while basic RAG greatly reduces errors but still leaves notable misgrounding. Advanced RAG, using techniques such as embedding fine-tuning, re-ranking, and self-correction, reduces fabrication to negligible levels (below 0.2%). The study concludes that trustworthy legal AI requires rigor-focused, retrieval-based architectures emphasizing verification and traceability, and provides an evaluation framework applicable to other high-risk domains.
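The two metrics can be sketched as simple ratios over expert-annotated answers. This is a hypothetical illustration of how FCR and FFR might be computed; the exact operational definitions, annotation schema, and field names below are assumptions, not taken from the paper.

```python
# Hypothetical sketch of the two reliability metrics. Each answer carries
# expert annotations: citations marked valid/invalid, and factual claims
# marked fabricated or not. The schema here is an assumption for illustration.

def false_citation_rate(answers):
    """FCR: fraction of cited authorities judged false (nonexistent or
    not supporting the cited proposition) across all answers."""
    total = sum(len(a["citations"]) for a in answers)
    false = sum(1 for a in answers for c in a["citations"] if not c["valid"])
    return false / total if total else 0.0

def fabricated_fact_rate(answers):
    """FFR: fraction of factual claims flagged as fabricated by reviewers."""
    total = sum(len(a["claims"]) for a in answers)
    fabricated = sum(1 for a in answers for c in a["claims"] if c["fabricated"])
    return fabricated / total if total else 0.0

# Toy annotated data: 3 citations (1 false), 3 claims (1 fabricated).
answers = [
    {"citations": [{"valid": True}, {"valid": False}],
     "claims": [{"fabricated": False}, {"fabricated": False}]},
    {"citations": [{"valid": True}],
     "claims": [{"fabricated": True}]},
]
print(false_citation_rate(answers))   # 1 false citation out of 3
print(fabricated_fact_rate(answers))  # 1 fabricated claim out of 3
```

Reporting both rates separately matters: a system can cite real authorities yet fabricate facts, or vice versa, so neither metric subsumes the other.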
URL
https://arxiv.org/abs/2601.15476