Abstract
Large language models (LLMs) are increasingly integrated into legal applications, including judicial decision support, legal practice assistance, and public-facing legal services. While LLMs show strong potential in handling legal knowledge and tasks, their deployment in real-world legal settings raises critical concerns that go beyond surface-level accuracy: the soundness of their legal reasoning and trustworthiness issues such as fairness and reliability. Systematic evaluation of LLM performance on legal tasks has therefore become essential for responsible adoption. This survey identifies key challenges in evaluating LLMs on legal tasks, grounding them in real-world legal practice. We analyze the major difficulties in assessing LLM performance in the legal domain along three dimensions: outcome correctness, reasoning reliability, and trustworthiness. Building on these challenges, we review and categorize existing evaluation methods and benchmarks by task design, datasets, and evaluation metrics. We further discuss how far current approaches address these challenges, highlight their limitations, and outline future research directions toward more realistic, reliable, and legally grounded evaluation frameworks for LLMs in the legal domain.
URL
https://arxiv.org/abs/2601.15267