Abstract
Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets mix together test samples of varying lengths (from 2k to 32k+ tokens), making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultra-long settings (100k+ tokens) that the latest LLMs claim to support. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long-context capabilities. These benchmarks support fine-grained control over the length of test cases and can easily produce text samples of up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at this https URL.
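The abstract describes test cases whose length can be dialed up to a target token budget. As an illustrative sketch only (not the authors' code; all names are hypothetical, and a whitespace split stands in for a real tokenizer), one way to build such a length-controlled sample is to pad a key passage with distractor passages until the prompt approaches the target length:

```python
# Illustrative sketch of a length-adaptable test-case builder.
# Hypothetical names; a whitespace split is a crude stand-in for a
# real tokenizer such as the one the evaluated model actually uses.

def count_tokens(text: str) -> int:
    """Crude token proxy: whitespace-delimited words."""
    return len(text.split())

def build_length_controlled_sample(question: str,
                                   key_passage: str,
                                   distractors: list[str],
                                   target_tokens: int) -> str:
    """Concatenate distractor passages after the key passage until
    the sample approaches target_tokens, then append the question."""
    parts = [key_passage]
    budget = target_tokens - count_tokens(key_passage) - count_tokens(question)
    for d in distractors:
        cost = count_tokens(d)
        if cost > budget:
            break  # adding this distractor would exceed the budget
        parts.append(d)
        budget -= cost
    return "\n\n".join(parts) + "\n\n" + question

sample = build_length_controlled_sample(
    question="Which passage answers the question best?",
    key_passage="passage " * 50,
    distractors=["filler " * 100] * 20,
    target_tokens=500,
)
print(count_tokens(sample))  # stays at or below the 500-token target
```

Varying `target_tokens` (e.g. 2k, 32k, 128k) would yield the same task at different context lengths, which is the property that lets a length-adaptable benchmark isolate how accuracy degrades as the context grows.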
URL
https://arxiv.org/abs/2404.06480