Paper Reading AI Learner

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

2024-04-09 17:30:48
Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen

Abstract

Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' ability to handle extremely long documents. As various long-text techniques and model architectures emerge, precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets from open-source datasets and focus mainly on QA and summarization tasks. These datasets mix test samples of widely varying lengths (from 2k to 32k+ tokens), making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultra-long settings (100k+ tokens) that the latest LLMs claim to support. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long-context capabilities. These benchmarks support fine-grained control over the length of test cases and can easily produce text samples of up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at this https URL.
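As a rough illustration of what a length-adaptable test case might look like, the sketch below assembles a TSort-style sample up to a chosen token budget and shuffles its segments so the task is to recover the original order. This is a hypothetical reconstruction, not the paper's actual implementation: `build_tsort_sample` and its parameters are illustrative, and whitespace-split word counts stand in for real tokenization.

```python
import random

def build_tsort_sample(segments, target_tokens, n_chunks=4, seed=0):
    """Build a hypothetical TSort-style test case of about `target_tokens` tokens.

    Tokens are approximated by whitespace-split words here; a real benchmark
    would count tokens with the evaluated model's tokenizer.
    Returns (shuffled_chunks, order), where order[j] is the original index
    of the j-th shuffled chunk.
    """
    # Greedily take source segments until the token budget is reached,
    # which is how the sample length can be dialed up or down.
    taken, used = [], 0
    for seg in segments:
        n = len(seg.split())
        if used + n > target_tokens:
            break
        taken.append(seg)
        used += n
    words = " ".join(taken).split()

    # Split the assembled text into equal-sized chunks and shuffle them;
    # the model must output the permutation restoring the original order.
    size = max(1, len(words) // n_chunks)
    chunks = [" ".join(words[i:i + size])
              for i in range(0, len(words), size)][:n_chunks]
    order = list(range(len(chunks)))
    random.Random(seed).shuffle(order)
    shuffled = [chunks[i] for i in order]
    return shuffled, order
```

Because the chunk count is fixed while the token budget varies, the same task definition scales smoothly from short contexts to ultra-long ones, which is the property the benchmark relies on.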

URL

https://arxiv.org/abs/2404.06480

PDF

https://arxiv.org/pdf/2404.06480.pdf

