Abstract
Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities, and the underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that, by averaging over all tokens, PPL overlooks the key tokens essential for long-context understanding and thereby obscures models' true performance in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens, identifying them with a long-short context contrastive method. Our experiments demonstrate that LongPPL correlates strongly with performance on various long-context benchmarks (e.g., a Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs. Code is available at this https URL.
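To make the long-short contrastive idea concrete, here is a minimal Python sketch. Everything beyond what the abstract states is an assumption: the `log_prob(token, context)` interface, the short-context length, the gain threshold, and the sigmoid weighting used for the LongCE-style weights are hypothetical illustrative choices, not the paper's exact definitions (see the linked paper for those).

```python
import math

SHORT_LEN = 128   # hypothetical short-context window (in tokens)
THRESHOLD = 2.0   # hypothetical log-prob gain (nats) marking a "key" token

def longppl(log_prob, tokens):
    """Perplexity computed only over key tokens: tokens whose
    log-probability improves by more than THRESHOLD when conditioned
    on the full long context rather than a truncated short one.
    `log_prob(token, context)` must return log p(token | context)
    under a causal LM (a stand-in for any real model API)."""
    key_log_probs = []
    for i in range(1, len(tokens)):
        lp_long = log_prob(tokens[i], tokens[:i])                        # full context
        lp_short = log_prob(tokens[i], tokens[max(0, i - SHORT_LEN):i])  # truncated context
        if lp_long - lp_short > THRESHOLD:   # long-short contrast: key token
            key_log_probs.append(lp_long)
    if not key_log_probs:
        return float("nan")
    return math.exp(-sum(key_log_probs) / len(key_log_probs))

def longce_weights(log_prob, tokens):
    """Per-token weights for a re-weighted cross-entropy in the spirit
    of LongCE: tokens that benefit more from long context get larger
    weights. The sigmoid squashing is an illustrative choice."""
    weights = []
    for i in range(1, len(tokens)):
        gain = (log_prob(tokens[i], tokens[:i])
                - log_prob(tokens[i], tokens[max(0, i - SHORT_LEN):i]))
        weights.append(1.0 / (1.0 + math.exp(-gain)))  # in (0, 1)
    return weights
```

In fine-tuning, weights of this kind would multiply the per-token cross-entropy terms, so the loss emphasizes tokens that genuinely require long-range context rather than averaging uniformly over all of them.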
URL
https://arxiv.org/abs/2410.23771