Paper Reading AI Learner

What is Wrong with Perplexity for Long-context Language Modeling?

2024-10-31 09:39:28
Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang

Abstract

Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose **LongPPL**, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that LongPPL strongly correlates with performance on various long-context benchmarks (e.g., Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce **LongCE** (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs. Code is available at this https URL.
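The abstract describes both methods only at a high level. The sketch below illustrates the long-short contrastive idea behind LongPPL and a LongCE-style reweighted loss, assuming a Hugging Face-style causal LM; the `short_len`, `gain_threshold`, and exponential-gain weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def token_logprobs(model, input_ids):
    """Log-probability of each token given all preceding tokens."""
    logits = model(input_ids).logits              # (1, seq_len, vocab)
    logp = F.log_softmax(logits[:, :-1], dim=-1)  # predicts tokens 1..L-1
    targets = input_ids[:, 1:]
    return logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (1, L-1)

@torch.no_grad()
def long_ppl(model, input_ids, short_len=4096, gain_threshold=2.0):
    """PPL averaged over 'key tokens' only: tokens whose log-probability
    improves markedly once the full long context is visible."""
    lp_long = token_logprobs(model, input_ids)
    # Re-score the tail of the sequence with only a truncated short context.
    lp_short = token_logprobs(model, input_ids[:, -short_len:])
    tail = lp_long[:, -lp_short.shape[1]:]        # align on shared targets
    gain = tail - lp_short                        # long-context log-prob gain
    key = gain > gain_threshold                   # hypothetical cutoff
    return torch.exp(-tail[key].mean())

def long_ce(lp_long_tail, lp_short, max_weight=5.0):
    """LongCE-style loss: up-weight tokens that benefit from long context.
    Here lp_long_tail must be computed with gradients enabled."""
    weight = torch.exp((lp_long_tail - lp_short).detach()).clamp(max=max_weight)
    return -(weight * lp_long_tail).sum() / weight.sum()
```

The appeal of the contrastive formulation is that key tokens are identified by the model itself, from the gap between long- and short-context predictions, so no manual labels are needed for either evaluation (LongPPL) or fine-tuning (LongCE).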

URL

https://arxiv.org/abs/2410.23771

PDF

https://arxiv.org/pdf/2410.23771.pdf

