Abstract
While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, a problem known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness of a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) integrating and reasoning over information from two or more short segments. By applying this information-intensive training to Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess FILM-7B's ability to utilize long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). Github Link: this https URL.
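The data-construction idea behind IN2 training can be illustrated with a minimal sketch: a short "key" segment that answers the question is placed at a random position among unrelated filler segments, so the supervision signal can originate anywhere in the long context. This is an assumption-laden illustration, not the paper's actual pipeline; the function name `build_in2_example` and its parameters are hypothetical.

```python
import random

def build_in2_example(key_segment, filler_segments,
                      context_len_tokens=4096, seg_len_tokens=128):
    """Hypothetical sketch of IN2-style data synthesis.

    Assembles a synthetic long context in which `key_segment` (a short,
    ~128-token passage needed to answer the question) is inserted at a
    random position among unrelated filler segments, so that training
    emphasizes that any position may hold crucial information.
    Returns the context string and the index where the key was placed.
    """
    # Rough segment budget for the target context length (one slot for the key).
    n_fillers = max(0, context_len_tokens // seg_len_tokens - 1)
    segments = random.sample(filler_segments,
                             min(n_fillers, len(filler_segments)))
    # Place the key segment uniformly at random, including the ends.
    insert_at = random.randint(0, len(segments))
    segments.insert(insert_at, key_segment)
    return " ".join(segments), insert_at
```

Pairing each such context with a question answerable only from the key segment (or, for the multi-hop variant, from two or more key segments) yields training examples that penalize ignoring mid-context information.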
URL
https://arxiv.org/abs/2404.16811