Paper Reading AI Learner

Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models

2024-09-03 17:25:54
Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee

Abstract

The abilities of long-context language models (LMs) are often evaluated using the "Needle-in-a-Haystack" (NIAH) test, which comprises tasks designed to assess a model's ability to identify specific information ("needle") within large text sequences ("haystack"). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation, a critical aspect of applications such as design proposals and creative writing. To address this gap, we introduce a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models' ability to include specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or satisfy specific constraints, and we evaluate how well these elements are incorporated. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two generation-length settings (16K and 32K). Although these models performed well on NIAH benchmarks, none achieved satisfactory performance on SGT, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
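As a rough illustration of the evaluation protocol the abstract describes (not the authors' released code), a minimal SGT-style harness might prompt a long-context LM to generate long-form text that must mention a set of required events, then score how many of those events appear in the output. The function names, the prompt wording, and the substring-matching check below are all hypothetical simplifications; the paper's actual scenarios, instructions, and scoring may differ.

```python
# Minimal sketch of an SGT-style evaluation loop (hypothetical simplification,
# not the benchmark's actual implementation).

def build_prompt(scenario: str, events: list[str], target_tokens: int) -> str:
    """Compose a generation prompt that embeds the required events."""
    event_lines = "\n".join(f"- {e}" for e in events)
    return (
        f"Write a {scenario} of roughly {target_tokens} tokens.\n"
        f"The text must explicitly include each of these events:\n{event_lines}"
    )

def score_generation(text: str, events: list[str]) -> float:
    """Fraction of required events found in the generated text.
    A real evaluation would need semantic matching, not substring search."""
    found = sum(1 for e in events if e.lower() in text.lower())
    return found / len(events)

def evaluate(model_generate, scenario: str, events: list[str],
             target_tokens: int = 16_000) -> float:
    """`model_generate` is any callable mapping a prompt to generated text."""
    prompt = build_prompt(scenario, events, target_tokens)
    output = model_generate(prompt)
    return score_generation(output, events)

if __name__ == "__main__":
    # Toy stand-in for a long-context LM, for demonstration only.
    def fake_lm(prompt: str) -> str:
        return "... the hero finds the map ... later, the bridge collapses ..."

    required = ["the hero finds the map", "the bridge collapses", "a storm hits"]
    print(f"event coverage: {evaluate(fake_lm, 'adventure story', required):.2f}")
```

Under this framing, the paper's headline finding corresponds to event coverage degrading as `target_tokens` grows from 16K to 32K, even for models that score well on NIAH-style retrieval.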

URL

https://arxiv.org/abs/2409.02076

PDF

https://arxiv.org/pdf/2409.02076.pdf

