Paper Reading AI Learner

An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques

2025-07-07 15:34:05
Walid Mohamed Aly, Taysir Hassan A. Soliman, Amr Mohamed AbdelAziz

Abstract

Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques, including zero-shot and in-context learning, our study evaluates performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For long documents, we introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.
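The sentence-based chunking strategy described in the abstract can be sketched as a two-stage map-reduce procedure: split the document at sentence boundaries, pack sentences greedily into chunks that fit a model's context budget, summarize each chunk, then summarize the concatenated partial summaries. This is a minimal illustrative sketch, not the paper's implementation; the word budget, the naive regex sentence splitter, and the `summarize` callback are all assumptions for illustration.

```python
import re

def chunk_by_sentences(text, max_words=100):
    """Greedily pack whole sentences into chunks of at most
    max_words words, so each chunk fits a short context window.
    Uses a naive sentence split on terminal punctuation."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        # Close the current chunk if adding this sentence would overflow it.
        if current and count + words > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        chunks.append(' '.join(current))
    return chunks

def summarize_long_document(text, summarize, max_words=100):
    """Multi-stage summarization: summarize each chunk independently,
    then summarize the concatenation of the partial summaries.
    `summarize` is any callable mapping text -> summary (e.g. an LLM call)."""
    partials = [summarize(c) for c in chunk_by_sentences(text, max_words)]
    return summarize(' '.join(partials))
```

In practice the `summarize` callback would wrap an LLM API call with a summarization prompt, and the budget would be measured in model tokens rather than words; the greedy packing and two-stage reduction are the essential ideas.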


URL

https://arxiv.org/abs/2507.05123

PDF

https://arxiv.org/pdf/2507.05123.pdf


Tags
LLM Language_Model Summarization Zero-Shot Inference