Abstract
As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports are not only long but also make extensive use of numbers and tables. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Command. We find that GPT-3.5 and Command fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summaries and identify a position bias in LLMs. For Claude, this position bias disappears after shuffling the input, which suggests that Claude can recognize important information wherever it appears. We also conduct a comprehensive investigation into the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucinations. We employ prompt engineering to improve GPT-4's use of numbers, with limited success. Overall, our analyses highlight Claude 2's stronger capability, relative to GPT-4, in handling long multimodal inputs.
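The abstract does not spell out how the position-bias analysis is computed; the sketch below is a minimal, illustrative way to operationalize it, not the paper's actual framework. It aligns each summary sentence to its most lexically similar source sentence and reports where the aligned sentences sit in the report; comparing the resulting position distribution for a summary of the original report against one generated from a shuffled report would distinguish positional bias from genuine salience. The file names (report.txt, summary.txt), the unigram-overlap similarity, and all function names are assumptions for illustration.

import random
import re


def sentences(text: str) -> list[str]:
    # Very rough sentence splitter; a real pipeline would use spaCy or NLTK.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap(a: str, b: str) -> float:
    # Unigram overlap of a summary sentence with a source sentence,
    # used here as a crude proxy for extractive similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)


def aligned_positions(source: list[str], summary: list[str]) -> list[float]:
    # For each summary sentence, return the relative position (0.0-1.0)
    # of its most similar source sentence in the report.
    positions = []
    for s in summary:
        best = max(range(len(source)), key=lambda i: overlap(s, source[i]))
        positions.append(best / max(len(source) - 1, 1))
    return positions


if __name__ == "__main__":
    # Usage sketch: run once on the original report and its LLM summary, and
    # once on a shuffled report paired with the summary generated from that
    # shuffled input. If the pile-up of positions near 0.0 disappears in the
    # shuffled run, the lead-heavy behavior was positional rather than content-driven.
    report = sentences(open("report.txt").read())    # hypothetical input file
    summary = sentences(open("summary.txt").read())  # hypothetical LLM output
    print("aligned source positions:", aligned_positions(report, summary))
    random.shuffle(report)  # shuffled-order baseline (summary would be regenerated in practice)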
URL
https://arxiv.org/abs/2404.06162