Paper Reading AI Learner

Do Multi-Document Summarization Models Synthesize?

2023-01-31 18:40:46
Jay DeYoung, Stephanie C. Martinez, Iain J. Marshall, Byron C. Wallace

Abstract

Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key property or aspect. For example, a synopsis of film reviews all written about a particular movie should reflect the average critic consensus. As a more consequential example, consider narrative summaries that accompany biomedical systematic reviews of clinical trial results. These narratives should fairly summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this type of synthesis? To assess this, we perform a suite of experiments that probe the degree to which conditional generation models trained for summarization using standard methods yield outputs that appropriately synthesize inputs. We find that existing models do partially perform synthesis, but do so imperfectly. In particular, they are over-sensitive to changes in input ordering and under-sensitive to changes in input composition (e.g., the ratio of positive to negative movie reviews). We propose a simple, general method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate. This approach improves model synthesis performance. We hope that highlighting the need for synthesis (in some summarization settings) motivates further research into multi-document summarization methods and learning objectives that explicitly account for the need to synthesize.
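To make the proposed select-or-abstain idea concrete, below is a minimal, hypothetical Python sketch of the general recipe the abstract describes: sample a diverse set of candidate summaries, score each against the expected aggregate property of the inputs (e.g., mean review sentiment), and return the best-aligned candidate or abstain when none is close enough. The helpers `generate_diverse_candidates`, `estimate_property`, and `input_aggregate`, along with the `tolerance` threshold, are illustrative assumptions, not the paper's released code.

```python
from typing import Callable, List, Optional


def select_or_abstain(
    inputs: List[str],
    generate_diverse_candidates: Callable[[List[str], int], List[str]],
    estimate_property: Callable[[str], float],      # e.g., sentiment score of a summary
    input_aggregate: Callable[[List[str]], float],  # e.g., mean sentiment of the inputs
    num_candidates: int = 8,
    tolerance: float = 0.1,
) -> Optional[str]:
    """Pick the candidate whose estimated property is closest to the input aggregate,
    or abstain (return None) if no candidate falls within `tolerance` of it."""
    target = input_aggregate(inputs)
    candidates = generate_diverse_candidates(inputs, num_candidates)
    # Score each candidate by its distance from the expected aggregate measure.
    scored = [(abs(estimate_property(c) - target), c) for c in candidates]
    gap, best = min(scored, key=lambda pair: pair[0])
    return best if gap <= tolerance else None
```

In practice, the candidate generator could be any conditional generation model decoded with diversity-promoting settings, and the property estimator could be any classifier or regressor for the aspect being synthesized; the sketch only fixes the selection and abstention logic.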


URL

https://arxiv.org/abs/2301.13844

PDF

https://arxiv.org/pdf/2301.13844.pdf

