Abstract
The diversity across outputs generated by large language models shapes the perception of their quality and utility. Prompt leaks, templated answer structure, and canned responses across different interactions are readily noticed by people, but there is no standard score to measure this aspect of model behavior. In this work we empirically investigate diversity scores on English texts. We find that computationally efficient compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap homogeneity scores. Further, a combination of measures -- compression ratios, self-repetition of long $n$-grams, Self-BLEU, and BERTScore -- is sufficient to report, as these measures have low mutual correlation. The applicability of the scores extends beyond the analysis of generative models; for example, we highlight applications to instruction-tuning datasets and human-produced texts. We release a diversity score package to facilitate research and invite consistency across reports.
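The compression-based idea can be illustrated with a minimal sketch (this is an assumption about the general approach, not the authors' released package; the function name compression_ratio and the exact ratio definition are illustrative choices): gzip-compress a concatenation of generated outputs and compare raw to compressed size, so that templated or repetitive collections yield a higher ratio.

# Minimal sketch (assumed definition, not the paper's package): a gzip-based
# redundancy score over a collection of texts. Higher ratio = more redundancy,
# i.e. lower diversity.
import gzip

def compression_ratio(texts: list[str]) -> float:
    """Ratio of raw byte length to gzip-compressed byte length."""
    raw = "\n".join(texts).encode("utf-8")
    compressed = gzip.compress(raw)
    return len(raw) / len(compressed)

if __name__ == "__main__":
    # Hypothetical toy data: canned vs. varied model responses.
    templated = ["As an AI model, I think option A is best."] * 50
    varied = [f"Response {i}: a distinct answer about topic {i}." for i in range(50)]
    print(f"templated outputs: {compression_ratio(templated):.2f}")  # high ratio
    print(f"varied outputs:    {compression_ratio(varied):.2f}")     # lower ratio

Running the toy example prints a noticeably higher ratio for the templated set than for the varied set, which is the signal the abstract describes compression scores as capturing.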
Abstract (translated)
The diversity of outputs produced by large language models shapes how people perceive their quality and utility. Prompt leaks, templated answer structure, and canned responses across different interactions are readily noticed, yet there is no standard score for measuring this aspect of model behavior. In this work we empirically study diversity scores on English texts. We find that computationally efficient compression algorithms capture information similar to that measured by slow-to-compute $n$-gram overlap homogeneity scores. Furthermore, a combination of measures -- compression ratios, self-repetition of long $n$-grams, Self-BLEU, and BERTScore -- is sufficient to report, since these measures have low mutual correlation. The applicability of the scores extends beyond the analysis of generative models; for example, we highlight applications to instruction-tuning datasets and human-produced texts. We release a diversity score package to facilitate research and to encourage consistency across reports.
URL
https://arxiv.org/abs/2403.00553