Abstract
In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all of the given concepts. When instruction-following ability is also considered, however, a prompt may specify the order in which the concepts should appear, and the LLM must then generate a sentence that adheres to that order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. The benchmark measures ordered coverage, i.e., whether the concepts are generated in the specified order, enabling both abilities to be assessed simultaneously. We conducted a comprehensive analysis of 36 LLMs and found that, although LLMs generally understand the intent of the instructions, biases toward particular concept-order patterns often lead to low-diversity outputs or to identical outputs even when the concept order is changed. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
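As a rough illustration of what an ordered-coverage check might look like, the sketch below scans a generated sentence left to right and counts a concept only if it appears after the previously matched concept. This is a minimal, hypothetical approximation: the function name, the simple prefix matching via `\w*`, and the plain averaging are assumptions for illustration, not the paper's actual implementation (which may, for example, use lemmatization as in CommonGen-style evaluation).

```python
import re

def ordered_coverage(sentence: str, concepts: list[str]) -> float:
    """Hypothetical sketch: fraction of concepts that appear in the sentence
    in the same relative order as specified. Not the paper's exact metric."""
    text = sentence.lower()
    covered = 0
    cursor = 0  # only search to the right of the previously matched concept
    for concept in concepts:
        # crude morphological tolerance: "throw" also matches "throws"/"throwing"
        match = re.search(r"\b" + re.escape(concept.lower()) + r"\w*", text[cursor:])
        if match:
            covered += 1
            cursor += match.end()
    return covered / len(concepts)

# Concepts must appear in the order: dog -> frisbee -> catch -> throw
print(ordered_coverage(
    "A dog runs after the frisbee to catch what its owner throws.",
    ["dog", "frisbee", "catch", "throw"],
))  # -> 1.0
```

Under this sketch, a sentence that mentions every concept but in a permuted order would score below 1.0, which is the behavior the benchmark's ordered coverage is described as capturing.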
URL
https://arxiv.org/abs/2506.15629