Paper Reading AI Learner

Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability

2025-06-18 17:00:54
Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

Abstract

In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
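For intuition, below is a minimal sketch (in Python) of how an ordered-coverage check could work. This is an illustrative assumption, not the authors' implementation: the actual benchmark very likely handles morphological variants of concepts (e.g., matching "caught" for "catch") and may aggregate scores differently, and the function name ordered_coverage and the regex-based tokenization are made up for this example.

import re

def ordered_coverage(concepts, sentence):
    """Fraction of concepts matched left-to-right in the prescribed order."""
    tokens = re.findall(r"[a-z]+", sentence.lower())  # crude word tokenizer
    cursor = 0   # only search to the right of the previous match, enforcing order
    matched = 0
    for concept in concepts:
        try:
            cursor = tokens.index(concept.lower(), cursor) + 1
            matched += 1
        except ValueError:
            pass  # concept missing, or it appears only before the cursor (out of order)
    return matched / len(concepts)

# All three concepts appear in the requested order -> 1.0
print(ordered_coverage(["dog", "ball", "park"],
                       "A dog chased the ball across the park."))

# "ball" is requested before "dog" but generated after it -> 2/3
print(ordered_coverage(["ball", "dog", "park"],
                       "A dog chased the ball across the park."))

A greedy left-to-right match like this penalizes concepts that appear out of the requested order as well as concepts that are missing entirely, which is the behavior the abstract's notion of "ordered coverage" suggests.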

Abstract (translated)

In generative commonsense reasoning tasks (such as CommonGen), generative large language models (LLMs) construct sentences that contain all of the given concepts. However, when focusing on instruction-following ability, if a prompt specifies an order for the concepts, the LLM must generate a sentence that follows that specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. The benchmark measures ordered coverage to assess whether a model generates the concepts in the prescribed order, thereby evaluating both abilities simultaneously. We conducted a comprehensive analysis using 36 different LLMs and found that, although most LLMs generally understand the intent of the instructions, biases toward particular concept-order patterns often lead to low output diversity, or to identical outputs even after the concept order is changed. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvement in both instruction-following and compositional generalization capabilities.

URL

https://arxiv.org/abs/2506.15629

PDF

https://arxiv.org/pdf/2506.15629.pdf

