Abstract
Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups; unlike supervised fine-tuning, reinforcement learning can optimize this composite reward directly. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
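A minimal sketch of how such a composite reward might be computed for one cohort. The exact functional form, weights, and bonus/penalty definitions are not given in the abstract, so the function name and coefficients below are hypothetical assumptions, not the paper's implementation:

    def cohort_reward(correct_flags, retrieval_scores, rejection_flags,
                      lam_retrieval=0.1, lam_reject=0.1):
        # correct_flags:    1 if a cohort member's answer is correct, else 0
        # retrieval_scores: per-member scores crediting effective problem decomposition
        # rejection_flags:  1 if a lookup was trivial or invalid, else 0
        # lam_retrieval, lam_reject: hypothetical trade-off coefficients (assumed)
        n = len(correct_flags)
        cohort_accuracy = sum(correct_flags) / n
        retrieval_bonus = sum(retrieval_scores) / n
        rejection_penalty = sum(rejection_flags) / n
        # Composite objective: reward cohort-level accuracy and useful retrieval,
        # penalize degenerate lookups; RL maximizes this scalar per cohort rollout.
        return cohort_accuracy + lam_retrieval * retrieval_bonus - lam_reject * rejection_penalty

    # Example: a 3-question cohort, all answered correctly, modest retrieval credit,
    # no rejected lookups:
    # cohort_reward([1, 1, 1], [0.5, 0.7, 0.6], [0, 0, 0])  ->  1.06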
Abstract (translated)
Large language models perform well on many tasks but still struggle to reason consistently and robustly. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective that combines cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups; this objective can be optimized directly by reinforcement learning, which supervised fine-tuning cannot do. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn improves both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
URL
https://arxiv.org/abs/2506.15662