Abstract
Modern large language models (LLMs) are often evaluated and deployed under a \emph{one-shot, greedy} inference protocol, especially in professional settings that require deterministic behavior. This regime can systematically underestimate a fixed model's true capability: many errors arise not from missing knowledge, but from premature commitment under internal ambiguity. We introduce \emph{Reinforcement Inference}, an entropy-aware inference-time control strategy that uses the model's own uncertainty to selectively invoke a second, more deliberate reasoning attempt, enabling stronger performance \emph{without any retraining}. On 12,032 MMLU-Pro questions across 14 subjects, using DeepSeek-v3.2 with deterministic decoding in a zero-shot setting, Reinforcement Inference improves accuracy from 60.72\% to 84.03\%, while incurring only 61.06\% additional inference calls. A 100\% re-asking ablation reaches 84.35\%, indicating that uncertainty-aware selection captures most of the attainable improvement with substantially less compute. Moreover, a \emph{prompt-only} ablation underperforms the baseline, suggesting that the gains are not explained by generic ``your output had high entropy, think step-by-step'' prompting alone. Beyond providing a practical inference-time upgrade, our results suggest a broader \emph{entropy-aware} paradigm for measuring and expanding model capability: because modern decoder-based models generate outputs autoregressively, entropy and related confidence measures arise naturally as first-class control signals during generation. The resulting gap between one-pass greedy inference and uncertainty-conditioned deliberation offers a diagnostic lens on an LLM's latent reasoning horizon and motivates future training objectives that explicitly constrain correctness--confidence alignment.
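The selection rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold value, the helper names (`ask`, `ask_deliberate`), and the use of mean per-token Shannon entropy over the answer tokens are all assumptions introduced here for concreteness.

```python
import math

# Hypothetical threshold (nats); the paper does not specify its gating value.
ENTROPY_THRESHOLD = 1.0

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reinforcement_inference(ask, ask_deliberate, question):
    """Entropy-gated two-pass inference (illustrative sketch).

    `ask` runs one-shot greedy decoding and returns (answer,
    per_token_probs); `ask_deliberate` runs a second, more careful
    reasoning attempt. The second call is made only when the first
    answer's mean token entropy signals internal ambiguity.
    """
    answer, per_token_probs = ask(question)
    mean_entropy = sum(token_entropy(p) for p in per_token_probs) / len(per_token_probs)
    if mean_entropy > ENTROPY_THRESHOLD:
        return ask_deliberate(question)
    return answer
```

Under this sketch, a confident first pass (probability mass concentrated on one token) is returned as-is, while a near-uniform answer distribution triggers the deliberate second attempt, which is what yields the reported compute savings relative to always re-asking.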
URL
https://arxiv.org/abs/2602.08520