Abstract
Many tasks require learned models to strategically gather relevant information over multiple rounds of interaction before actually acting on a task. Strategic information gathering requires models to know not only how to effectively acquire information, but also when to stop gathering information and make a decision, in order to avoid overthinking or getting derailed when acting. In this paper, we formalize this problem and introduce Counterfactuals and Reasoning for Termination (CaRT), an approach for teaching LLMs when to stop seeking information. To appropriately learn when to terminate, CaRT fine-tunes LLMs using counterfactual pairs of trajectories: one where termination is appropriate, and a minimally modified version of the same trajectory where it is not. CaRT trains the LLM to verbalize the rationale for the termination decision in either case, and imbues this capability into the base LLM via fine-tuning. We instantiate CaRT in two domains: interactive medical diagnosis and math problem solving. In both domains, we find that CaRT improves the efficiency of information gathering and task success rate compared to other fine-tuning methods.
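To make the counterfactual-pair idea concrete, the following is a minimal sketch of how such pairs might be turned into supervised fine-tuning examples. All function names, the data format, and the example histories are illustrative assumptions, not the paper's actual code or data.

```python
# Sketch: build two supervised examples from one counterfactual pair of
# trajectories, pairing each with a verbal rationale and a termination
# decision. The format and names here are hypothetical.

def make_pair_examples(traj_stop, traj_continue, rationale_stop, rationale_continue):
    """Turn one counterfactual pair into two fine-tuning examples.

    `traj_stop` is an interaction history where terminating is appropriate;
    `traj_continue` is a minimally modified version where it is not.
    Each target combines the verbal rationale with the decision label.
    """
    return [
        {"prompt": traj_stop,
         "target": f"{rationale_stop}\nDecision: TERMINATE"},
        {"prompt": traj_continue,
         "target": f"{rationale_continue}\nDecision: CONTINUE"},
    ]

# Illustrative pair for interactive medical diagnosis: the counterfactual
# trajectory is the same history minus the last, most informative exchange.
stop_history = (
    "Q: Any fever? A: Yes, 39C for two days.\n"
    "Q: Any cough? A: Yes, a productive cough.\n"
    "Q: Chest pain when breathing? A: Yes, sharp pain."
)
continue_history = (
    "Q: Any fever? A: Yes, 39C for two days.\n"
    "Q: Any cough? A: Yes, a productive cough."
)

examples = make_pair_examples(
    stop_history,
    continue_history,
    rationale_stop="Fever, productive cough, and pleuritic pain are enough to act on.",
    rationale_continue="Key localizing symptoms are still unknown; keep asking.",
)
print(len(examples))  # 2
```

Because the two trajectories differ only minimally, the pair isolates the termination decision itself as the learning signal, rather than differences in the underlying case.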
URL
https://arxiv.org/abs/2510.08517