Abstract
Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional: users provide instructions or pose questions to agents, and agents respond directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents call for more proactive engagement, in which agents dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Prior work under-utilizes the conversational capabilities of language models (LMs), optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents' conversational capabilities by rewarding proactive interactions with users, such as asking the right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset of diverse scenarios from task-oriented dialogues in which tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns, even surpassing much larger proprietary models, demonstrating the promise of clarification-centric user-agent interactions.
URL
https://arxiv.org/abs/2512.13159