Paper Reading AI Learner

SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning

2025-12-15 10:08:53
Emre Can Acikgoz, Jinoh Oh, Jie Hao, Joo Hyuk Jeon, Heng Ji, Dilek Hakkani-T\"ur, Gokhan Tur, Xiang Li, Chengyuan Ma, Xing Fan

Abstract

Effective human-agent collaboration is increasingly prevalent in real-world applications. Current collaborations are predominantly unidirectional: users provide instructions or pose questions, and agents respond directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents call for more proactive engagement, in which agents dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Prior work under-utilizes the conversational capabilities of language models (LMs), optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents' conversational capabilities by rewarding proactive interactions with users, such as asking the right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset of diverse task-oriented dialogue scenarios in which tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation that teaches agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns, even surpassing much larger proprietary models and demonstrating the promise of clarification-centric user-agent interactions.
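The abstract describes a reward that balances asking with acting. A minimal sketch of what such a reward *could* look like is shown below; the function name, weights, and signals are purely illustrative assumptions, not the paper's actual formulation:

```python
# Hypothetical sketch of a clarification-aware reward in the spirit the
# abstract describes: reward task success, reward a clarification question
# only when the request was ambiguous, penalize needless questions, and
# discourage inflated conversation length. All weights are illustrative.

def speak_reward(task_completed: bool,
                 asked_clarification: bool,
                 request_was_ambiguous: bool,
                 num_turns: int,
                 max_turns: int = 10) -> float:
    reward = 1.0 if task_completed else 0.0
    if asked_clarification and request_was_ambiguous:
        reward += 0.5   # asking the right question is rewarded
    elif asked_clarification and not request_was_ambiguous:
        reward -= 0.5   # unnecessary questions are penalized
    elif not asked_clarification and request_was_ambiguous:
        reward -= 0.5   # acting blindly on an ambiguous request
    # mild per-turn penalty so clarification does not inflate dialogue length
    reward -= 0.05 * max(0, num_turns - max_turns)
    return reward
```

Under this sketch, an agent that completes the task after correctly clarifying an ambiguous request scores highest, while one that asks when no ambiguity exists is penalized, which is one plausible way to operationalize "balancing asking with acting."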


URL

https://arxiv.org/abs/2512.13159

PDF

https://arxiv.org/pdf/2512.13159.pdf
