As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data were sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed the other templates, with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
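The pairwise tournament machinery can be sketched with plain Elo updates. This is a deliberate simplification: the study uses Glicko-2, which additionally tracks rating deviation and volatility per template; the k-factor and 400-point scale below are standard Elo conventions, not values from the paper.

```python
def expected_win(r_a, r_b):
    """Probability that template A's question beats template B's (Elo logistic curve)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rate_pair(r_a, r_b, score_a, k=32):
    """Update both ratings after one judged comparison.
    score_a is 1.0 if the judge preferred A's question, 0.0 if B's."""
    e_a = expected_win(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * (e_a - score_a)

# One judged comparison: both templates start at 1500 and A wins,
# so A gains exactly the points B loses (zero-sum update).
r_a, r_b = rate_pair(1500.0, 1500.0, 1.0)
```

Iterating `rate_pair` over all judged question pairs yields the final per-template ratings from which pairwise win probabilities are read off via `expected_win`.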
https://arxiv.org/abs/2601.16134
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.
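The affective state machinery can be sketched in a few lines, assuming simple exponential-smoothing (first-order) and momentum (second-order) updates over a VAD vector clipped to [-1, 1]; the coefficients and the estimator producing the instantaneous signal are illustrative assumptions, not the paper's values.

```python
import numpy as np

def first_order_update(state, signal, alpha=0.3):
    """Exponential smoothing: the state drifts toward the instantaneous VAD signal."""
    return (1 - alpha) * state + alpha * signal

def second_order_update(state, velocity, signal, alpha=0.3, momentum=0.8):
    """Momentum dynamics: velocity accumulates, producing affective inertia."""
    velocity = momentum * velocity + alpha * (signal - state)
    state = np.clip(state + velocity, -1.0, 1.0)  # keep VAD components in [-1, 1]
    return state, velocity

state = np.zeros(3)                     # neutral Valence-Arousal-Dominance
velocity = np.zeros(3)
signal = np.array([-0.8, 0.6, -0.2])    # e.g. the memoryless estimate for a hostile turn
for _ in range(5):                      # repeated turns with the same signal
    state, velocity = second_order_update(state, velocity, signal)
```

With momentum, the valence component overshoots the -0.8 target before settling, a toy version of the hysteresis the paper reports; a stateless agent would simply echo `signal` each turn.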
https://arxiv.org/abs/2601.16087
Characters in novels have typically been modeled based on their presence in scenes in the narrative, considering aspects like their actions, named mentions, and dialogue. This conception of character places significant emphasis on the main character, who is present in the most scenes. In this work, we instead adopt a framing developed from a new literary theory proposing a six-component structural model of character. This model enables a comprehensive approach to character that accounts for the narrator-character distinction and includes a component neglected by prior methods: discussion by other characters. We compare general-purpose LLMs with task-specific transformers for operationalizing this model of character on major 19th-century British realist novels. Our methods yield both component-level and graph representations of character discussion. We then demonstrate that these representations allow us to approach literary questions at scale from a new computational lens. Specifically, we explore Woloch's classic "the one vs. the many" theory of character centrality and the gendered dynamics of character discussion.
https://arxiv.org/abs/2601.15508
Reflexive Thematic Analysis (RTA) is a critical method for generating deep interpretive insights. Yet its core tenets, including researcher reflexivity, tangible analytical evolution, and productive disagreement, are often poorly supported by software tools that prioritize speed and consensus over interpretive depth. To address this gap, we introduce Reflexis, a collaborative workspace that centers these practices. It supports reflexivity by integrating in-situ reflection prompts, makes code evolution transparent and tangible, and scaffolds collaborative interpretation by turning differences into productive, positionality-aware dialogue. Results from our paired-analyst study (N=12) indicate that Reflexis encouraged participants toward more granular reflection and reframed disagreements as productive conversations. The evaluation also surfaced key design tensions, including a desire for higher-level, networked memos and more user control over the timing of proactive alerts. Reflexis contributes a design framework for tools that prioritize rigor and transparency to support deep, collaborative interpretation in an age of automation.
https://arxiv.org/abs/2601.15445
User interactions with language models vary due to static properties of the user (trait) and the specific context of the interaction (state). However, existing persona datasets (such as PersonaChat and PANDORA) capture only trait and ignore the impact of state. We introduce Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, each measured across multiple contexts. Using the Chameleon dataset, we present three key findings. First, inspired by Latent State-Trait theory, we decompose variance and find that 74% is within-person (state) while only 26% is between-person (trait). Second, we find that LLMs are state-blind: they focus on trait only, and produce similar responses regardless of state. Third, we find that reward models react to user state, but inconsistently: different models favor or penalize the same users in opposite directions. We release Chameleon to support research on affective computing, personalized dialogue, and RLHF alignment.
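The state/trait split can be computed with a standard within-/between-person variance decomposition over repeated measures. This is a generic sketch of that decomposition; Chameleon's exact estimator (e.g. any Latent State-Trait measurement-error corrections) is not reproduced here.

```python
import numpy as np

def variance_decomposition(profiles):
    """profiles maps a user to that user's scores for one scale across contexts.
    Returns (between-person variance, within-person variance); they sum to the
    total variance of the pooled scores."""
    scores = [x for xs in profiles.values() for x in xs]
    grand = np.mean(scores)
    n = len(scores)
    # between-person (trait): spread of each user's mean around the grand mean
    between = sum(len(xs) * (np.mean(xs) - grand) ** 2
                  for xs in profiles.values()) / n
    # within-person (state): spread of each user's scores around their own mean
    within = sum(((np.asarray(xs) - np.mean(xs)) ** 2).sum()
                 for xs in profiles.values()) / n
    return between, within
```

On such a decomposition, the paper's headline finding corresponds to `within` accounting for roughly 74% of the total and `between` only 26%.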
https://arxiv.org/abs/2601.15395
We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a "silent failure" because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.
https://arxiv.org/abs/2601.15220
Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are particularly challenging as they often manifest as subtle clinical errors that evade detection by generic metrics, while expert-authored fine-grained rubrics remain costly to construct and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria. Evaluated on HealthBench, our framework achieves a Clinical Intent Alignment (CIA) score of 60.12%, a statistically significant improvement over the GPT-4o baseline (55.16%). In discriminative tests, our rubrics yield a mean score delta ($\mu_{\Delta} = 8.658$) and an AUROC of 0.977, nearly doubling the quality separation achieved by the GPT-4o baseline (4.972). Beyond evaluation, our rubrics effectively guide response refinement, improving quality by 9.2% (from 59.0% to 68.2%). This provides a scalable and transparent foundation for both evaluating and improving medical LLMs. The code is available at this https URL.
https://arxiv.org/abs/2601.15161
Multi-agent systems (MAS) composed of large language models often exhibit improved problem-solving performance despite operating on identical information. In this work, we provide a formal explanation for this phenomenon grounded in operator theory and constrained optimization. We model each agent as enforcing a distinct family of validity constraints on a shared solution state, and show that a MAS implements a factorized composition of constraint-enforcement operators. Under mild conditions, these dynamics converge to invariant solution sets defined by the intersection of agent constraint sets. Such invariant structures are generally not dynamically accessible to a single agent applying all constraints simultaneously, even when expressive capacity and information are identical. We extend this result from exact constraint enforcement to soft constraints via proximal operators, and apply the formalism to contemporary text-based dialog systems.
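A toy numerical analogue of the operator-composition argument: two agents each enforce one affine validity constraint via Euclidean projection, and their sequential (factorized) composition converges to the intersection of the constraint sets. The constraints and starting point below are arbitrary illustrations, not anything from the paper.

```python
import numpy as np

# Two affine validity constraints, one per agent:
#   C1 = {x : x[0] = 1}          (agent 1's notion of a valid shared state)
#   C2 = {x : x[0] + x[1] = 3}   (agent 2's)
def enforce_c1(x):
    y = x.copy()
    y[0] = 1.0                   # Euclidean projection onto C1
    return y

def enforce_c2(x):
    a = np.array([1.0, 1.0])
    return x - ((x @ a - 3.0) / (a @ a)) * a   # Euclidean projection onto C2

x = np.array([4.0, -2.0])        # initial shared solution state
for _ in range(60):              # factorized composition of constraint operators
    x = enforce_c2(enforce_c1(x))
# the iterates converge to the intersection C1 ∩ C2 = {(1, 2)}
```

For affine sets this is von Neumann's alternating projections; soft constraints would replace the exact projections with proximal operators, which relax each step rather than enforce it exactly.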
https://arxiv.org/abs/2601.15077
Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, yet systematic evaluation resources for this task remain limited. To bridge this gap, we introduce PodBench, a benchmark comprising 800 samples with inputs up to 21K tokens and complex multi-speaker instructions. We propose a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Extensive experiments reveal that while proprietary models generally excel, open-source models equipped with explicit reasoning demonstrate superior robustness in handling long contexts and multi-speaker coordination compared to standard baselines. However, our analysis uncovers a persistent divergence where high instruction following does not guarantee high content substance. PodBench offers a reproducible testbed to address these challenges in long-form, audio-centric generation.
https://arxiv.org/abs/2601.14903
Memory-augmented language agents rely on embedding models for effective memory retrieval. However, existing training data construction overlooks a critical limitation: the hierarchical difficulty of negative samples and their natural distribution in human-agent interactions. In practice, some negatives are semantically close distractors while others are trivially irrelevant, and natural dialogue exhibits structured proportions of these types. Current approaches using synthetic or uniformly sampled negatives fail to reflect this diversity, limiting embedding models' ability to learn nuanced discrimination essential for robust memory retrieval. In this work, we propose a principled data construction framework HiNS that explicitly models negative sample difficulty tiers and incorporates empirically grounded negative ratios derived from conversational data, enabling the training of embedding models with substantially improved retrieval fidelity and generalization in memory-intensive tasks. Experiments show significant improvements: on LoCoMo, F1/BLEU-1 gains of 3.27%/3.30%(MemoryOS) and 1.95%/1.78% (Mem0); on PERSONAMEM, total score improvements of 1.19% (MemoryOS) and 2.55% (Mem0).
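One way to realize difficulty-tiered negatives is to rank non-relevant memories by similarity to the query and sample from hard/medium/easy bands at fixed proportions. This is a minimal sketch; the 50/30/20 split and the equal-thirds tier boundaries below are illustrative placeholders, not the empirically derived ratios from the paper.

```python
import random

def sample_negatives(query_sim, n_neg=8, ratios=(0.5, 0.3, 0.2), seed=0):
    """query_sim: list of (memory_id, similarity-to-query) for non-relevant memories.
    Tiers by similarity: hard (close distractors), medium, easy (trivially irrelevant).
    ratios gives the share of negatives drawn from each tier."""
    rng = random.Random(seed)
    ranked = sorted(query_sim, key=lambda p: -p[1])   # most query-like first
    k = len(ranked) // 3
    tiers = [ranked[:k], ranked[k:2 * k], ranked[2 * k:]]
    negatives = []
    for tier, r in zip(tiers, ratios):
        take = min(len(tier), round(n_neg * r))
        negatives += rng.sample(tier, take)
    return negatives
```

The sampled negatives would then be paired with the query and its positive memory to form contrastive training triples for the embedding model.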
https://arxiv.org/abs/2601.14857
Designing good reflection questions is pedagogically important but time-consuming and unevenly supported across teachers. This paper introduces a reflection-in-reflection framework for automated generation of reflection questions with large language models (LLMs). Our approach coordinates two role-specialized agents, a Student-Teacher and a Teacher-Educator, that engage in a Socratic multi-turn dialogue to iteratively refine a single question given a teacher-specified topic, key concepts, student level, and optional instructional materials. The Student-Teacher proposes candidate questions with brief rationales, while the Teacher-Educator evaluates them along clarity, depth, relevance, engagement, and conceptual interconnections, responding only with targeted coaching questions or a fixed signal to stop the dialogue. We evaluate the framework in an authentic lower-secondary ICT setting on the topic, using GPT-4o-mini as the backbone model and a stronger GPT-4-class LLM as an external evaluator in pairwise comparisons of clarity, relevance, depth, and overall quality. First, we study how interaction design and context (dynamic vs. fixed iteration counts; presence or absence of student level and materials) affect question quality. Dynamic stopping combined with contextual information consistently outperforms fixed 5- or 10-step refinement, with very long dialogues prone to drift or over-complication. Second, we show that our two-agent protocol produces questions that are judged substantially more relevant and deeper, and better overall, than a one-shot baseline using the same backbone model.
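The coaching loop reduces to a propose-evaluate cycle with a dynamic stop signal. The function below is only a structural sketch: the agents' prompts, the LLM calls, and the exact stop token are abstracted into callables and a placeholder "STOP" string.

```python
def refine_question(propose, evaluate, max_turns=10):
    """Socratic refinement of a single reflection question.
    propose(feedback)  -> (question, rationale)   # Student-Teacher agent
    evaluate(question) -> coaching text, or "STOP" to end the dialogue  # Teacher-Educator
    """
    feedback = None
    question = None
    for _ in range(max_turns):          # dynamic stopping: the educator decides when to halt
        question, _rationale = propose(feedback)
        feedback = evaluate(question)
        if feedback == "STOP":
            break
    return question
```

`max_turns` caps runaway dialogues, matching the finding that very long refinements drift or over-complicate; in the paper's best configuration the stop decision, not the cap, usually ends the loop.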
https://arxiv.org/abs/2601.14798
Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using the pipeline, we construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. Code and demos are available at this https URL.
https://arxiv.org/abs/2601.14777
Large language models (LLMs) are increasingly deployed as intelligent tutoring systems, yet research on optimizing LLMs specifically for educational contexts remains limited. Recent works have proposed reinforcement learning approaches for training LLM tutors, but these methods focus solely on optimizing visible responses while neglecting the model's internal thinking process. We introduce PedagogicalRL-Thinking, a framework that extends pedagogical alignment to reasoning LLMs in education through two novel approaches: (1) Pedagogical Reasoning Prompting, which guides internal reasoning using domain-specific educational theory rather than generic instructions; and (2) Thinking Reward, which explicitly evaluates and reinforces the pedagogical quality of the model's reasoning traces. Our experiments reveal that domain-specific, theory-grounded prompting outperforms generic prompting, and that Thinking Reward is most effective when combined with pedagogical prompting. Furthermore, models trained only on mathematics tutoring dialogues show improved performance on educational benchmarks not seen during training, while preserving the base model's factual knowledge. Our quantitative and qualitative analyses reveal that pedagogical thinking reward produces systematic reasoning trace changes, with increased pedagogical reasoning and more structured instructional decision-making in the tutor's thinking process.
https://arxiv.org/abs/2601.14560
Multi-agent systems (MAS) have recently emerged as promising socio-collaborative companions for emotional and cognitive support. However, these systems frequently suffer from persona collapse--where agents revert to generic, homogenized assistant behaviors--and social sycophancy, which produces redundant, non-constructive dialogue. We propose MASCOT, a generalizable framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that finetunes individual agents for strict persona fidelity to prevent identity loss; and 2) Collaborative Dialogue Optimization, a meta-policy guided by group-level rewards to ensure diverse and productive discourse. Extensive evaluations across psychological support and workplace domains demonstrate that MASCOT significantly outperforms state-of-the-art baselines, achieving improvements of up to +14.1 in Persona Consistency and +10.6 in Social Contribution. Our framework provides a practical roadmap for engineering the next generation of socially intelligent multi-agent systems.
https://arxiv.org/abs/2601.14230
Drones operating in human-occupied spaces suffer from insufficient communication mechanisms that create uncertainty about their intentions. We present HoverAI, an embodied aerial agent that integrates drone mobility, infrastructure-independent visual projection, and real-time conversational AI into a unified platform. Equipped with a MEMS laser projector, onboard semi-rigid screen, and RGB camera, HoverAI perceives users through vision and voice, responding via lip-synced avatars that adapt appearance to user demographics. The system employs a multimodal pipeline combining VAD, ASR (Whisper), LLM-based intent classification, RAG for dialogue, face analysis for personalization, and voice synthesis (XTTS v2). Evaluation demonstrates high accuracy in command recognition (F1: 0.90), demographic estimation (gender F1: 0.89, age MAE: 5.14 years), and speech transcription (WER: 0.181). By uniting aerial robotics with adaptive conversational AI and self-contained visual output, HoverAI introduces a new class of spatially-aware, socially responsive embodied agents for applications in guidance, assistance, and human-centered interaction.
https://arxiv.org/abs/2601.13801
Memory-augmented conversational agents enable personalized interactions using long-term user memory and have gained substantial traction. However, existing benchmarks primarily focus on whether agents can recall and apply user information, while overlooking whether such personalization is used appropriately. In fact, agents may overuse personal information, producing responses that feel forced, intrusive, or socially inappropriate to users. We refer to this issue as over-personalization. In this work, we formalize over-personalization into three types: Irrelevance, Repetition, and Sycophancy, and introduce OP-Bench, a benchmark of 1,700 verified instances constructed from long-horizon dialogue histories. Using OP-Bench, we evaluate multiple large language models and memory-augmentation methods, and find that over-personalization is widespread when memory is introduced. Further analysis reveals that agents tend to retrieve and over-attend to user memories even when unnecessary. To address this issue, we propose Self-ReCheck, a lightweight, model-agnostic memory filtering mechanism that mitigates over-personalization while preserving personalization performance. Our work takes an initial step toward more controllable and appropriate personalization in memory-augmented dialogue systems.
https://arxiv.org/abs/2601.13722
Existing dynamic Theory of Mind (ToM) benchmarks mostly place language models in a passive role: the model reads a sequence of connected scenarios and reports what people believe, feel, intend, and do as these states change. In real social interaction, ToM is also used for action: a speaker plans what to say in order to shift another person's mental-state trajectory toward a goal. We introduce SocialMindChange, a benchmark that moves from tracking minds to changing minds in social interaction. Each instance defines a social context with four characters and five connected scenes. The model plays one character and generates dialogue across the five scenes to reach the target while remaining consistent with the evolving states of all participants. SocialMindChange also includes selected higher-order states. Using a structured four-step framework, we construct 1,200 social contexts, covering 6,000 scenarios and over 90,000 questions, each validated for realism and quality. Evaluations on ten state-of-the-art LLMs show that their average performance is 54.2% below human performance. This gap suggests that current LLMs still struggle to maintain and change mental-state representations across long, linked interactions.
https://arxiv.org/abs/2601.13687
LLM-driven Anomaly Detection (AD) helps enhance the understanding and explanation of anomalous behaviors in Time Series (TS). Existing methods face challenges of inadequate reasoning ability, deficient multi-turn dialogue capability, and narrow generalization. To this end, we 1) propose a multi-agent-based TS Evolution algorithm named TSEvol. On top of it, we 2) introduce the AD reasoning and multi-turn dialogue dataset TSEData-20K and contribute the ChatAD chatbot family for AD, including ChatAD-Llama3-8B, Qwen2.5-7B, and Mistral-7B. Furthermore, 3) we propose the TS Kahneman-Tversky Optimization (TKTO) to enhance ChatAD's cross-task generalization capability. Lastly, 4) we propose an LLM-driven, learning-based AD benchmark, LLADBench, to evaluate the performance of ChatAD and nine baselines across seven datasets and tasks. Our three ChatAD models achieve substantial gains, up to 34.50% in accuracy, 34.71% in F1, and a 37.42% reduction in false positives. Moreover, via TKTO, our optimized ChatAD achieves competitive performance in reasoning and cross-task generalization on classification, forecasting, and imputation.
https://arxiv.org/abs/2601.13546
Linguistic expressions of emotions such as depression, anxiety, and trauma-related states are pervasive in clinical notes, counseling dialogues, and online mental health communities, and accurate recognition of these emotions is essential for clinical triage, risk assessment, and timely intervention. Although large language models (LLMs) have demonstrated strong generalization ability in emotion analysis tasks, their diagnostic reliability in high-stakes, context-intensive medical settings remains highly sensitive to prompt design. Moreover, existing methods face two key challenges: emotional comorbidity, in which multiple intertwined emotional states complicate prediction, and inefficient exploration of clinically relevant cues. To address these challenges, we propose APOLO (Automated Prompt Optimization for Linguistic Emotion Diagnosis), a framework that systematically explores a broader and finer-grained prompt space to improve diagnostic efficiency and robustness. APOLO formulates instruction refinement as a Partially Observable Markov Decision Process and adopts a multi-agent collaboration mechanism involving Planner, Teacher, Critic, Student, and Target roles. Within this closed-loop framework, the Planner defines an optimization trajectory, while the Teacher-Critic-Student agents iteratively refine prompts to enhance reasoning stability and effectiveness, and the Target agent determines whether to continue optimization based on performance evaluation. Experimental results show that APOLO consistently improves diagnostic accuracy and robustness across domain-specific and stratified benchmarks, demonstrating a scalable and generalizable paradigm for trustworthy LLM applications in mental healthcare.
https://arxiv.org/abs/2601.13481
Although effective teamwork and communication are critical to surgical safety, structured training for non-technical skills (NTS) remains limited compared with technical simulation. The ACS/APDS Phase III Team-Based Skills Curriculum calls for scalable tools that both teach and objectively assess these competencies during laparoscopic emergencies. We introduce the Virtual Operating Room Team Experience (VORTeX), a multi-user virtual reality (VR) platform that integrates immersive team simulation with large language model (LLM) analytics to train and evaluate communication, decision-making, teamwork, and leadership. Team dialogue is analyzed using structured prompts derived from the Non-Technical Skills for Surgeons (NOTSS) framework, enabling automated classification of behaviors and generation of directed interaction graphs that quantify communication structure and hierarchy. Two laparoscopic emergency scenarios, pneumothorax and intra-abdominal bleeding, were implemented to elicit realistic stress and collaboration. Twelve surgical professionals completed pilot sessions at the 2024 SAGES conference, rating VORTeX as intuitive, immersive, and valuable for developing teamwork and communication. The LLM consistently produced interpretable communication networks reflecting expected operative hierarchies, with surgeons as central integrators, nurses as initiators, and anesthesiologists as balanced intermediaries. By integrating immersive VR with LLM-driven behavioral analytics, VORTeX provides a scalable, privacy-compliant framework for objective assessment and automated, data-informed debriefing across distributed training environments.
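Once the LLM has classified each utterance, the directed interaction graph can be represented minimally as edge counts over (speaker, addressee) pairs, with degree totals as a crude centrality measure. The pair-extraction step itself and the example turns are assumed for illustration.

```python
from collections import Counter

def interaction_graph(turns):
    """turns: list of (speaker, addressee) pairs extracted from classified dialogue.
    Returns directed edge weights plus out-/in-degree totals per role."""
    edges = Counter(turns)
    out_deg = Counter(s for s, _ in turns)
    in_deg = Counter(a for _, a in turns)
    return edges, out_deg, in_deg

# Hypothetical classified turns from one simulated emergency scenario:
turns = [("surgeon", "nurse"), ("surgeon", "anesthesiologist"),
         ("nurse", "surgeon"), ("surgeon", "nurse")]
edges, out_deg, in_deg = interaction_graph(turns)
# here the surgeon has the highest out-degree, i.e. acts as the central integrator
```

Richer NOTSS-derived analytics would attach behavior labels to each edge, but the hierarchy findings (surgeons central, nurses initiating, anesthesiologists intermediating) are already readable from degree patterns like these.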
https://arxiv.org/abs/2601.13406