Computational musicology enables systematic analysis of performative and structural traits in recorded music, yet existing approaches remain largely tailored to notated, score-based repertoires. This study advances a methodology for analyzing voice-guitar interaction in Carlos Paredes's vocal collaborations - an oral-tradition context where compositional and performative layers co-emerge. Using source-separated stems, physics-informed harmonic modelling, and beat-level audio descriptors, we examine melodic, harmonic, and rhythmic relationships across eight recordings with four singers. Our commonality-diversity framework, combining multi-scale correlation analysis with residual-based detection of structural deviations, reveals that expressive coordination is predominantly piece-specific rather than corpus-wide. Diversity events systematically align with formal boundaries and textural shifts, demonstrating that the proposed approach can identify musically salient reorganizations with minimal human annotation. The framework further offers a generalizable computational strategy for repertoires without notated blueprints, extending Music Performance Analysis into oral-tradition and improvisation-inflected practices.
https://arxiv.org/abs/2603.12854
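The residual-based detection of structural deviations described above can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: the linear "commonality" model, the z-score statistic, and the 2-sigma cutoff are all assumptions.

```python
import numpy as np

def diversity_events(x, y, z_thresh=2.0):
    """Flag beats where the voice-guitar relationship deviates from
    its piece-level norm. x, y: beat-level descriptor series for the
    two source-separated stems. A global linear fit of y on x stands
    in for the 'commonality' model; beats whose residual exceeds
    z_thresh standard deviations are flagged as diversity events."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a, b = np.polyfit(x, y, 1)          # commonality model (assumed linear)
    resid = y - (a * x + b)
    z = (resid - resid.mean()) / (resid.std() + 1e-9)
    return [i for i in range(len(z)) if abs(z[i]) > z_thresh]
```

A flagged index would then be checked against annotated formal boundaries, as the abstract reports the authors do.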
Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents (Interpreter, Planner, Checker, and Painter) that collaborate to improve compositional generation. The Interpreter adaptively decides between a direct text-to-image pathway and a layout-aware multi-agent process. In the layout-aware mode, it parses the prompt into attribute-rich object descriptors, ranks them by semantic salience, and groups objects with the same semantic priority level for joint generation. Guided by the Interpreter, the Planner adopts a divide-and-conquer strategy, incrementally proposing layouts for objects with the same semantic priority level while grounding decisions in the evolving visual context of the canvas. The Checker introduces an explicit error-correction mechanism by validating spatial consistency and attribute alignment, and refining layouts before they are rendered. Finally, the Painter synthesizes the image step by step, incorporating newly planned objects into the canvas to provide richer context for subsequent iterations. Together, these agents address three key challenges: reducing layout complexity, grounding planning in visual context, and enabling explicit error correction. Extensive experiments on the GenEval and DPG-Bench benchmarks demonstrate that coDrawAgents substantially improves text-image alignment, spatial accuracy, and attribute binding compared to existing methods.
https://arxiv.org/abs/2603.12829
Assistive robotics is an important subarea of robotics that focuses on the well-being of people with disabilities. A robotic guide dog is an assistive quadruped robot that helps visually impaired people with obstacle avoidance and navigation. Enabling language capabilities for robotic guide dogs goes beyond naively adding an existing dialog system onto a mobile robot. The novel challenges include grounding language in the dynamically changing environment and improving spatial awareness for the human handler. To address those challenges, we develop a novel dialog system for robotic guide dogs that uses LLMs to verbalize both navigational plans and scenes. The goal is to enable verbal communication for collaborative decision-making within the handler-robot team. In experiments, we conducted a human study to evaluate different verbalization strategies and a simulation study to assess efficiency and accuracy in navigation tasks.
https://arxiv.org/abs/2603.12574
Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at this https URL.
https://arxiv.org/abs/2603.12572
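The zero-shot retrieval tasks in a benchmark like LMEB reduce to ranking memory entries by embedding similarity and scoring against gold labels. A minimal sketch of that evaluation loop, assuming cosine similarity and a single gold entry per query (LMEB's exact metric may differ):

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, relevant, k=5):
    """Rank memory entries by cosine similarity to each query and
    report Recall@k. query_vecs: (Q, d); doc_vecs: (N, d);
    relevant: one gold entry index per query."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                          # (Q, N) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [rel in row for rel, row in zip(relevant, topk)]
    return sum(hits) / len(hits)
```

Running this per dataset and averaging over LMEB's 193 tasks would give the kind of model-level comparison the abstract reports.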
SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.
https://arxiv.org/abs/2603.12565
Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
https://arxiv.org/abs/2603.12564
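The safety-penalized NDCG variant (sNDCG) mentioned above can be sketched as standard DCG minus a discounted penalty for risk-inappropriate items, normalized by the ideal DCG. The penalty form and the alpha weight are assumptions for illustration; the paper's exact definition may differ.

```python
import math

def sndcg(ranked, rel, unsafe, alpha=1.0):
    """Safety-penalized NDCG sketch. ranked: item ids in ranked
    order; rel: id -> graded relevance; unsafe: set of ids deemed
    risk-inappropriate for this user. The penalty subtracts the
    discounted mass of unsafe items, floored at zero."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [rel.get(d, 0.0) for d in ranked]
    penalty = sum(1.0 / math.log2(i + 2)
                  for i, d in enumerate(ranked) if d in unsafe)
    ideal = dcg(sorted(rel.values(), reverse=True))
    if ideal == 0:
        return 0.0
    return max(0.0, dcg(gains) - alpha * penalty) / ideal
```

Under contamination, a ranking can keep its relevance gains (NDCG near 1.0) while unsafe items drag sNDCG down, which is exactly the evaluation-blindness gap the abstract describes.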
Early language development shapes children's later literacy and learning, yet many families have limited access to scalable, high-quality support at home. Recent advances in generative AI make it possible for social robots to move beyond scripted interactions and engage children in adaptive, conversational activities, but it remains unclear how to design such systems for pre-schoolers and how children engage with them over time in the home. We present ELLA (Early Language Learning Agent), an autonomous, generative AI-powered social robot that supports early language development through interactive storytelling, parent-selected language targets, and scaffolded dialogue. Using a multi-phased, human-centered process, we interviewed parents (n=7) and educators (n=5) and iteratively refined ELLA through twelve in-home design workshops. We then deployed ELLA with ten children for eight days. We report design insights from in-home workshops, characterize children's engagement and behaviors during deployment, and distill design implications for generative AI-powered social robots supporting early language learning at home.
https://arxiv.org/abs/2603.12508
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
https://arxiv.org/abs/2603.12266
Existing voice AI assistants treat every detected pause as an invitation to speak. This works in dyadic dialogue, but in multi-party settings, where an AI assistant participates alongside multiple speakers, pauses are abundant and ambiguous. An assistant that speaks on every pause becomes disruptive rather than useful. In this work, we formulate context-aware turn-taking: at every detected pause, given the full conversation context, our method decides whether the assistant should speak or stay silent. We introduce a benchmark of over 120K labeled conversations spanning three multi-party corpora. Evaluating eight recent large language models, we find that they consistently fail at context-aware turn-taking under zero-shot prompting. We then propose a supervised fine-tuning approach with reasoning traces, improving balanced accuracy by up to 23 percentage points. Our findings suggest that context-aware turn-taking is not an emergent capability; it must be explicitly trained.
https://arxiv.org/abs/2603.11409
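The balanced-accuracy improvement reported above matters because "stay silent" labels dominate at pauses in multi-party talk; plain accuracy would reward a model that never speaks. A sketch of the metric over speak/stay-silent decisions:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls for the binary speak (1) /
    stay-silent (0) decision at each detected pause, so the
    abundant silent class cannot dominate the score."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

A 23-percentage-point gain on this metric, as the abstract reports, reflects genuinely better discrimination on both classes rather than a shifted speaking threshold.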
Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve. We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.
https://arxiv.org/abs/2603.11281
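The Error Propagation Rate (EPR) above can be sketched as a conditional-probability ratio: how much more likely a wrong turn becomes once the preceding turn was wrong. The exact formula is an assumption for illustration; the paper's definition may differ.

```python
def error_propagation_rate(threads):
    """EPR sketch: P(wrong | previous wrong) / P(wrong | previous
    correct), pooled over conversation threads. Each thread is a
    list of booleans, True marking a wrong answer at that turn."""
    after_wrong = after_ok = wrong_after_wrong = wrong_after_ok = 0
    for t in threads:
        for prev, cur in zip(t, t[1:]):
            if prev:
                after_wrong += 1
                wrong_after_wrong += cur
            else:
                after_ok += 1
                wrong_after_ok += cur
    return (wrong_after_wrong / after_wrong) / (wrong_after_ok / after_ok)
```

Values in the 1.9-6.1x range, as reported, mean one wrong turn substantially conditions the rest of the thread.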
Ableist microaggressions remain pervasive in everyday interactions, yet interventions to help people recognize them are limited. We present an experiment testing how AI-mediated dialogue influences recognition of ableism. 160 participants completed a pre-test, intervention, and a post-test across four conditions: AI nudges toward bias (Bias-Directed), inclusion (Neutral-Directed), unguided dialogue (Self-Directed), and a text-only non-dialogue condition (Reading). Participants rated scenarios on standardness of social experience and emotional impact; those in dialogue-based conditions also provided qualitative reflections. Quantitative results showed dialogue-based conditions produced stronger recognition than Reading, though trajectories diverged: biased nudges improved differentiation of bias from neutrality but increased overall negativity. Inclusive or no nudges remained more balanced, while Reading participants showed weaker gains and even declines. Qualitative findings revealed biased nudges were often rejected, while inclusive nudges were adopted as scaffolding. We contribute a validated vignette corpus, an AI-mediated intervention platform, and design implications highlighting trade-offs conversational systems face when integrating bias-related nudges.
https://arxiv.org/abs/2603.11274
The emergence of large language model (LLM)-based agent frameworks has shifted the primary challenge in building domain-expert AI agents from raw capability to effective encoding of domain expertise. Two dominant paradigms -- code-first development, which embeds expertise in deterministic pipelines, and prompt-first development, which captures expertise in static system prompts -- both treat agent construction as a discrete engineering phase preceding deployment. We argue that this sequential assumption creates a fundamental mismatch with the nature of domain expertise, which is substantially tacit, deeply personal, and continuously evolving. We propose Nurture-First Development (NFD), a paradigm in which agents are initialized with minimal scaffolding and progressively grown through structured conversational interaction with domain practitioners. The central mechanism is the Knowledge Crystallization Cycle, whereby fragmented knowledge embedded in operational dialogue is periodically consolidated into structured, reusable knowledge assets. We formalize NFD through: (1) a Three-Layer Cognitive Architecture organizing agent knowledge by volatility and personalization degree; (2) the Knowledge Crystallization Cycle with formal definitions of crystallization operations and efficiency metrics; and (3) an operational framework comprising a Dual-Workspace Pattern and Spiral Development Model. We illustrate the paradigm through a detailed case study on building a financial research agent for U.S. equity analysis and discuss the conditions, limitations, and broader implications of NFD for human-agent co-evolution.
https://arxiv.org/abs/2603.10808
The alignment of large language models (LLMs) has progressed substantially in single-agent settings through paradigms such as RLHF and Constitutional AI, with recent work exploring scalable alternatives such as RLAIF and evolving alignment objectives. However, these approaches remain limited in multi-stakeholder settings, where conflicting values arise and deliberative negotiation capabilities are required. This work proposes a multi-agent negotiation-based alignment framework that aligns LLMs to Collective Agency (CA), an existing alignment objective introduced to promote the continual expansion of agency, while simultaneously improving conflict-resolution capability. To enable scalable training, two self-play instances of the same LLM, assigned opposing personas, engage in structured turn-based dialogue to synthesize mutually beneficial solutions. We generate synthetic moral-dilemma prompts and conflicting persona pairs, and optimize the policy via RLAIF using GRPO with an external LLM reward model. While rewards are computed from CA scores assigned to the final completion, gradients are applied to dialogue tokens to directly improve deliberative interaction dynamics. Experiments show that the resulting model achieves CA alignment comparable to a single-agent baseline while substantially improving conflict-resolution performance without degrading general language capabilities. These results suggest that negotiation-driven deliberation training provides a practical path toward LLMs that better support collective decision-making in value-conflict scenarios.
https://arxiv.org/abs/2603.10476
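The GRPO optimization above relies on group-relative advantages: each self-play dialogue in a group sampled from the same prompt is scored (here, by the external LLM reward model assigning CA scores), and advantages are computed relative to the group rather than a learned critic. A standard sketch of that advantage computation (population standard deviation used here; variants exist):

```python
def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's group of sampled
    completions: center by the group mean reward and scale by the
    group standard deviation, removing the need for a value model."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n                 # uninformative group
    return [(r - mean) / std for r in rewards]
```

In the framework described above, these advantages would then weight the policy-gradient update applied to the dialogue tokens themselves.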
The performance of task-oriented dialogue models is strongly tied to how well they track the dialogue state, which records and updates user information across multi-turn interactions. However, current multi-domain dialogue state tracking (DST) encounters two key challenges: the difficulty of effectively modeling dialogue history and the limited availability of annotated data, both of which hinder model performance. To tackle the aforementioned problems, we develop a dynamic knowledge fusion framework applicable to multi-domain DST. The model operates in two stages: first, an encoder-only network trained with contrastive learning encodes dialogue history and candidate slots, selecting relevant slots based on correlation scores; second, dynamic knowledge fusion leverages the structured information of selected slots as contextual prompts to enhance the accuracy and consistency of dialogue state tracking. This design enables more accurate integration of dialogue context and domain knowledge. Results obtained from multi-domain dialogue benchmarks indicate that our method notably improves both tracking accuracy and generalization, validating its capability in handling complex dialogue scenarios.
https://arxiv.org/abs/2603.10367
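The first stage above, selecting candidate slots by correlation score against the encoded dialogue history, can be sketched as a cosine-similarity top-k over contrastively trained embeddings. The slot names and top-k selection rule are illustrative assumptions.

```python
import numpy as np

def select_slots(history_vec, slot_vecs, slot_names, k=3):
    """Stage-one sketch: rank candidate slots by cosine correlation
    with the dialogue-history embedding and keep the top-k as
    structured context for the fusion stage."""
    h = history_vec / np.linalg.norm(history_vec)
    s = slot_vecs / np.linalg.norm(slot_vecs, axis=1, keepdims=True)
    scores = s @ h
    order = np.argsort(-scores)[:k]
    return [slot_names[i] for i in order]
```

The selected slots would then be serialized into the contextual prompt that stage two consumes.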
This technical report presents Sabiá-4 and Sabiazinho-4, a new generation of Portuguese language models with a focus on Brazilian Portuguese. The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal tasks, and function calling, and preference alignment. We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities including tool use and web navigation. Results show that Sabiá-4 and Sabiazinho-4 achieve a favorable cost-performance trade-off compared to other models, positioning them in the upper-left region of the pricing-accuracy chart. The models show improvements over previous generations in legal document drafting, multi-turn dialogue quality, and agentic task completion.
https://arxiv.org/abs/2603.10213
Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems. However, existing research on the behaviour of LLM-based multi-agents relies on ad hoc prompts and lacks a principled policy perspective. Different from reinforcement learning, we investigate whether prompts-as-actions can be parameterized to construct a lightweight policy, consisting of a sequence of state-action pairs, that influences conversational behaviours without training. Our framework regards prompts as actions executed by LLMs, and dynamically constructs prompts through five components based on the current state of the agent. To test the effectiveness of parameterized control, we evaluated the dialogue flow based on five indicators: responsiveness, rebuttal, evidence usage, non-repetition, and stance shift. We conduct experiments using different LLM-driven agents in two discussion scenarios of general public interest and show that prompt parameterization can influence the dialogue dynamics. This result shows that policy-parameterized prompts offer a simple and effective mechanism to influence the dialogue process, which will support research on multi-agent systems for social simulation.
https://arxiv.org/abs/2603.09890
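The prompt-as-action policy above maps the agent's current state to a prompt assembled from five components. A minimal sketch of that mapping, with illustrative component names (the paper's actual five components are not specified in the abstract):

```python
def build_prompt(state):
    """Policy step sketch: assemble the next-action prompt from
    five state-derived components. Keys below (persona, stance,
    last_utterance, evidence, action) are hypothetical placeholders
    for whatever the framework's five components actually are."""
    parts = [
        f"Role: {state['persona']}",
        f"Current stance: {state['stance']}",
        f"Last speaker said: {state['last_utterance']}",
        f"Evidence available: {'; '.join(state['evidence'])}",
        f"Next action: {state['action']}",
    ]
    return "\n".join(parts)
```

A policy is then just a sequence of such state-to-prompt steps, adjustable without any model training.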
Existing end-to-end modeling methods for modular task-oriented dialog systems are typically tailored to specific datasets, making it challenging to adapt to new dialog scenarios. In this work, we propose ESAinsTOD, a unified End-to-end Schema-Aware Instruction-tuning framework for general Task-Oriented Dialog modeling. This framework introduces a structured methodology to go beyond simply fine-tuning Large Language Models (LLMs), enabling flexible adaptation to various dialogue task flows and schemas. Specifically, we leverage full-parameter fine-tuning of LLMs and introduce two alignment mechanisms to make the resulting system both instruction-aware and schema-aware: (i) instruction alignment, which ensures that the system faithfully follows task instructions to complete various task flows from heterogeneous TOD datasets; and (ii) schema alignment, which encourages the system to make predictions adhering to the specified schema. In addition, we employ session-level end-to-end modeling, which allows the system to access the results of previously executed task flows within the dialogue history, to bridge the gap between the instruction-tuning paradigm and the real-world application of TOD systems. Empirical results show that while a fine-tuned LLM serves as a strong baseline, our structured approach provides significant additional benefits. In particular, our findings indicate that: (i) ESAinsTOD outperforms state-of-the-art models by a significant margin on end-to-end task-oriented dialog modeling benchmarks: CamRest676, In-Car and MultiWOZ; (ii) more importantly, it exhibits superior generalization capabilities across various low-resource settings, with the proposed alignment mechanisms significantly enhancing zero-shot performance; and (iii) our instruction-tuning paradigm substantially improves the model's robustness against data noise and cascading errors.
https://arxiv.org/abs/2603.09691
In VR interactions with embodied conversational agents, users' emotional intent is often conveyed more by how something is said than by what is said. However, most VR agent pipelines rely on speech-to-text processing, discarding prosodic cues and often producing emotionally incongruent responses despite correct semantics. We propose an emotion-context-aware VR interaction pipeline that treats vocal emotion as explicit dialogue context in an LLM-based conversational agent. A real-time speech emotion recognition model infers users' emotional states from prosody, and the resulting emotion labels are injected into the agent's dialogue context to shape response tone and style. Results from a within-subjects VR study (N=30) show significant improvements in dialogue quality, naturalness, engagement, rapport, and human-likeness, with 93.3% of participants preferring the emotion-aware agent.
https://arxiv.org/abs/2603.09324
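The core mechanism above, injecting the recognized vocal emotion as explicit dialogue context, amounts to inserting an emotion annotation into the message list before each LLM call. A sketch under the common chat-message convention (the field names and instruction wording are assumptions, not the paper's implementation):

```python
def with_emotion_context(messages, emotion, confidence):
    """Insert the speech-emotion-recognition result as a context
    message right after the system prompt, so the LLM can shape
    response tone and style to the user's vocal emotion."""
    tag = {
        "role": "system",
        "content": (f"The user currently sounds {emotion} "
                    f"(confidence {confidence:.2f}). Match your tone "
                    f"and style to this emotional state."),
    }
    return messages[:1] + [tag] + messages[1:]
```

The real-time SER model would supply `emotion` and `confidence` fresh at every user turn.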
Emotional Validation is a psychotherapy communication technique that involves recognizing, understanding, and explicitly acknowledging another person's feelings and actions, which strengthens alliance and reduces negative affect. To maximize the emotional support provided by validation, it is crucial to deliver it with appropriate timing and frequency. This study investigates validation timing detection from the speech perspective. Leveraging both paralinguistic and emotional information, we propose a paralinguistic- and emotion-aware model for validation timing detection without relying on textual context. Specifically, we first conduct continued self-supervised training and fine-tuning on different HuBERT backbones to obtain (i) a paralinguistics-aware Self-Supervised Learning (SSL) encoder and (ii) a multi-task speech emotion classification encoder. We then fuse these encoders and further fine-tune the combined model on the downstream validation timing detection task. Experimental evaluations on the TUT Emotional Storytelling Corpus (TESC) compare multiple models, fusion mechanisms, and training strategies, and demonstrate that the proposed approach achieves significant improvements over conventional speech baselines. Our results indicate that non-linguistic speech cues, when integrated with affect-related representations, carry sufficient signal to decide when validation should be expressed, offering a speech-first pathway toward more empathetic human-robot interaction.
https://arxiv.org/abs/2603.09307
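The fusion step above, combining the paralinguistics-aware SSL encoder with the emotion-classification encoder for the downstream timing decision, can be sketched as embedding concatenation feeding a binary head. This is the simplest fusion variant; the paper compares several mechanisms, and the weights here are placeholders for a trained head.

```python
import numpy as np

def fused_timing_score(paraling_emb, emotion_emb, w, b):
    """Concatenate the two encoders' utterance embeddings and apply
    a logistic head scoring whether this is a moment to express
    validation. w has length len(paraling_emb) + len(emotion_emb)."""
    z = np.concatenate([paraling_emb, emotion_emb])
    logit = float(w @ z + b)
    return 1.0 / (1.0 + np.exp(-logit))   # P(validate now)
```

In the described pipeline, both encoders and this head would be fine-tuned jointly on the TESC timing labels.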
Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth for every step becomes costly, especially due to long speech sequences. We propose SPAR-K, a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. SPAR-K introduces a speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth "refresh" steps mitigate distribution shift due to early exit. We evaluate our framework using Step-Audio-2-mini and GLM-4-Voice across four datasets spanning reasoning, factual QA, and dialogue tasks, measuring performance in terms of ASR transcription accuracy and perceptual quality. Experimental results demonstrate that SPAR-K largely preserves question-answering accuracy with a maximum accuracy drop of 0.82% while reducing average speech decoding depth by up to 11% on Step-Audio-2-mini and 5% on GLM-4-Voice, both with negligible changes in MOS and WER and no auxiliary computation overhead. We further demonstrate that confidence-based early exit strategies, widely used in text LLMs, are suboptimal for SLMs, highlighting that the unique statistical nature of speech tokens necessitates a specialized early exit design.
https://arxiv.org/abs/2603.09215
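The alternating-depth schedule above is a fixed, content-independent rule, which is what distinguishes it from confidence-based early exit. A sketch of the per-token depth decision, with illustrative layer indices and refresh period (the paper's actual settings are not given in the abstract):

```python
def exit_depth(position, token_type, full_depth=32,
               early_depth=16, refresh_every=8):
    """SPAR-K-style schedule sketch: text tokens always run the full
    stack; speech tokens exit at an intermediate layer, except at
    periodic full-depth refresh steps that counter the distribution
    shift early exit would otherwise accumulate."""
    if token_type == "text":
        return full_depth
    if position % refresh_every == 0:
        return full_depth              # periodic refresh step
    return early_depth
```

Because the rule depends only on position and modality, it adds no auxiliary computation, consistent with the overhead claim in the abstract.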