Abstract
While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.
URL
https://arxiv.org/abs/2512.11724