Abstract
Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We found that this latency is dominated by the time the LLM takes to generate the first sentence, which the TTS system requires as input because it synthesizes audio responses sentence by sentence. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method reduces latency by around 2x across a wide range of use cases, while incurring only minimal additional computation at input time, computation that would otherwise go unused.
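A minimal sketch of the input-time speculation idea as the abstract describes it. The helper `generate_first_sentence` and the chunked-transcript loop are hypothetical stand-ins for an LLM call and streaming ASR output, not the paper's actual implementation; the real method uses speculative decoding, which can also verify and reuse partially matching candidates, whereas this toy only reuses a candidate on an exact transcript match.

```python
# Sketch of input-time speculation (PredGen-style), assuming hypothetical
# helpers; `generate_first_sentence` stands in for an LLM call and the
# chunk loop stands in for streaming ASR output while the user speaks.

def generate_first_sentence(prompt: str) -> str:
    """Placeholder for an LLM call that decodes until the first sentence
    boundary -- the unit the TTS system needs before audio can start."""
    return f"(first sentence of a reply to: {prompt!r})"

def respond_with_prediction(transcript_chunks) -> str:
    """Speculate on a reply while input is still arriving, then reuse the
    speculation if the final transcript matches what it was based on."""
    transcript = ""
    candidate = None          # speculated first sentence of the reply
    candidate_prompt = None   # transcript the candidate was generated from

    for chunk in transcript_chunks:   # user is still speaking
        transcript += chunk
        # Draft a first sentence from the partial transcript, using compute
        # that would otherwise sit idle during the user's turn.
        candidate_prompt = transcript
        candidate = generate_first_sentence(candidate_prompt)

    # User finished. Best case: the last speculation saw the full
    # transcript, so its output goes straight to TTS with near-zero delay.
    if candidate is not None and candidate_prompt == transcript:
        return candidate
    # Otherwise fall back to normal decoding from the final transcript.
    return generate_first_sentence(transcript)

if __name__ == "__main__":
    # Each element mimics an incremental ASR transcript update.
    print(respond_with_prediction(["What's the ", "weather ", "in Paris?"]))
```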
URL
https://arxiv.org/abs/2506.15556