Paper Reading AI Learner

PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

2025-06-18 15:29:02
Shufan Li, Aditya Grover

Abstract

Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes the LLM to generate the first sentence, which is required as input by TTS systems that synthesize audio responses sentence by sentence. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method reduces latency by around 2x across a wide range of use cases, while incurring only minimal additional computation at input time, computation that would otherwise go unused.
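The core idea above, speculating on the response while the user is still speaking and reusing that work once the utterance ends, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_first_sentence` is a hypothetical stand-in for an LLM call, and the cache keyed on partial transcripts is an assumed simplification of the paper's speculative decoding scheme.

```python
# Minimal sketch of input-time speculation (hypothetical API, not PredGen's actual code).
# While the user is "speaking", each partial transcript triggers speculative generation
# of a candidate first sentence; this uses compute that would otherwise sit idle.
# When the utterance ends, a cached speculation matching the final transcript lets
# TTS start immediately; otherwise the system falls back to normal decoding.

def generate_first_sentence(prompt: str) -> str:
    """Stand-in for an LLM call returning the first sentence of a reply."""
    return f"Reply to: {prompt}"

class PredictiveGenerator:
    def __init__(self) -> None:
        # Maps a partial transcript to the first sentence speculated from it.
        self.cache: dict[str, str] = {}

    def on_partial_transcript(self, partial: str) -> None:
        # Called as the speech recognizer streams interim results.
        self.cache[partial] = generate_first_sentence(partial)

    def on_utterance_end(self, final: str) -> str:
        # Cache hit: the speculated sentence is reused with no extra latency.
        if final in self.cache:
            return self.cache[final]
        # Cache miss: generate normally (the latency PredGen aims to avoid).
        return generate_first_sentence(final)

gen = PredictiveGenerator()
for partial in ["What is", "What is the weather", "What is the weather today?"]:
    gen.on_partial_transcript(partial)
first_sentence = gen.on_utterance_end("What is the weather today?")
```

In a real system the verification step is subtler, since the final transcript rarely matches a speculated prefix exactly; the paper's contribution lies in making such speculation pay off on average.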


URL

https://arxiv.org/abs/2506.15556

PDF

https://arxiv.org/pdf/2506.15556.pdf
