Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS

2025-08-05 07:39:35
Vignesh Ethiraj, Ashwath David, Sidhanth Menon, Divya Vijay

Abstract

We introduce a low-latency telecom AI voice agent pipeline for real-time, interactive telecommunications use, enabling advanced voice AI for call center automation, intelligent IVR (Interactive Voice Response), and AI-driven customer support. The solution is built for telecom, combining four specialized models by NetoAI: TSLAM, a 4-bit quantized Telecom-Specific Large Language Model (LLM); T-VEC, a Telecom-Specific Embedding Model; TTE, a Telecom-Specific Automatic Speech Recognition (ASR) model; and T-Synth, a Telecom-Specific Text-to-Speech (TTS) model. These models enable highly responsive, domain-adapted voice AI agents supporting knowledge-grounded spoken interactions with low latency. The pipeline integrates streaming ASR (TTE), conversational intelligence (TSLAM), retrieval augmented generation (RAG) over telecom documents, and real-time TTS (T-Synth), setting a new benchmark for telecom voice assistants. To evaluate the system, we built a dataset of 500 human-recorded telecom questions from RFCs, simulating real telecom agent queries. This framework allows analysis of latency, domain relevance, and real-time performance across the stack. Results show that TSLAM, TTE, and T-Synth deliver real-time factors (RTF) below 1.0, supporting enterprise, low-latency telecom deployments. These AI agents -- powered by TSLAM, TTE, and T-Synth -- provide a foundation for next-generation telecom AI, enabling automated customer support, diagnostics, and more.
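The real-time factor (RTF) cited in the abstract is conventionally the ratio of processing time to the duration of the audio being processed, so a value below 1.0 means a stage runs faster than real time. The sketch below illustrates that conventional definition only. The stage names and timings are hypothetical placeholders, not results from the paper, and the abstract does not state how RTF is computed for the LLM stage (TSLAM), so that stage is left out.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (conventional RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_real_time(rtf: float) -> bool:
    """A stage keeps up with live audio when its RTF is below 1.0."""
    return rtf < 1.0


# Hypothetical (processing_seconds, audio_seconds) pairs, for illustration only.
stage_timings = {
    "asr": (1.2, 4.0),  # e.g. 1.2 s to transcribe a 4.0 s question
    "tts": (2.0, 5.0),  # e.g. 2.0 s to synthesize 5.0 s of speech
}

stage_rtfs = {
    name: real_time_factor(processing, audio)
    for name, (processing, audio) in stage_timings.items()
}

for name, rtf in stage_rtfs.items():
    status = "real-time" if is_real_time(rtf) else "slower than real-time"
    print(f"{name}: RTF = {rtf:.2f} ({status})")
```

An RTF below 1.0 for each stage is necessary for a responsive agent but does not bound end-to-end latency on its own, since the stages run in sequence and their delays add up unless they are streamed and overlapped.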

URL

https://arxiv.org/abs/2508.04721

PDF

https://arxiv.org/pdf/2508.04721.pdf
