Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization

2026-02-03 15:36:08
Sai Sindhur Malleni, Raúl Sevilla, Aleksei Vasilevskii, José Castillo Lema, André Bauer

Abstract

As Generative AI (GenAI), particularly inference, rapidly emerges as a dominant workload category, the Kubernetes ecosystem is proactively evolving to natively support its unique demands. This industry paper demonstrates how emerging Kubernetes-native projects can be combined to deliver the benefits of container orchestration, such as scalability and resource efficiency, to complex AI workflows. We implement and evaluate an illustrative, multi-stage use case consisting of automatic speech recognition and summarization. First, we address batch inference by using Kueue to manage jobs that transcribe audio files with Whisper models and Dynamic Accelerator Slicer (DAS) to increase parallel job execution. Second, we address a discrete online inference scenario by feeding the transcripts to a Large Language Model for summarization hosted using llm-d, a novel solution utilizing the recent developments around the Kubernetes Gateway API Inference Extension (GAIE) for optimized routing of inference requests. Our findings illustrate that these complementary components (Kueue, DAS, and GAIE) form a cohesive, high-performance platform, proving Kubernetes' capability to serve as a unified foundation for demanding GenAI workloads: Kueue reduced total makespan by up to 15%; DAS shortened mean job completion time by 36%; and GAIE improved Time to First Token by 82%.
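The batch-side metrics named in the abstract (total makespan and mean job completion time) can be illustrated with a minimal sketch. It assumes that completion time is measured from job submission to job finish and that each percentage is a relative reduction against a baseline. The `Job` class, the helper names, and all timestamps below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """One batch job with hypothetical submit/finish timestamps in seconds."""
    submitted: float
    finished: float


def makespan(jobs):
    """Time from the earliest submission to the latest finish."""
    return max(j.finished for j in jobs) - min(j.submitted for j in jobs)


def mean_completion_time(jobs):
    """Average of (finish - submit) over all jobs."""
    return sum(j.finished - j.submitted for j in jobs) / len(jobs)


def percent_reduction(baseline, improved):
    """Relative reduction of a lower-is-better metric, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Made-up runs: three jobs executed one after another versus with more parallelism.
baseline_run = [Job(0, 100), Job(0, 200), Job(0, 300)]
parallel_run = [Job(0, 100), Job(0, 150), Job(0, 200)]

makespan_gain = percent_reduction(makespan(baseline_run), makespan(parallel_run))
completion_gain = percent_reduction(
    mean_completion_time(baseline_run), mean_completion_time(parallel_run)
)
print(f"makespan reduced by {makespan_gain:.1f}%")
print(f"mean completion time reduced by {completion_gain:.1f}%")
```

The same `percent_reduction` helper would apply to Time to First Token if the reported improvement is read as a reduction in latency, which is an interpretation of the abstract's wording rather than something it states explicitly.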

URL

https://arxiv.org/abs/2602.04900

PDF

https://arxiv.org/pdf/2602.04900.pdf
