Abstract
As Generative AI (GenAI), particularly inference, rapidly emerges as a dominant workload category, the Kubernetes ecosystem is proactively evolving to natively support its unique demands. This industry paper demonstrates how emerging Kubernetes-native projects can be combined to deliver the benefits of container orchestration, such as scalability and resource efficiency, to complex AI workflows. We implement and evaluate an illustrative, multi-stage use case consisting of automatic speech recognition and summarization. First, we address batch inference by using Kueue to manage jobs that transcribe audio files with Whisper models and Dynamic Accelerator Slicer (DAS) to increase parallel job execution. Second, we address a discrete online inference scenario by feeding the transcripts to a Large Language Model for summarization hosted with llm-d, a novel serving solution that builds on recent developments around the Kubernetes Gateway API Inference Extension (GAIE) to optimize the routing of inference requests. Our findings illustrate that these complementary components (Kueue, DAS, and GAIE) form a cohesive, high-performance platform, proving Kubernetes' capability to serve as a unified foundation for demanding GenAI workloads: Kueue reduced total makespan by up to 15%; DAS shortened mean job completion time by 36%; and GAIE improved Time to First Token by 82%.
URL
https://arxiv.org/abs/2602.04900