Paper Reading AI Learner

X-lifecycle Learning for Cloud Incident Management using LLMs

2024-02-15 06:19:02
Drishti Goel, Fiza Husain, Aditya Singh, Supriyo Ghosh, Anjaly Parayil, Chetan Bansal, Xuchao Zhang, Saravan Rajmohan

Abstract

Incident management for large cloud services is a complex and tedious process and requires significant amount of manual efforts from on-call engineers (OCEs). OCEs typically leverage data from different stages of the software development lifecycle [SDLC] (e.g., codes, configuration, monitor data, service properties, service dependencies, trouble-shooting documents, etc.) to generate insights for detection, root causing and mitigating of incidents. Recent advancements in large language models [LLMs] (e.g., ChatGPT, GPT-4, Gemini) created opportunities to automatically generate contextual recommendations to the OCEs assisting them to quickly identify and mitigate critical issues. However, existing research typically takes a silo-ed view for solving a certain task in incident management by leveraging data from a single stage of SDLC. In this paper, we demonstrate that augmenting additional contextual data from different stages of SDLC improves the performance of two critically important and practically challenging tasks: (1) automatically generating root cause recommendations for dependency failure related incidents, and (2) identifying ontology of service monitors used for automatically detecting incidents. By leveraging 353 incident and 260 monitor dataset from Microsoft, we demonstrate that augmenting contextual information from different stages of the SDLC improves the performance over State-of-The-Art methods.

Abstract (translated)

事件管理对于大型云计算服务来说是一个复杂且繁琐的过程,需要从维护人员(OMCs)那里进行大量的手动努力。OMCs通常利用软件开发生命周期(SDLC)的不同阶段(例如代码、配置、监控数据、服务属性、服务依赖关系、故障排除文档等)的数据来生成用于检测、根本原因分析和缓解事件的洞察。近年来,大型语言模型(LLMs)[例如 ChatGPT、GPT-4 和 Gemini]的进步为OMCs提供了自动为它们生成上下文建议的机会,从而快速识别和缓解关键问题。然而,现有的研究通常通过利用SDLC的单个阶段的数据来解决特定任务。在本文中,我们证明了从SDLC的不同阶段增加额外上下文数据可以提高两个关键且具有实际挑战性的任务:(1)自动生成与依赖关系故障相关的根本原因建议,(2)自动检测事件使用的服务监控的语义网络。通过利用来自微软的353个事件和260个监控数据集,我们证明了从SDLC的不同阶段增加上下文信息可以提高现有方法的性能。

URL

https://arxiv.org/abs/2404.03662

PDF

https://arxiv.org/pdf/2404.03662.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot