
I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey

2024-04-16 08:37:36
Noah Lewis, Jean Luca Bez, Suren Byna

Abstract

High-Performance Computing (HPC) systems excel in managing distributed workloads, and the growing interest in Artificial Intelligence (AI) has resulted in a surge in demand for faster methods of Machine Learning (ML) model training and inference. In the past, research on HPC I/O focused on optimizing the underlying storage system for modeling and simulation applications and checkpointing the results, causing writes to be the dominant I/O operation. These applications typically access large portions of the data written by simulations or experiments. ML workloads, in contrast, perform small I/O reads spread across a large number of random files. This shift of I/O access patterns poses several challenges to HPC storage systems. In this paper, we survey I/O in ML applications on HPC systems, and target literature within a 6-year time window from 2019 to 2024. We provide an overview of the common phases of ML, review available profilers and benchmarks, examine the I/O patterns encountered during ML training, explore I/O optimizations utilized in modern ML frameworks and proposed in recent literature, and lastly, present gaps requiring further R&D. We seek to summarize the common practices used in accessing data by ML applications and expose research gaps that could spawn further R&D.
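The contrast the abstract draws between the two access patterns can be illustrated with a small sketch. It is not taken from the survey: the function names (`write_checkpoint`, `make_samples`, `read_epoch`), the file naming scheme, and all sizes are hypothetical, chosen only to show one large sequential write next to many small reads in shuffled file order.

```python
import os
import random
import tempfile


def write_checkpoint(path, size_bytes, chunk_bytes=1 << 20):
    """Simulation-style I/O: one large file written sequentially in chunks."""
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            n = min(chunk_bytes, size_bytes - written)
            f.write(b"\0" * n)
            written += n
    return written


def make_samples(directory, num_files, sample_bytes):
    """Create a toy dataset of many small files (hypothetical layout)."""
    paths = []
    for i in range(num_files):
        path = os.path.join(directory, f"sample_{i:05d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(sample_bytes))
        paths.append(path)
    return paths


def read_epoch(paths, seed):
    """Training-style I/O: read every small file once, in shuffled order."""
    order = list(paths)
    random.Random(seed).shuffle(order)
    total = 0
    for path in order:
        with open(path, "rb") as f:
            total += len(f.read())
    return order, total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # One 8 MiB sequential write, as a checkpoint might produce.
        print(write_checkpoint(os.path.join(tmp, "ckpt.bin"), 8 << 20))
        # 1000 reads of 4 KiB each, in a different order per seed.
        samples = make_samples(tmp, num_files=1000, sample_bytes=4096)
        _, total_bytes = read_epoch(samples, seed=0)
        print(total_bytes)
```

On a parallel file system, the second pattern typically generates far more metadata operations (open and close calls) per byte transferred than the first, which is one reason the shift is challenging for storage designed around large writes.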

URL

https://arxiv.org/abs/2404.10386

PDF

https://arxiv.org/pdf/2404.10386.pdf

