Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production
TL;DR Summary
Aegis, a fault diagnosis system for AI model training services, uses distributed data analysis to localize failures, reducing idle time and restarts while improving performance in production environments.
Abstract
Despite the success of diagnosis systems in traditional cloud computing, these systems are not suitable for pinpointing faults in AI model training cloud scenarios due to the differences in computing paradigms between traditional cloud computing and model training. As one of the largest cloud providers, we present Aegis, a fault diagnosis system specifically designed for AI model training service. We share our experience in the motivation, design, and evolution of Aegis. Keeping easy-to-deploy as the primary principle, Aegis Phase-1 started by enhancing existing general-purpose diagnosis systems. After several months of evolution, Aegis Phase-2 cogitatively customized the collective communication library for sophisticated failure localization in runtime without modifying customer code. Besides the failure localization, we further equipped Aegis with the capabilities on handling performance degradation and failure checking before delivery. Aegis has been deployed in our production training cloud service for one year. Aegis decreases more than 97% of the idle time wasted by diagnosis, 84% of the training task restart count, and 71% of the performance degradation.
In-depth Reading
English Analysis
Bibliographic Information
- Title: Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production
- Authors: Jianbo Dong, Kun Qian, Pengcheng Zhang, Zhilong Zheng, Liang Chen, Fei Feng, Yichi Xu, Yikai Zhu, Gang Lu, Xue Li, Zhihui Ren, Zhicheng Wang, Bin Luo, Peng Zhang, Yang Liu, Yanqing Chen, Yu Guan, Weicheng Wang, Chaojie Yang, Yang Zhang, Man Yuan, Hanyu Zhao, Yong Li, Zihan Zhao, Shan Li, Xianlong Zeng, Zhiping Yao, Binzhang Fu, Ennan Zhai, Wei Lin, Chao Wang, Dennis Cai
- Affiliations: Alibaba Cloud
- Journal/Conference: The context suggests this paper is intended for publication at a reputable systems conference, specifically NSDI (Networked Systems Design and Implementation), as indicated by nsdi25fall in the original source link, implying a high-tier venue for systems research.
- Publication Year: 2024 (based on internal references to 2024 events and the deployment of Aegis Phase-2 in June 2024).
- Abstract: Despite the advancements in diagnosis systems for traditional cloud computing, these are often inadequate for the unique challenges presented by AI model training cloud scenarios due to fundamental differences in computing paradigms. This paper introduces Aegis, a fault diagnosis system specifically engineered for AI model training services, sharing insights into its motivation, design, and evolutionary phases. Starting with Aegis Phase-1, which prioritized easy-to-deploy principles by enhancing existing general-purpose diagnosis systems, the system evolved into Aegis Phase-2. The latter cognitively customized the collective communication library (CCL) for sophisticated runtime failure localization without requiring modifications to customer code. Beyond failure localization, Aegis was further equipped to handle performance degradation and implement failure checking before delivery (CBD). Deployed in a production training cloud service for one year, Aegis has demonstrated significant improvements: decreasing idle time wasted by diagnosis by over 97%, reducing training task restart counts by 84%, and mitigating performance degradation by 71%.
- Original Source Link: https://ennanzhai.github.io/pub/nsdi25fall-aegis.pdf (a pre-print/technical report link, likely for an upcoming conference)
Executive Summary
Background & Motivation (Why)
The proliferation of large-scale AI model training has introduced new and complex challenges for cloud service providers. Traditional fault diagnosis systems, effective in general cloud computing, are often ill-suited for these AI model training environments. This is due to fundamental differences in computing paradigms, particularly the reliance on tightly synchronized collective communication among thousands of GPUs.
The core problem is the difficulty in pinpointing faults and performance degradation efficiently and accurately in production-scale AI model training clusters. This problem is crucial because:
- High Failure Rates: High-tier GPUs and high-speed network components used in AI training exhibit significantly higher failure rates than traditional hardware.
- Cascading Failures: Due to the synchronous nature of model training, a single-point failure can rapidly propagate, leading to cascading failures across the entire cluster.
- Ambiguous Error Indicators: Failures often manifest as generic CCL timeouts or widespread error reports, obscuring the true root cause amidst a flood of secondary issue errors.
- Inefficiency of Existing Tools: Prior diagnosis systems either require customer code modification (which is impractical for public cloud providers due to privacy and deployment challenges) or are designed for pre-deployment checks (offline diagnosis), leading to significant idle time and resource wastage during runtime failures.

The paper's novel approach is Aegis, a fault diagnosis system specifically designed for AI model training services in a production cloud environment. It aims to provide runtime diagnosis without modifying customer code and to evolve its capabilities based on real-world operational experience.
Main Contributions / Findings (What)
The paper presents the design, evolution, and evaluation of Aegis, highlighting several key contributions:
- Phased Evolution of Diagnosis: Aegis evolved through two main phases:
  - Phase-1: Enhanced existing general-purpose diagnosis systems by incorporating training-output logs and developing a training-specific runtime diagnosis procedure (using CriticalError() and DistError()), backed by a comprehensive offline diagnosis mechanism for complex cases. This phase addressed host-side critical failures first, recognizing their frequent misinterpretation as network issues.
  - Phase-2: Introduced procedure-aware diagnosis by customizing the Collective Communication Library (CCL). This allowed for sophisticated runtime failure localization (distinguishing between computation and communication failures) without modifying customer code, addressing the limitations of offline diagnosis and improving GPU utilization.
- Performance Degradation Diagnosis: Aegis includes capabilities to identify and diagnose performance degradation through basic correlating diagnosis (using Z-Score outlier analysis on various metrics) and enhanced procedure-aware diagnosis (leveraging customized CCL metrics such as duration and network throughput).
- Proactive Failure Prevention (Check Before Delivery - CBD): A Check Before Delivery (CBD) procedure was implemented to efficiently check hosts for potential issues before they are delivered to customers, significantly reducing task failures during the initialization phase and preventing unnecessary retries.
- Significant Production Impact: Aegis has been deployed in a production LLM training cloud service at Alibaba Cloud for over a year, demonstrating:
  - Over 97% reduction in idle time wasted by diagnosis.
  - 84% decrease in training task restart counts.
  - 71% reduction in performance degradation.
- Operational Insights and Lessons Learned: The paper shares valuable real-world experiences from operating large-scale AI training clusters, including challenges with batch link failures, heterogeneous devices and multiple sale modes, congestion control issues, and the paramount importance of customer privacy.
Prerequisite Knowledge & Related Work
Foundational Concepts
To understand the paper, several core concepts related to AI model training, cloud computing, and network systems are essential:
- AI Model Training: The process of feeding data to an AI model (e.g., a large language model (LLM)) to enable it to learn patterns and make predictions. This typically involves iterative computations and frequent data synchronization.
- Large-Scale Model Training: Refers to training models with billions or trillions of parameters, requiring massive computational resources (tens of thousands of high-tier GPUs) distributed across many interconnected machines. This scale amplifies the impact of any single failure.
- Cloud Computing: A model for delivering on-demand computing services (e.g., servers, storage, databases, networking, software, analytics, intelligence) over the Internet ("the cloud"). PaaS (Platform-as-a-Service) mode provides a complete platform for development and deployment, while IaaS (Infrastructure-as-a-Service) mode offers virtualized computing resources.
- GPUs (Graphics Processing Units): Specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images; they are also highly effective for the parallel computations required in AI training. High-tier GPUs like NVIDIA A100 and H100 are crucial for performance but also have higher failure rates.
- Collective Communication: In distributed AI model training, multiple GPUs and hosts (physical servers) must frequently exchange and aggregate data (e.g., gradients, model parameters). Collective communication libraries (CCL) like NVIDIA NCCL facilitate these synchronized operations (e.g., all-reduce, all-gather). A CCL timeout occurs when a collective communication operation fails to complete within a specified time limit, often indicating a problem with a GPU, the network, or software.
- Network Components:
  - NICs (Network Interface Cards): Hardware components that connect a computer to a computer network.
  - Switches: Network devices that connect devices in a computer network, forwarding data only to the specific devices that need to receive it.
  - PCIe (Peripheral Component Interconnect Express): A high-speed serial computer expansion bus standard used for connecting various hardware components, including GPUs and NICs, within a single host.
  - NVLink: A high-bandwidth, energy-efficient, multi-lane communication link developed by NVIDIA for GPU-to-GPU and GPU-to-CPU interconnects within a host.
  - RDMA (Remote Direct Memory Access): A technology that allows one computer to access memory in another computer without involving the operating system or CPU of the remote computer, significantly reducing latency and increasing throughput for high-performance computing.
- Fault Diagnosis: The process of identifying the root cause of a failure or abnormal behavior in a system.
- Performance Degradation: A state where a system or component operates below its expected or optimal performance level, without necessarily crashing entirely.
- Idle Time: The period during which computational resources (e.g., GPUs) are available but not actively performing useful work, often due to waiting for diagnosis or recovery from a failure.
- Z-Score: A statistical measure that describes a value's relationship to the mean of a group of values, measured in terms of standard deviations from the mean. It is used here for outlier detection:

  $$z = \frac{x - \mu}{\sigma}$$

  where $x$ is the individual data point, $\mu$ is the mean of the dataset, and $\sigma$ is the standard deviation of the dataset. A high absolute Z-score indicates that the data point is an outlier.
- ECN (Explicit Congestion Notification): A network protocol mechanism that allows end-to-end notification of network congestion without dropping packets. When a network device experiences congestion, it can mark packets with an ECN signal, informing the sending application to reduce its transmission rate. A continuously high ECN metric can indicate network congestion.
Previous Works
The paper categorizes previous work into Large-scale AI model training diagnosis systems and General abnormality diagnosis systems.
General Abnormality Diagnosis Systems
These systems, like Pingmesh [31, 32] for network latency measurement and analysis, Vigil [23], 007 [24], NetPoirot [25] for network diagnostics, Confluo [38] for high-speed network monitoring, and various host anomaly detection systems [23-25, 28, 31, 34, 35, 39, 42, 43, 46-48, 50, 52, 55, 57], are designed for traditional cloud computing scenarios. They typically localize root causes by tracing back source-destination paths through system component calls (e.g., 5-tuples, devices). The authors argue that these are not directly suitable for AI model training due to the synchronous nature and cascading failure patterns unique to model training. The existing diagnosis tools at Alibaba Cloud (Tool 1: Network monitoring and analysis, Tool 2: RDMA Pingmesh, Tool 3: Inband network diagnosis) fall into this category and have limitations in the AI training context (e.g., false positives, focus on single request/response).
Large-scale AI Model Training Diagnosis Systems
- SuperBench [60]: Developed by Microsoft, it provides a comprehensive benchmark suite for gray failure checking before cluster deployment. It includes computation/communication microbenchmarks and end-to-end model training benchmarks.
  - Limitation: It is an offline diagnosis tool and cannot localize the root causes of faults that occur during model training runtime. Running it after every failure is time-consuming (hours) and resource-intensive.
- MegaScale [36]: Developed by ByteDance, this system focuses on runtime diagnosis by monitoring CUDA events from "critical code segments" in the customer model.
  - Limitation: It requires defining "critical code segments" and potentially modifying customer code to insert monitor modules. This is impractical for public cloud service providers like Alibaba Cloud, which serve diverse customers with proprietary models and cannot demand code modifications due to confidentiality and deployment complexity.
Technological Evolution
The field has evolved from general-purpose cloud infrastructure monitoring to specialized tools for distributed AI training. Early diagnosis focused on individual host or network device health. With the rise of large-scale distributed training, the challenges shifted to inter-device synchronization and cascading failures. Solutions like SuperBench addressed pre-deployment validation, while MegaScale attempted runtime diagnosis but with intrusive methods. Aegis represents a further evolution by aiming for non-intrusive runtime diagnosis in a production cloud environment, specifically tailoring its approach to the unique characteristics of collective communication and the customer-provider relationship.
Differentiation
Aegis differentiates itself from SuperBench and MegaScale primarily through its focus on runtime diagnosis without requiring modifications to customer code, while also addressing performance degradation and pre-delivery checks.
- Compared to SuperBench: Aegis provides runtime diagnosis for failures and degradation that occur during active training, unlike SuperBench, which is a pre-deployment validation tool. While Aegis incorporates some pre-delivery checks (CBD), its core strength is in runtime problem resolution.
- Compared to MegaScale: Aegis achieves runtime diagnosis by customizing the Collective Communication Library (CCL), which is a modular component, rather than requiring invasive modifications to customer model code or training frameworks. This design choice is critical for a public cloud service provider to ensure transparency, generality, and customer privacy.
- Compared to General Diagnosis Systems: Aegis explicitly accounts for the synchronous nature of AI model training and the resulting cascading failure patterns, allowing it to localize root causes more effectively than systems designed for asynchronous general cloud computing workloads.
Methodology (Core Technology & Implementation Details)
The core idea behind Aegis is to provide a robust fault diagnosis system for AI model training services in a production cloud environment that is easy-to-deploy, transparent to customers, and effective in runtime. It addresses both task failures and performance degradation through a phased evolutionary approach and incorporates proactive checks. The underlying principles are to first exhaust host-side critical failures and then leverage training-specific information (especially from collective communication) to pinpoint root causes.
Aegis Overview
Aegis is designed to handle two primary types of abnormalities: training failure and performance degradation. Its evolution reflects a continuous effort to improve runtime diagnosis capabilities while minimizing disruption and ensuring customer privacy.
This figure (Figure 6 in the paper) is a schematic of the overall Aegis fault diagnosis architecture, covering the multi-stage diagnosis flows for task failures and performance degradation, the use of online and offline log data, and the system's evolution and failure-localization mechanisms.
Figure 6: Aegis overview.
The system's evolution is described in phases:
- Aegis Phase-1: Focused on enhancing existing diagnosis systems by integrating training-output logs and establishing a training-specific runtime diagnosis procedure. It also included a comprehensive offline diagnosis as a fallback.
- Aegis Phase-2: Aimed to improve runtime diagnosis for more sophisticated cases by customizing the Collective Communication Library (CCL) to gather training procedure-aware information.
- Performance Degradation: Developed specific mechanisms, including basic-metric correlating diagnosis and enhanced procedure-aware diagnosis.
- Check Before Delivery (CBD): A proactive "pre-online" process to check cluster health before delivery to customers.
4.1 Phase-1: Enhancing Existing Systems
This phase started by improving existing general-purpose diagnosis systems by incorporating training-specific information and procedures.
4.1.1 Basic Error Diagnosis
The initial approach involved leveraging human expertise to analyze error and warning logs from various sources (OS dmesg, training log, CCL log, NIC driver, switch syslog, customized counters). This led to the identification of critical error patterns that were then automated.
Two main challenges were addressed:
- Not all reported errors are critical: Many errors are benign. Aegis distinguishes between critical errors that intrinsically lead to task anomalies (e.g., double-bit ECC errors, GPU missing, PCIe lane drops, NVLink issues) and less critical ones. Critical errors lead to immediate host isolation.
  - CriticalError(): A function that identifies errors that are strong indicators of a specific problem requiring immediate isolation. Categories include:
    - Hardware failure: e.g., double-bit ECC error, link down, GPU/NVLink/NIC missing, fan error, power error.
    - Unrecoverable software failure: e.g., GPU/NIC driver error.
    - Unbearable performance degradation: e.g., GPU/host overheat.
- Not all critical errors point to a clear location: Some errors (e.g., connection reset by peer, CCL timeout) are distributed across many hosts, making it hard to pinpoint the root cause.
  - DistError(): A function that records and analyzes these distributed error patterns.

The refined fault diagnosis process (Algorithm 1 in Appendix A) works as follows; a hedged code sketch appears after the steps:

- Step 1: Check for Critical Errors: If any machine (host) reports critical errors, those hosts are immediately isolated and the training task is restarted.
- Step 2: Check for Small-Scale Distributed Errors: If distributed errors (LDE) are found on only one or two hosts, these hosts are isolated and the task is restarted. This is a trade-off to quickly resolve small-scale issues, even if it means isolating some normal hosts.
- Step 3: Analyze Large-Scale Distributed Errors: If distributed errors occur across multiple machines, RootDiag(LDE) is called.
  - RootDiag(): Analyzes the distributed error reports to identify whether they cluster around a specific source or destination (e.g., if a GPU is the root cause, connections from and to its host will crash first). If a faulty node N is identified, CheckError(N) is performed, N is isolated, and the task is restarted.
- Step 4: Systemic Issue Diagnosis: If RootDiag() cannot pinpoint a specific host, it suggests systemic issues such as network or configuration problems.
  - ConfigCheck(): Checks for configuration errors using a predefined checklist and scripts.
  - NetDiag(): Utilizes the existing DCN (Data Center Network) diagnosis systems (Tools 1-3 discussed in S2.2) to check network components.
  - If ConfigCheck() or NetDiag() identifies a problem, the issue is repaired (Repair(Rc, RN)).
- Step 5: Offline Diagnosis Fallback: If none of the above procedures can pinpoint the root cause, all hosts used in the training task are isolated for offline diagnosis (OfflineDiag(T)).

Lesson Learned: Exhausting host-side critical failures first is the most efficient way to diagnose. In large-scale model training, host-side issues are often misinterpreted as network issues; 71% of distributed failures initially appeared network-related but were ultimately host-side.
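The control flow above can be condensed into a short sketch. This is a hedged reconstruction of Algorithm 1 from the prose only: the callables mirror the CriticalError(), DistError(), RootDiag(), ConfigCheck(), NetDiag(), and OfflineDiag() primitives named in the paper, while their signatures and the isolate/restart/repair helpers are illustrative assumptions rather than the production API.

```python
def phase1_diagnose(task, hosts, logs,
                    critical_error, dist_error, root_diag,
                    config_check, net_diag,
                    isolate, restart, repair, offline_diag):
    """Hedged sketch of the Aegis Phase-1 runtime diagnosis flow (Algorithm 1)."""
    # Step 1: hosts reporting critical errors are isolated immediately.
    critical_hosts = [h for h in hosts if critical_error(logs[h])]
    if critical_hosts:
        isolate(critical_hosts)
        return restart(task)

    # Step 2: distributed errors confined to one or two hosts -> isolate them.
    dist_hosts = [h for h in hosts if dist_error(logs[h])]
    if 0 < len(dist_hosts) <= 2:
        isolate(dist_hosts)
        return restart(task)

    # Step 3: large-scale distributed errors -> look for a common source/destination.
    if len(dist_hosts) > 2:
        culprit = root_diag(dist_hosts, logs)   # cluster error reports by src/dst
        if culprit is not None:
            isolate([culprit])
            return restart(task)

    # Step 4: no single host stands out -> suspect systemic issues.
    config_issue = config_check(hosts)
    net_issue = net_diag(hosts)                 # existing DCN tools (Tools 1-3)
    if config_issue or net_issue:
        repair(config_issue, net_issue)
        return restart(task)

    # Step 5: fall back to offline diagnosis, isolating all hosts of the task.
    isolate(hosts)
    return offline_diag(task)
```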
4.1.2 Offline Failure Diagnosis
Offline diagnosis is a fallback mechanism for complex issues that cannot be resolved with runtime information. It involves isolating suspicious hosts and performing thorough checks.
- Parallelized Offline Failure Localization: To speed up the process, hosts undergo self-checks (stress testing CPU, GPU, PCIe, NVLink) in parallel.
  - If issues are found, the host is marked faulty.
  - If no issues are found in single-node tests, multi-host failure diagnosis is performed. This involves selecting typical models (e.g., MoE models, Multimodal models) that can reproduce the failure. The cluster is divided, and the reference model is trained on segments to pinpoint the problematic host. Normal hosts are returned to the resource pool.
- Topology-aware Parallel Localization: To prevent parallel training tasks from interfering with each other or obscuring network-related root causes, hosts are split into subsets based on their physical network topology (e.g., Pods, ToR groups). This ensures traffic from different diagnosing tasks does not compete for the same network links (see the sketch after this list).
- Handling Missing Pieces (e.g., Core/Aggregation Switch Issues): The offline diagnosis procedure was enhanced after encountering cases where Core/Aggregation switch misbehavior (e.g., silent packet loss for large packets) was missed. This led to:
  - Supplementing offline diagnosis to automatically handle such cases.
  - Enhancing RDMA Pingmesh to cover varied packet lengths in probes.
4.2 Phase-2: Procedure-aware Diagnosis
Phase-1 still required offline diagnosis for many cases, leading to GPU utilization issues. Phase-2 focuses on improving runtime diagnosis by acquiring more training-specific information.
4.2.1 What is the ideal solution?
The goal was to enhance online task monitoring to provide more precise runtime information under strict constraints:
- High confidentiality: Diagnosis must not expose customer models or data.
- Minimal customer modifications: Public cloud providers cannot force customer code changes. The solution must be transparent.
- Low overhead: New information collection should not impact training performance.
4.2.2 Customizing CCL is the bridge
The Collective Communication Library (CCL) was chosen as the ideal point for integration because:
- Modularity: In mainstream training frameworks (e.g., Megatron, DeepSpeed), CCL is an independent plugin that can be replaced without modifying customer models or frameworks.
- Boundary Role: CCL "sits at the boundary" of computation and communication, providing crucial information about both host-side processing time and network-side processing time.
Information Collection:
During training, the customized CCL records the following statistics for each communication operator C_i on each GPU G_j:

- Collective launch count (CL_{i,j}): the number of times C_i has been launched by G_j.
- Work request count (WR_{i,j}): the number of work requests of C_i launched by G_j.
- Work completion count (WC_{i,j}): the number of work requests of C_i finished by G_j.

These metrics are chosen to be sufficient and necessary while remaining lightweight and easy to deploy.
Figure 7: Customizing CCL for failure diagnosis.
Failure Localization with CCL Statistics (a minimal code sketch follows the two scenarios):

- Scenario-1: Failure in Computation (Figure 7b):
  - If a GPU G_j fails to launch a succeeding collective operation C_i (e.g., because its preceding computation failed), the other GPUs will stall at C_i and eventually time out.
  - Diagnosis: Aegis identifies G_j as the root cause if its collective launch count CL_{i,j} for C_i is lower than that of the other GPUs in the same group.
- Scenario-2: Failure in Communication (Figure 7c):
  - If a specific work request of C_i fails during transmission, all GPUs in the group will experience a CCL timeout.
  - Diagnosis: Aegis compares WR_{i,j} and WC_{i,j}. On normal GPUs, WC_{i,j} = WR_{i,j}; if WC_{i,j} falls below WR_{i,j} on some GPU G_j, that GPU is related to the root cause, and NetDiag() is then performed on the related sources and destinations.
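Taken together, the two scenarios reduce to simple comparisons over the per-operator, per-GPU counters. The sketch below assumes the CL/WR/WC counters exported by the customized CCL have already been gathered into plain dictionaries; the data layout and the return convention are illustrative only.

```python
def localize_with_ccl_counters(CL, WR, WC, op):
    """Hedged sketch of Scenario-1/2 localization for collective operation `op`.

    CL[op][gpu]: how many times `gpu` has launched `op` (collective launch count)
    WR[op][gpu]: work requests issued by `gpu` for `op`
    WC[op][gpu]: work requests completed by `gpu` for `op`
    """
    gpus = list(CL[op])

    # Scenario-1: a GPU stuck in computation never launches the next collective,
    # so its launch count lags behind the rest of its group.
    max_launches = max(CL[op][g] for g in gpus)
    lagging = [g for g in gpus if CL[op][g] < max_launches]
    if lagging:
        return ("computation", lagging)   # check the host side of these GPUs first

    # Scenario-2: a work request lost in transit leaves completions below issues
    # on the GPU(s) touching the faulty path; hand those endpoints to NetDiag().
    stuck = [g for g in gpus if WC[op][g] < WR[op][g]]
    if stuck:
        return ("communication", stuck)

    return ("unknown", [])
```

In practice the comparison would be applied to the operator on which the CCL timeout fired, with the implicated sources and destinations then fed into NetDiag().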
Limitations of Customizing CCL:
While effective and easy to deploy, CCL information alone might not fully uncover the ultimate root cause. However, its position at the computation/communication boundary is crucial for culprit localization. A practical challenge is maintaining customized CCL versions across various CUDA, driver, and CCL versions used by different customers.
5 Performance Degradation Diagnosis
Aegis also diagnoses performance degradation, which doesn't crash the training but significantly slows it down. Since offline diagnosis is not suitable here, specific runtime degradation diagnosis mechanisms are needed.
5.1 Basic Correlating Diagnosis
Similar to Phase-1, this leverages existing runtime statistics to detect performance degradation.
- Key Metric Selection: Aegis identifies single abnormal devices through two categories of metrics:
  - Abnormal operating metrics: Directly indicate abnormal conditions (e.g., Retran, the number of retransmitted packets per second, which should ideally be zero).
  - Performance metrics: Reflect execution efficiency (e.g., actual TensorFLOPS). Aegis monitors more than 20 metrics (e.g., CPU utilization, GPU utilization, temperature, PCIe utilization, bandwidth utilization, retransmission count, switch port queue length, ECN count). Simple static thresholds are insufficient.
- Cross-host Correlating Diagnosis: Leverages the synchronizing nature of training, where metrics across hosts should follow similar patterns. Aegis uses a Z-Score outlier analyzer for the different metrics (see the sketch after this list).
  - For each metric, an outlier analyzer calculates the average value ($\mu$) and standard deviation ($\sigma$) over a period (e.g., 10 minutes).
  - A host is identified as an outlier if its metric value is consistently higher than $\mu$ by more than the Z-Score threshold times $\sigma$. This simple yet effective method pinpoints abnormal nodes.
- Case study: An LLM training task experienced a 26% iteration-time increase. Aegis identified an abnormal NIC with high ECN statistics (10-30K per second) that exceeded the Z-Score threshold. The root cause was silent packet loss on a link, which forced traffic onto an overloaded NIC and caused network congestion. Isolating the host restored performance.
- Limitation: This method works when a few hosts show significantly different metrics. It struggles when multiple metrics change across all hosts, making the root cause difficult to pinpoint.
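As a concrete illustration of the cross-host correlating step, the sketch below computes per-sample fleet statistics for one metric and flags hosts that sit above the mean by more than a Z-Score threshold for most of the window. The threshold, window length, and "consistently" fraction are placeholders for the production-tuned values the paper does not publish.

```python
import statistics

def zscore_outliers(metric_by_host, threshold=3.0, min_fraction=0.8):
    """Flag hosts whose metric (e.g., ECN count, Retran) sits consistently far
    above the fleet mean; `metric_by_host` maps host -> samples over a window
    (e.g., 10 minutes). `threshold` and `min_fraction` are illustrative."""
    hosts = list(metric_by_host)
    n_samples = min(len(metric_by_host[h]) for h in hosts)

    # Training is synchronized, so all hosts should follow a similar pattern at
    # each sampling point; compute the fleet-wide mean/stddev per point.
    fleet_stats = []
    for t in range(n_samples):
        column = [metric_by_host[h][t] for h in hosts]
        mu = statistics.mean(column)
        sigma = statistics.pstdev(column) or 1e-9   # guard against zero variance
        fleet_stats.append((mu, sigma))

    outliers = []
    for h in hosts:
        hits = sum(
            1 for t in range(n_samples)
            if (metric_by_host[h][t] - fleet_stats[t][0]) / fleet_stats[t][1] > threshold
        )
        if hits >= min_fraction * n_samples:        # "consistently" higher
            outliers.append(h)
    return outliers
```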
5.2 Enhancing Procedure-aware Diagnosis
To address the limitations of basic correlating diagnosis, Aegis further customizes CCL for degradation diagnosis.
The customized CCL additionally records, for each collective operation C_i on GPU G_j in iteration I_k:

- the duration of C_i on G_j in iteration I_k;
- the average duration of C_i across all GPUs in iteration I_k;
- the network throughput over the last L work requests of C_i on G_j in iteration I_k;
- the average network throughput of C_i across all GPUs in iteration I_k.

*该图像是图表,展示了批处理故障期间的失效链接数量随时间(天)的变化趋势。从图中可以看出,失效链接数在前几天达到峰值,随后整体呈下降趋势。*
Figure 9: Customizing CCL for performance diagnosis.
- Computation Degradation (Figure 9a): If the duration of a collective operation on a specific GPU G_j is significantly shorter than the group average, it indicates that the other GPUs finished their preceding computation earlier and stalled inside the collective waiting for G_j to synchronize. This points to computation degradation on G_j (i.e., a preceding computation step on G_j was slow).
- Communication Degradation (Figure 9b): If the network throughput measured for C_i on G_j is significantly higher than the average, it suggests communication degradation, because the GPU spends longer waiting on communication. This identifies the GPU group suffering degradation, and RootDiag() principles are then applied to find the exact source/destination (a combined sketch of both checks follows below).
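A minimal sketch of how the two degradation signals could be turned into per-iteration checks, assuming the per-GPU collective durations and last-L-work-request throughputs are already exported by the customized CCL; the ratio thresholds stand in for the production-tuned parameters the paper mentions but does not publish.

```python
import statistics

def detect_degradation(durations, throughputs, dur_ratio=0.5, tput_ratio=1.5):
    """Per-GPU checks for one collective operation in one iteration.

    durations[gpu]  : time the GPU spent inside the collective
    throughputs[gpu]: network throughput over its last L work requests
    """
    avg_dur = statistics.mean(durations.values())
    avg_tput = statistics.mean(throughputs.values())

    # Computation degradation: the straggler enters the collective last, so the
    # other GPUs wait inside it and the straggler's own duration is the shortest.
    compute_suspects = [g for g, d in durations.items() if d < dur_ratio * avg_dur]

    # Communication degradation: following the paper's description, throughput
    # that deviates strongly above the group average marks the GPU group to hand
    # over to RootDiag()-style source/destination analysis.
    comm_suspects = [g for g, t in throughputs.items() if t > tput_ratio * avg_tput]

    return {"computation": compute_suspects, "communication": comm_suspects}
```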
6 Solving Problems Before Delivery
The observation that 73% of training tasks failed within the first 10 minutes (Figure 10) indicated issues present before training even properly started. This led to the Check Before Delivery (CBD) procedure.
Figure 10: Durations of training tasks in production.
Motivations for CBD:
- Frequent component updates: Regular updates to training frameworks, CCL, container networks, NIC drivers, and switches can introduce new failures.
- Post-usage failures: Hosts might develop faults after their last use but before being reallocated, leading to failures in new tasks. Diagnosing these is hard once a host is delivered to a customer.
CBD Procedure:
CBD is performed right before resources are handed over to customers. It ensures the environment is fully set up, catching issues that earlier physical host checks might miss (e.g., container network misconfigurations).
CBD needs to be efficient to avoid impacting user experience. The procedure is summarized in Table 1:
| Phase | Tasks | Time |
| --- | --- | --- |
| Configuration check in parallel | Host configuration check | <1 min |
| | GPU configuration check | |
| | NIC configuration check | |
| Single-host test in parallel | GPU kernels test | 3 min |
| | NVLink test | |
| | HBM test | |
| | PCIe test | |
| | CPU execution test | |
| | Dataset/Model/Checkpoint load test | |
| Multi-hosts test in parallel | Collective communication test | 6 min |
| | Comput./Comm. overlap test | |
Table 1: CBD task list (Manual transcription from the paper.)
- The total CBD execution takes less than 10 minutes (a staged parallel-runner sketch follows below).
- A lightweight CBD version (under 1 minute) is available for PaaS mode, focusing on parallelized configuration checks and critical quick local host tests.
- If a significant number of machines fail CBD, recent updates are rolled back. CBD has intercepted 1% of hosts as problematic before delivery, preventing training task failures. It is a mandatory procedure.
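The three CBD phases map naturally onto a staged runner in which the checks of each phase execute concurrently and any failing host is withheld from delivery. The check names follow Table 1, but the runner itself and the run_check callable are illustrative assumptions, not the production implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Phase -> (time budget from Table 1, checks); budgets are shown for reference.
CBD_PHASES = [
    ("configuration check", "<1 min", ["host_config", "gpu_config", "nic_config"]),
    ("single-host test",    "3 min",  ["gpu_kernels", "nvlink", "hbm", "pcie",
                                       "cpu_exec", "dataset_model_ckpt_load"]),
    ("multi-host test",     "6 min",  ["collective_comm", "comp_comm_overlap"]),
]

def run_cbd(hosts, run_check):
    """Run Check Before Delivery on `hosts`; `run_check(check, hosts)` is a
    hypothetical callable returning the set of hosts that failed that check."""
    failed = set()
    for phase, budget, checks in CBD_PHASES:
        healthy = [h for h in hosts if h not in failed]
        if not healthy:
            break
        # Checks within a phase run in parallel, as in Table 1.
        with ThreadPoolExecutor(max_workers=len(checks)) as pool:
            for bad_hosts in pool.map(lambda c: run_check(c, healthy), checks):
                failed |= set(bad_hosts)
    return failed  # hosts withheld from delivery; a spike here triggers a rollback
```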
Experimental Setup
The evaluation of Aegis was conducted in a real-world production environment at Alibaba Cloud.
Context
- Deployment Environment: Aegis has been online since September 2023, serving dozens of large-scale training clusters for over a year.
- Target Workload: Statistics are drawn from an inner model training team at Alibaba Cloud, which trains one of the top-tier LLMs (Large Language Models) globally. This provides a realistic and demanding testbed.
- Scale of Operations: The training task scale increased by more than 40x during the 16 months of observation.
Evaluation Metrics
The effectiveness of Aegis is quantified using several key metrics:
- Idle Time Wasted by Diagnosis:
  - Conceptual Definition: This metric measures the total duration, typically in hours, during which computational resources (primarily GPUs) are available but remain unused because the system is undergoing fault diagnosis or waiting for resolution of a detected issue. High idle time signifies inefficient resource utilization and increased operational costs.
  - Mathematical Formula & Symbol Explanation: Not explicitly given; implicitly, it is the cumulative time the cluster's GPUs spend not performing useful work due to diagnosis, summed over all GPUs and over the entire observation period.
- Training Task Restart Count:
  - Conceptual Definition: This metric quantifies the number of times a training task (representing a model training job) has to be terminated and re-initiated from a previous checkpoint or from scratch due to failures. A high restart count indicates poor system reliability and significant delays in model development and deployment.
  - Mathematical Formula & Symbol Explanation: Not explicitly given; it is simply a count.
- Performance Degradation Percentage:
  - Conceptual Definition: This metric quantifies the extent to which training tasks operate below their expected performance level. It measures the excess time spent in training iterations beyond a defined standard iteration time, relative to the total training time. Lower performance degradation indicates more efficient and stable training.
  - Mathematical Formula (reconstructed from the symbol descriptions):

    $$P_{\text{degradation}} = \frac{\sum_{i:\, t_i > t_{\text{std}}} (t_i - t_{\text{std}})}{\sum_i t_i}$$

  - Symbol Explanation:
    - $t_i$: the measured iteration time for the $i$-th training iteration, obtained from the model training log.
    - $t_{\text{std}}$: the standard iteration time, calculated as $t_{\text{std}} = 1.2 \times t_{\text{avg}}$.
    - $t_{\text{avg}}$: the average iteration time (presumably of a stable period or expected baseline). The paper defines $t_{\text{std}}$ relative to $t_{\text{avg}}$ with a 20% slack (1.2x) to account for normal variations.
    - The numerator sums only over iterations whose measured iteration time exceeds the standard iteration time, capturing just the degraded portions; the denominator sums all measured iteration times for the entire training task (a small computation sketch follows this list).
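Under the reconstruction above, the metric can be computed directly from the per-iteration times in the training log. The helper below is a sketch: the symbol names and the optional baseline argument are ours, while the 1.2x slack follows the paper.

```python
def degradation_percentage(iteration_times, baseline_avg=None, slack=1.2):
    """Excess time spent above the standard iteration time, relative to the
    total training time; `iteration_times` come from the model training log."""
    if baseline_avg is None:
        # Ideally the baseline comes from a known-stable period of the task.
        baseline_avg = sum(iteration_times) / len(iteration_times)
    t_std = slack * baseline_avg          # standard iteration time (20% slack)
    excess = sum(t - t_std for t in iteration_times if t > t_std)
    return excess / sum(iteration_times)

# Example: a handful of slow iterations among 100 yields a small percentage.
# degradation_percentage([10.0] * 97 + [15.0, 16.0, 18.0], baseline_avg=10.0)
```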
Baselines
The paper's evaluation focuses on measuring the impact of Aegis's deployment over time, rather than comparing it against specific existing research systems in a controlled benchmark. The baselines are essentially the "before Aegis" or "previous state of the production system" at Alibaba Cloud.
- Implicit Baseline 1: The state of the production cluster prior to the deployment of Aegis Phase-1.
- Implicit Baseline 2: The state of the production cluster after Aegis Phase-1 but before Aegis Phase-2 and CBD, demonstrating the incremental improvements.

This approach is typical for papers describing real-world production system deployments, where the primary goal is to show the practical impact and value added by the new system within their specific operational context.
Results & Analysis
The results demonstrate the significant positive impact of Aegis's phased deployment on training stability and efficiency in a production LLM training environment.
Core Results
The quantitative improvements achieved by Aegis are substantial:
- Idle Time Reduction: More than 97% decrease in idle time wasted by diagnosis.
- Training Task Restart Reduction: 84% decrease in training task restart count.
- Performance Degradation Reduction: 71% decrease in performance degradation.

These figures highlight Aegis's effectiveness in increasing GPU utilization, reducing operational costs, and improving the overall developer experience for AI model training.
Data Presentation (Tables)
Figure 1: Failures in a representative cluster.
This figure is a bar chart showing the distribution of hardware, software, and network failures over ten consecutive weeks in a representative cluster; hardware failures account for the largest share, while software and network failures are comparatively rare.
Figure 1: Failures in a representative cluster.
Analysis: This bar chart illustrates the frequency of hardware, software, and network failures over ten weeks in a production cluster. Hardware failures consistently dominate, with 100-230 critical failures per week. This underscores the primary challenge Aegis was built to address: the inherent unreliability of high-performance hardware in large-scale AI training environments and the need for robust diagnosis.
Figure 2: Types of failures encountered in production.
Analysis: This pie chart details the breakdown of failure types. GPU-related reasons (e.g., execution error, driver error, memory error, CUDA error, NVLINK error, ECC error) account for a significant 45.6% of all failures. NVLINK failures are 9.2% and PCIe errors are 10.4%. This further emphasizes the dominance of host-side hardware issues and the complexity introduced by specialized intra-host interconnects. It justifies Aegis's Phase-1 strategy of prioritizing host-side critical failures.
Figure 10: Durations of training tasks in production.
Analysis: This bar chart shows that 73% of training tasks failed within the first 10 minutes. Given that the initialization phase typically takes 5-20 minutes, this indicates a high rate of failures occurring before actual productive training begins. This finding directly motivated the development and deployment of Check Before Delivery (CBD) to proactively catch issues.
Figure 11: Evolution of idle time in production.
Analysis: This figure shows the monthly accumulated idle time (bars) and the training task scale (line) from May 2023 to August 2024.
- Pre-Aegis Phase-1: High idle time is evident before September 2023.
- Aegis Phase-1 (Online in Sep 2023): In September and October 2023, despite a doubling of training scale, idle time decreased by 71% in the first month.
- Scaling Impact (Nov 2023): A 4x boost in training scale in November led to a temporary increase in idle time, highlighting how scaling introduces new corner-case issues.
- Aegis Phase-2 (Online in Jun 2024): The deployment of Phase-2 resulted in a further 91% reduction in idle time in July and August 2024, attributed to more failures being resolved by runtime diagnosis without resorting to offline diagnosis.
Figure 12: Evolution of restart counts in production.
Analysis: This chart tracks training task restart counts over the same period.
- Pre-CBD: High restart counts are observed in November 2023, coinciding with the training-scale increase and indicating numerous initialization-phase failures.
- CBD Deployment (Dec 2023): CBD was deployed in December 2023, leading to a 44.8% decrease in restart counts in the following month. Continuous optimization of CBD eventually reduced restart counts by 84.6%.
- August 2024 Increase: An increase in restarts in August 2024 is attributed to planned experiments and fine-tuning activities by the model training team, rather than systemic failures. This indicates that CBD is effective in mitigating unplanned restarts.
Figure 13: Runtime diagnosis percentage.
Analysis: This figure shows the percentage of failure cases diagnosed at runtime (vs. offline).
- Pre-Aegis Phase-2: Before June 2024, the runtime diagnosis percentage was lower, indicating a reliance on offline diagnosis for many cases.
- Aegis Phase-2 (Online in Jun 2024): With the Phase-2 deployment, the runtime diagnosis percentage gradually converges to nearly 100%. This is a critical achievement: most training task failures can be automatically recovered without human interference or resource-intensive offline procedures, drastically improving GPU utilization and user experience.
Figure 14: Performance degradation percentage.
Analysis: This bar chart displays the performance degradation percentage over time.
- Pre-Degradation Diagnosis: Before June 2024, performance degradation values were consistently higher.
- Degradation Diagnosis Deployment (June 2024): With the deployment of the performance degradation diagnosis features in June 2024, the performance degradation percentage decreased by 71% in the subsequent months. This highlights the effectiveness of both basic correlating diagnosis and enhanced procedure-aware diagnosis in identifying and mitigating performance bottlenecks.
Figure 15: Number of failed links during the batch failure.
Analysis: This line graph illustrates a specific experience regarding batch link failures caused by contamination during data center construction. It shows a swift increase in failed links when new machines are delivered, followed by a gradual decrease as cleaning efforts progress. This case study from the experience section demonstrates a practical challenge encountered and how Aegis (or the operational response informed by Aegis's insights) helped manage and eventually resolve a large-scale hardware reliability issue.
Ablations / Parameter Sensitivity
The paper implicitly conducts an ablation study through its phased evolutionary approach:
- Impact of Phase-1: Demonstrated by the initial 71% reduction in idle time after its deployment, showing the value of enhancing existing systems with training-specific logs and host-side critical error diagnosis.
- Impact of Phase-2: A further 91% saving in idle time (on top of Phase-1) and a near-100% runtime diagnosis percentage after its deployment, highlighting the crucial contribution of CCL customization for procedure-aware diagnosis.
- Impact of CBD: A 44.8% initial and 84.6% overall decrease in restart counts after CBD deployment, showing the benefits of proactive checks before resource delivery.

Each phase builds upon the previous one, effectively demonstrating the incremental value of each Aegis component. The Z-Score threshold for outlier detection and the CCL degradation parameters are mentioned as practical values determined in production.
Conclusion & Personal Thoughts
Conclusion Summary
This paper rigorously details the Evolution of Aegis, a fault diagnosis system specifically designed for AI model training services in a production cloud environment at Alibaba Cloud. Driven by the unique challenges of large-scale distributed AI training, Aegis evolved through strategic phases. Phase-1 enhanced existing diagnosis systems with training-specific error detection and an offline diagnosis fallback. Phase-2 significantly improved runtime diagnosis capabilities by customizing the Collective Communication Library (CCL) to gain procedure-aware insights without modifying customer code. Complementing these, Aegis also developed modules for performance degradation diagnosis and implemented Check Before Delivery (CBD) for proactive issue detection. The system has achieved remarkable results in production: decreasing idle time by 97%, reducing training task restarts by 84%, and mitigating performance degradation by 71%. These outcomes underscore Aegis's critical role in enhancing the reliability, efficiency, and user experience of AI model training in the cloud.
Limitations & Future Work
The authors candidly discuss several limitations and areas for future work:
- CCL Customization Depth: While CCL customization is easy to deploy and effective for culprit localization, it might not always provide sufficient information to determine the ultimate root cause of failures, especially complex ones that intertwine multiple system layers.
- CCL Version Management: The need to maintain customized CCL versions for the various CUDA, driver, and CCL versions across customer environments poses an ongoing maintenance challenge.
- Reboot or Repair Determination: Accurately and efficiently determining whether a faulty host requires a simple reboot or a full hardware repair remains an open issue. Their current SOP (Standard Operating Procedure), which includes stress tests and a three-strikes-and-repair rule, is practical but still leaves room for refinement.
- Complex Failure Cases: Despite Aegis's success, corner cases (e.g., highly specific congestion control bugs, silent packet loss with unusual characteristics) still emerge, requiring continuous adaptation and enhancement of the system.
- Customer Privacy vs. Diagnostic Depth: The inherent tension between the desire for deeper diagnostic insights (which might require more intrusive data collection) and the need to protect customer privacy is a constant design constraint.
Personal Insights & Critique
This paper offers a highly valuable contribution by presenting a practical, battle-tested diagnosis system from a major cloud provider. Its strengths lie in:
- Production Focus and Pragmatism: The paper doesn't just propose a theoretical solution; it describes a system that has been continuously evolved and deployed in a large-scale production environment under real-world constraints. This makes the findings and lessons learned extremely relevant.
- Phased Evolutionary Approach: The iterative development of Aegis (Phase-1, Phase-2, CBD) is a realistic reflection of how complex systems are built and improved. It provides a clear ablation study by showing the incremental benefits of each stage.
- Non-Intrusive Design Principle: The strong commitment to diagnosing faults without modifying customer code is a critical design choice for a public cloud service provider. The innovative use of CCL customization as a bridge is particularly clever, balancing diagnostic power with customer privacy and ease of deployment.
- Comprehensive Problem Coverage: Addressing task failures, performance degradation, and pre-delivery checks covers the full lifecycle of reliability challenges in AI training.
- Transparent Lessons Learned: The "Experience and Lessons" section provides invaluable insights into the practical difficulties of operating such a service, including hardware reliability, heterogeneous environments, configuration management, and network issues. These are often overlooked in academic papers but are crucial for real-world impact.

One potential area for deeper exploration (though perhaps beyond the scope of a single paper) could be:

- Root Cause Analysis Automation: While Aegis excels at culprit localization, the paper notes that root cause analysis is often done offline after isolation. Future work could explore more automated, AI-driven root cause analysis techniques that leverage the rich diagnostic data collected by Aegis to provide deeper, actionable insights for system engineers.
- Generalizability of CCL Customization: While the CCL customization is effective, its direct transferability to extremely niche or highly specialized collective communication patterns not covered by mainstream frameworks might require further investigation.

Overall, Aegis is an impressive feat of system engineering that tackles a complex, growing problem with an elegant and practical solution. Its success story is a testament to the power of iterative design, real-world data, and a deep understanding of operational challenges.