
Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Aegis, a fault diagnosis system for AI model training services, uses distributed data analysis to localize failures, reducing idle time and restarts while improving performance in production environments.

Abstract

Despite the success of diagnosis systems in traditional cloud computing, these systems are not suitable for pinpointing faults in AI model training cloud scenarios due to the differences in computing paradigms between traditional cloud computing and model training. As one of the largest cloud providers, we present Aegis, a fault diagnosis system specifically designed for AI model training service. We share our experience in the motivation, design, and evolution of Aegis. Keeping easy-to-deploy as the primary principle, Aegis Phase-1 started by enhancing existing general-purpose diagnosis systems. After several months of evolution, Aegis Phase-2 cogitatively customized the collective communication library for sophisticated failure localization in runtime without modifying customer code. Besides the failure localization, we further equipped Aegis with the capabilities on handling performance degradation and failure checking before delivery. Aegis has been deployed in our production training cloud service for one year. Aegis decreases more than 97% of the idle time wasted by diagnosis, 84% of the training task restart count, and 71% of the performance degradation.

In-depth Reading

Bibliographic Information

  • Title: Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production
  • Authors: Jianbo Dong, Kun Qian, Pengcheng Zhang, Zhilong Zheng, Liang Chen, Fei Feng, Yichi Xu, Yikai Zhu, Gang Lu, Xue Li, Zhihui Ren, Zhicheng Wang, Bin Luo, Peng Zhang, Yang Liu, Yanqing Chen, Yu Guan, Weicheng Wang, Chaojie Yang, Yang Zhang, Man Yuan, Hanyu Zhao, Yong Li, Zihan Zhao, Shan Li, Xianlong Zeng, Zhiping Yao, Binzhang Fu, Ennan Zhai, Wei Lin, Chao Wang, Dennis Cai
  • Affiliations: Alibaba Cloud
  • Journal/Conference: The "nsdi25fall" token in the original source link suggests the paper targets NSDI (USENIX Symposium on Networked Systems Design and Implementation), a top-tier venue for systems research.
  • Publication Year: 2024 (based on internal references to 2024 events and the deployment of Aegis Phase-2 in June 2024).
  • Abstract: Despite the advancements in diagnosis systems for traditional cloud computing, these are often inadequate for the unique challenges presented by AI model training cloud scenarios due to fundamental differences in computing paradigms. This paper introduces Aegis, a fault diagnosis system specifically engineered for AI model training services, sharing insights into its motivation, design, and evolutionary phases. Starting with Aegis Phase-1, which prioritized easy-to-deploy principles by enhancing existing general-purpose diagnosis systems, the system evolved into Aegis Phase-2. The latter cognitively customized the collective communication library (CCL) for sophisticated runtime failure localization without requiring modifications to customer code. Beyond failure localization, Aegis was further equipped to handle performance degradation and implement failure checking before delivery (CBD). Deployed in a production training cloud service for one year, Aegis has demonstrated significant improvements: decreasing idle time wasted by diagnosis by over 97%, reducing training task restart counts by 84%, and mitigating performance degradation by 71%.
  • Original Source Link: https://ennanzhai.github.io/pub/nsdi25fall-aegis.pdf (Clarified as a pre-print or technical report link, likely for an upcoming conference.)

Executive Summary

Background & Motivation (Why)

The proliferation of large-scale AI model training has introduced new and complex challenges for cloud service providers. Traditional fault diagnosis systems, effective in general cloud computing, are often ill-suited for these AI model training environments. This is due to fundamental differences in computing paradigms, particularly the reliance on tightly synchronized collective communication among thousands of GPUs.

The core problem is the difficulty in pinpointing faults and performance degradation efficiently and accurately in production-scale AI model training clusters. This problem is crucial because:

  1. High Failure Rates: High-tier GPUs and high-speed network components used in AI training exhibit significantly higher failure rates than traditional hardware.

  2. Cascading Failures: Due to the synchronous nature of model training, a single-point failure can rapidly propagate, leading to cascading failures across the entire cluster.

  3. Ambiguous Error Indicators: Failures often manifest as generic CCL timeouts or widespread error reports, obscuring the true root cause amidst a flood of secondary errors.

  4. Inefficiency of Existing Tools: Prior diagnosis systems either require customer code modification (which is impractical for public cloud providers due to privacy and deployment challenges) or are designed for pre-deployment checks (offline diagnosis), leading to significant idle time and resource wastage during runtime failures.

    The paper's novel approach is Aegis, a fault diagnosis system specifically designed for AI model training services in a production cloud environment. It aims to provide runtime diagnosis without modifying customer code and to evolve its capabilities based on real-world operational experience.

Main Contributions / Findings (What)

The paper presents the design, evolution, and evaluation of Aegis, highlighting several key contributions:

  1. Phased Evolution of Diagnosis: Aegis evolved through two main phases:
    • Phase-1: Enhanced existing general-purpose diagnosis systems by incorporating training-output logs and developing a training-specific runtime diagnosis procedure (using CriticalError() and DistError()), backed by a comprehensive offline diagnosis mechanism for complex cases. This phase addressed host-side critical failures first, recognizing their frequent misinterpretation as network issues.
    • Phase-2: Introduced procedure-aware diagnosis by customizing the Collective Communication Library (CCL). This allowed for sophisticated runtime failure localization (distinguishing between computation and communication failures) without modifying customer code, addressing the limitations of offline diagnosis and improving GPU utilization.
  2. Performance Degradation Diagnosis: Aegis includes capabilities to identify and diagnose performance degradation through basic correlating diagnosis (using Z-Score outlier analysis on various metrics) and enhanced procedure-aware diagnosis (leveraging customized CCL metrics like duration and network throughput).
  3. Proactive Failure Prevention (Check Before Delivery - CBD): A Check Before Delivery (CBD) procedure was implemented to efficiently check hosts for potential issues before they are delivered to customers, significantly reducing task failures during the initialization phase and preventing unnecessary retries.
  4. Significant Production Impact: Aegis has been deployed in a production LLM training cloud service at Alibaba Cloud for over a year, demonstrating:
    • Over 97% reduction in idle time wasted by diagnosis.
    • 84% decrease in training task restart counts.
    • 71% reduction in performance degradation.
  5. Operational Insights and Lessons Learned: The paper shares valuable real-world experience from operating large-scale AI training clusters, including challenges with batch link failures, heterogeneous devices and multiple sales modes, congestion control issues, and the paramount importance of customer privacy.

Prerequisite Knowledge & Related Work

Foundational Concepts

To understand the paper, several core concepts related to AI model training, cloud computing, and network systems are essential:

  • AI Model Training: The process of feeding data to an AI model (e.g., a large language model (LLM)) to enable it to learn patterns and make predictions. This typically involves iterative computations and frequent data synchronization.
  • Large-Scale Model Training: Refers to training models with billions or trillions of parameters, requiring massive computational resources (tens of thousands of high-tier GPUs) distributed across many interconnected machines. This scale amplifies the impact of any single failure.
  • Cloud Computing: A model for delivering on-demand computing services (e.g., servers, storage, databases, networking, software, analytics, intelligence) over the Internet ("the cloud"). PaaS (Platform-as-a-Service) mode provides a complete platform for development and deployment, while IaaS (Infrastructure-as-a-Service) mode offers virtualized computing resources.
  • GPUs (Graphics Processing Units): Specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images, but are also highly effective for parallel computations required in AI training. High-tier GPUs like NVIDIA A100 and H100 are crucial for performance but also have higher failure rates.
  • Collective Communication: In distributed AI model training, multiple GPUs and hosts (physical servers) must frequently exchange and aggregate data (e.g., gradients, model parameters). Collective communication libraries (CCL) like NVIDIA NCCL facilitate these synchronized operations (e.g., all-reduce, all-gather). A CCL timeout occurs when a collective communication operation fails to complete within a specified time limit, often indicating a problem with a GPU, network, or software.
  • Network Components:
    • NICs (Network Interface Cards): Hardware components that connect a computer to a computer network.
    • Switches: Network devices that connect devices in a computer network, forwarding data only to specific devices that need to receive it.
    • PCIe (Peripheral Component Interconnect Express): A high-speed serial computer expansion bus standard used for connecting various hardware components, including GPUs and NICs, within a single host.
    • NVLink: A high-bandwidth, energy-efficient, multi-lane communication link developed by NVIDIA for GPU-to-GPU and GPU-to-CPU interconnects within a host.
    • RDMA (Remote Direct Memory Access): A technology that allows one computer to access memory in another computer without involving the operating system or CPU of the remote computer, significantly reducing latency and increasing throughput for high-performance computing.
  • Fault Diagnosis: The process of identifying the root cause of a failure or abnormal behavior in a system.
  • Performance Degradation: A state where a system or component operates below its expected or optimal performance level, without necessarily crashing entirely.
  • Idle Time: The period during which computational resources (e.g., GPUs) are available but not actively performing useful work, often due to waiting for diagnosis or recovery from a failure.
  • Z-Score: A statistical measure that describes a value's relationship to the mean of a group of values, expressed in standard deviations from the mean. It is used here for outlier detection: $Z = \frac{x - \mu}{\sigma}$, where:
    • $x$: the individual data point.
    • $\mu$: the mean of the dataset.
    • $\sigma$: the standard deviation of the dataset. A high absolute Z-score indicates that the data point is an outlier.
  • ECN (Explicit Congestion Notification): A network protocol mechanism that allows end-to-end notification of network congestion without dropping packets. When a network device experiences congestion, it can mark packets with an ECN signal, informing the sending application to reduce its transmission rate. A continuously high ECN metric can indicate network congestion.

Previous Works

The paper categorizes previous work into Large-scale AI model training diagnosis systems and General abnormality diagnosis systems.

General Abnormality Diagnosis Systems

These systems, like Pingmesh [31, 32] for network latency measurement and analysis, Vigil [23], 007 [24], NetPoirot [25] for network diagnostics, Confluo [38] for high-speed network monitoring, and various host anomaly detection systems [23-25, 28, 31, 34, 35, 39, 42, 43, 46-48, 50, 52, 55, 57], are designed for traditional cloud computing scenarios. They typically localize root causes by tracing back source-destination paths through system component calls (e.g., 5-tuples, devices). The authors argue that these are not directly suitable for AI model training due to the synchronous nature and cascading failure patterns unique to model training. The existing diagnosis tools at Alibaba Cloud (Tool 1: Network monitoring and analysis, Tool 2: RDMA Pingmesh, Tool 3: Inband network diagnosis) fall into this category and have limitations in the AI training context (e.g., false positives, focus on single request/response).

Large-scale AI Model Training Diagnosis Systems

  • SuperBench [60]: Developed by Microsoft, it provides a comprehensive benchmark suite for gray failure checking before cluster deployment. It includes computation/communication microbenchmarks and end-to-end model training benchmarks.
    • Limitation: It's an offline diagnosis tool. It cannot localize root causes of faults occurring during model training runtime. Running it after every failure is time-consuming (hours) and resource-intensive.
  • MegaScale [36]: Developed by ByteDance, this system focuses on runtime diagnosis by monitoring CUDA events from "critical code segments" in the customer model.
    • Limitation: It requires defining "critical code segments" and potentially modifying customer code to insert monitor modules. This is impractical for public cloud service providers like Alibaba Cloud, who serve diverse customers with proprietary models and cannot demand code modifications due to confidentiality and deployment complexity.

Technological Evolution

The field has evolved from general-purpose cloud infrastructure monitoring to specialized tools for distributed AI training. Early diagnosis focused on individual host or network device health. With the rise of large-scale distributed training, the challenges shifted to inter-device synchronization and cascading failures. Solutions like SuperBench addressed pre-deployment validation, while MegaScale attempted runtime diagnosis but with intrusive methods. Aegis represents a further evolution by aiming for non-intrusive runtime diagnosis in a production cloud environment, specifically tailoring its approach to the unique characteristics of collective communication and the customer-provider relationship.

Differentiation

Aegis differentiates itself from SuperBench and MegaScale primarily through its focus on runtime diagnosis without requiring modifications to customer code, while also addressing performance degradation and pre-delivery checks.

  • Compared to SuperBench: Aegis provides runtime diagnosis for failures and degradation that occur during active training, unlike SuperBench which is a pre-deployment validation tool. While Aegis incorporates some pre-delivery checks (CBD), its core strength is in runtime problem resolution.
  • Compared to MegaScale: Aegis achieves runtime diagnosis by customizing the Collective Communication Library (CCL), which is a modular component, rather than requiring invasive modifications to customer model code or training frameworks. This design choice is critical for a public cloud service provider to ensure transparency, generality, and customer privacy.
  • Compared to General Diagnosis Systems: Aegis explicitly accounts for the synchronous nature of AI model training and the resulting cascading failure patterns, allowing it to localize root causes more effectively than systems designed for asynchronous general cloud computing workloads.

Methodology (Core Technology & Implementation Details)

The core idea behind Aegis is to provide a robust fault diagnosis system for AI model training services in a production cloud environment that is easy-to-deploy, transparent to customers, and effective in runtime. It addresses both task failures and performance degradation through a phased evolutionary approach and incorporates proactive checks. The underlying principles are to first exhaust host-side critical failures and then leverage training-specific information (especially from collective communication) to pinpoint root causes.

Aegis Overview

Aegis is designed to handle two primary types of abnormalities: training failure and performance degradation. Its evolution reflects a continuous effort to improve runtime diagnosis capabilities while minimizing disruption and ensuring customer privacy.

Figure 6: Aegis overview. The diagram shows the overall architecture of the Aegis fault diagnosis system, covering the multi-stage diagnosis flows for task failures and performance degradation, the use of online and offline log data, and the system's evolution and fault-localization mechanisms.

The system's evolution is described in phases:

  • Aegis Phase-1: Focused on enhancing existing diagnosis systems by integrating training-output logs and establishing a training-specific runtime diagnosis procedure. It also included a comprehensive offline diagnosis as a fallback.
  • Aegis Phase-2: Aimed to improve runtime diagnosis for more sophisticated cases by customizing the Collective Communication Library (CCL) to gather training procedure-aware information.
  • Performance Degradation: Developed specific mechanisms including basic-metric correlating diagnosis and enhanced procedure-aware diagnosis.
  • Check Before Delivery (CBD): A proactive "pre-online" process to check cluster health before delivery to customers.

4.1 Phase-1: Enhancing Existing Systems

This phase started by improving existing general-purpose diagnosis systems by incorporating training-specific information and procedures.

4.1.1 Basic Error Diagnosis

The initial approach involved leveraging human expertise to analyze error and warning logs from various sources (OS dmesg, training log, CCL log, NIC driver, switch syslog, customized counters). This led to the identification of critical error patterns that were then automated.

Two main challenges were addressed:

  1. Not all reported errors are critical: Many errors are benign. Aegis distinguishes between critical errors that intrinsically lead to task anomalies (e.g., double-bit ECC errors, GPU missing, PCIe lane drops, NVLink issues) and less critical ones. Critical errors lead to immediate host isolation.
    • CriticalError(): A function that identifies errors that are strong indicators of a specific problem requiring immediate isolation. Categories include:
      • Hardware failure: e.g., double-bit ECC error, link down, GPU/NVLink/NIC missing, fan error, power error.
      • Unrecoverable software failure: e.g., GPU/NIC driver error.
      • Unbearable performance degradation: e.g., GPU/host overheat.
  2. Not all critical errors point to a clear location: Some errors (e.g., connection reset by peer, CCL timeout) are distributed across many hosts, making it hard to pinpoint the root cause.
    • DistError(): A function that records and analyzes these distributed error patterns.

      The refined fault diagnosis process (Algorithm 1 in Appendix A) works as follows:

  • Step 1: Check for Critical Errors: If any machine (host) reports critical errors ($L_{CE}.size() > 0$), those hosts are immediately isolated, and the training task is restarted.

  • Step 2: Check for Small-Scale Distributed Errors: If distributed errors ($L_{DE}$) are found on only one or two hosts ($L_{DE}.size() \le 2$), these hosts are isolated and the task is restarted. This is a trade-off to quickly resolve small-scale issues, even if it means isolating some normal hosts.

  • Step 3: Analyze Large-Scale Distributed Errors: If distributed errors occur across multiple machines, RootDiag(LDE) is called.

    • RootDiag(): This function analyzes the distributed error reports to identify whether they cluster around a specific source or destination (e.g., if a GPU $G_j$ is the root cause, connections from and to $G_j$ will crash first). If a faulty node $N$ is identified, CheckError(N) is performed, $N$ is isolated, and the task is restarted.
  • Step 4: Systemic Issue Diagnosis: If RootDiag() cannot pinpoint a specific host, it suggests systemic issues like network or configuration problems.

    • ConfigCheck(): Checks for configuration errors using a predefined checklist and scripts.
    • NetDiag(): Utilizes existing DCN (Data Center Network) diagnosis systems (Tools 1-3 discussed in S2.2) to check network components.
    • If ConfigCheck() or NetDiag() identifies a problem ($R_C \neq NULL$ or $R_N \neq NULL$), the issue is repaired via Repair($R_C$, $R_N$).
  • Step 5: Offline Diagnosis Fallback: If none of the above procedures can pinpoint the root cause, all hosts used in the training task are isolated for offline diagnosis (OfflineDiag(T)).

    Lesson Learned: Exhausting host-side critical failures first is the most efficient way to diagnose. In large-scale model training, host-side issues are often misinterpreted as network issues. 71% of distributed failures initially appeared network-related but were ultimately host-side.
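For concreteness, the runtime flow above can be condensed into a short sketch. This is a minimal illustration of the five-step logic described in this section, assuming the error sets and the RootDiag/ConfigCheck/NetDiag results are supplied by the caller; the function and parameter names are hypothetical, not the authors' implementation.

```python
from typing import Callable, Optional, Sequence

def diagnose(
    critical_hosts: Sequence[str],                        # hosts matching CriticalError() patterns
    dist_error_hosts: Sequence[str],                       # hosts matching DistError() patterns
    root_diag: Callable[[Sequence[str]], Optional[str]],   # RootDiag(): do errors cluster around one node?
    systemic_issue_found: Callable[[], bool],              # ConfigCheck()/NetDiag() combined, for brevity
) -> str:
    """Return the action the Phase-1 runtime flow would take (simplified sketch)."""
    if critical_hosts:                                     # Step 1: isolate and restart immediately
        return f"isolate {list(critical_hosts)}, restart task"
    if 0 < len(dist_error_hosts) <= 2:                     # Step 2: small-scale distributed errors
        return f"isolate {list(dist_error_hosts)}, restart task"
    if dist_error_hosts:                                   # Step 3: look for a common culprit node
        culprit = root_diag(dist_error_hosts)
        if culprit is not None:
            return f"isolate {culprit}, restart task"
    if systemic_issue_found():                             # Step 4: systemic config/network problem
        return "repair configuration/network issue, restart task"
    return "offline diagnosis of all task hosts"           # Step 5: fallback

# Example: distributed errors on many hosts that all cluster around "h17".
print(diagnose([], ["h03", "h17", "h21", "h40"],
               root_diag=lambda hosts: "h17",
               systemic_issue_found=lambda: False))
```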

4.1.2 Offline Failure Diagnosis

Offline diagnosis is a fallback mechanism for complex issues that cannot be resolved with runtime information. It involves isolating suspicious hosts and performing thorough checks.

  • Parallelized Offline Failure Localization: To speed up the process, hosts undergo self-checks (stress testing CPU, GPU, PCIe, NVLink) in parallel.
    • If issues are found, the host is marked faulty.
    • If no issues in single-node tests, multi-host failure diagnosis is performed. This involves selecting typical models (e.g., MoE models, Multimodal models) that can reproduce the failure. The cluster is divided, and the reference model is trained on segments to pinpoint the problematic host. Normal hosts are returned to the resource pool.
  • Topology-aware Parallel Localization: To prevent parallel training tasks from interfering with one another or obscuring network-related root causes, hosts are split into subsets based on their physical network topology (e.g., Pods, ToR groups). This ensures traffic from different diagnosing tasks does not compete for the same network links (a grouping sketch follows this list).
  • Handling Missing Pieces (e.g., Core/Aggregation Switch Issues): The offline diagnosis procedure was enhanced after encountering cases where Core/Aggregation switch misbehavior (e.g., silent packet loss for large packets) was missed. This led to:
    • Supplementing offline diagnosis to automatically handle such cases.
    • Enhancing RDMA Pingmesh to cover varied packet lengths in probes.
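As a rough illustration of the topology-aware splitting mentioned above, the sketch below groups suspicious hosts by their (pod, ToR) location so that concurrently running reference-model diagnosis jobs do not share uplinks. The topology inventory and the grouping granularity are assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def topology_aware_groups(
    hosts: List[str],
    topo: Dict[str, Tuple[str, str]],    # host -> (pod, tor); assumed inventory data
) -> List[List[str]]:
    """Group suspicious hosts so each parallel diagnosis job stays under one ToR."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for host in hosts:
        groups[topo[host]].append(host)  # hosts under the same ToR share no uplink
    return list(groups.values())

# Example: two ToR groups -> two diagnosis jobs that never compete for the same links.
topo = {"h1": ("pod1", "tor1"), "h2": ("pod1", "tor1"), "h3": ("pod1", "tor2")}
print(topology_aware_groups(["h1", "h2", "h3"], topo))    # [['h1', 'h2'], ['h3']]
```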

4.2 Phase-2: Procedure-aware Diagnosis

Phase-1 still required offline diagnosis for many cases, leading to GPU utilization issues. Phase-2 focuses on improving runtime diagnosis by acquiring more training-specific information.

4.2.1 What is the ideal solution?

The goal was to enhance online task monitoring to provide more precise runtime information under strict constraints:

  • High confidentiality: Diagnosis must not expose customer models or data.
  • Minimal customer modifications: Public cloud providers cannot force customer code changes. The solution must be transparent.
  • Low overhead: New information collection should not impact training performance.

4.2.2 Customizing CCL is the bridge

The Collective Communication Library (CCL) was chosen as the ideal point for integration because:

  1. Modularity: In mainstream training frameworks (e.g., Megatron, DeepSpeed), CCL is an independent plugin that can be replaced without modifying customer models or frameworks.
  2. Boundary Role: CCL "sits at the boundary" of computation and communication, providing crucial information about both host-side processing time and network-side processing time.

Information Collection: During training, the customized CCL records specific statistics for each communication operator ($C_i$) on each GPU ($G_j$):

  • Collective launch count ($CL_{i,j}$): number of times $C_i$ is launched by $G_j$.

  • Work request count ($WR_{i,j}$): number of work requests for $C_i$ launched by $G_j$.

  • Work completion count ($WC_{i,j}$): number of work requests for $C_i$ finished by $G_j$. These metrics are chosen to be sufficient and necessary while remaining lightweight and easy to deploy.

Figure 7: Customizing CCL for failure diagnosis: (a) a normal training iteration, (b) a computation failure, (c) a communication failure.

Failure Localization with CCL Statistics:

  • Scenario-1: Failure in Computation (Figure 7b):
    • If a GPU $G_n$ (e.g., $G_2$) fails to launch a succeeding collective operation $C_i$ (e.g., $C_1$), the other GPUs will stall at $C_i$ and eventually time out.
    • Diagnosis: Aegis identifies $G_n$ as the root cause if its collective launch count for $C_i$ is lower than that of the other GPUs in the same group ($CL_{i,n} < CL_{i,j \neq n}$).
  • Scenario-2: Failure in Communication (Figure 7c):
    • If a specific work request in $C_i$ fails during transmission, all GPUs in the group will experience a CCL timeout.
    • Diagnosis: Aegis checks $WR_{i,j}$ and $WC_{i,j}$. On normal GPUs, $WR_{i,j} = WC_{i,j}$. If some work requests on a GPU $G_n$ (e.g., $G_1$) never complete, i.e., $WC_{i,n} < WR_{i,n}$, $G_n$ is related to the root cause, and NetDiag() is then performed on the related sources and destinations (see the sketch after this list).
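The two comparisons can be sketched as follows, assuming the per-GPU counters for one stalled collective have already been collected into dictionaries keyed by GPU rank; the function is an illustrative reading of the counter logic above, not the Aegis code.

```python
from typing import Dict, List, Tuple

def localize_failure(
    cl: Dict[int, int],   # CL_{i,j}: collective launch count per GPU for operator C_i
    wr: Dict[int, int],   # WR_{i,j}: work requests issued per GPU
    wc: Dict[int, int],   # WC_{i,j}: work requests completed per GPU
) -> Tuple[str, List[int]]:
    """Return a (verdict, suspect GPU ranks) pair for one stalled collective."""
    # Scenario 1: some GPU never launched the collective its peers are stuck in.
    max_cl = max(cl.values())
    laggards = [g for g, count in cl.items() if count < max_cl]
    if laggards:
        return "computation failure", laggards
    # Scenario 2: everyone launched, but some GPU has unfinished work requests,
    # pointing at its communication path (follow up with NetDiag on its peers).
    stuck = [g for g in wr if wc.get(g, 0) < wr[g]]
    if stuck:
        return "communication failure", stuck
    return "inconclusive", []

# Example: GPU 2 launched one collective fewer than its peers -> computation suspect.
print(localize_failure({0: 10, 1: 10, 2: 9},
                       {0: 40, 1: 40, 2: 36},
                       {0: 40, 1: 40, 2: 36}))
```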

Limitations of Customizing CCL: While effective and easy to deploy, CCL information alone might not fully uncover the ultimate root cause. However, its position at the computation/communication boundary is crucial for culprit localization. A practical challenge is maintaining customized CCL versions across various CUDA, driver, and CCL versions used by different customers.

5 Performance Degradation Diagnosis

Aegis also diagnoses performance degradation, which doesn't crash the training but significantly slows it down. Since offline diagnosis is not suitable here, specific runtime degradation diagnosis mechanisms are needed.

5.1 Basic Correlating Diagnosis

Similar to Phase-1, this leverages existing runtime statistics to detect performance degradation.

  • Key Metric Selection: Aegis identifies single abnormal devices through two categories of metrics:
    1. Abnormal operating metrics: Directly indicate abnormal conditions (e.g., Retran (retransmitted packets per second), which should ideally be zero).
    2. Performance metrics: Reflect execution efficiency (e.g., Actual TensorFLOPS). Aegis monitors more than 20 metrics (e.g., CPU utilization, GPU utilization, temperature, PCIe utilization, bandwidth utilization, retransmission count, switch port queue length, ECN count). Simple static thresholds are insufficient.
  • Cross-host Correlating Diagnosis: Leverages the synchronizing nature of training where metrics across hosts should follow similar patterns. Aegis uses a Z-Score outlier analyzer for different metrics.
    • For each metric, an outlier analyzer calculates the average value ($\lambda$) and standard deviation ($\delta$) over a period $T$ (e.g., 10 minutes).
    • A host is identified as an outlier if its metric value is consistently higher than $\lambda + 2\delta$. This simple yet effective method pinpoints abnormal nodes (a sketch follows this list).
  • Case study: An LLM training task experienced a 26% increase in iteration time. Aegis identified an abnormal NIC with high ECN statistics (10-30K per second) that exceeded the Z-Score threshold. The root cause was silent packet loss on a link, which forced traffic onto an overloaded NIC and caused network congestion. Isolating the host restored performance.
  • Limitation: This method works when a few hosts show significantly different metrics. It struggles when multiple metrics change across all hosts, making root cause difficult to pinpoint.
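A minimal sketch of the cross-host outlier rule, assuming each host's metric has already been averaged over the observation window; the exact windowing and the "consistently higher" criterion are simplified here.

```python
import statistics
from typing import Dict, List

def outlier_hosts(metric_by_host: Dict[str, float], k: float = 2.0) -> List[str]:
    """Flag hosts whose windowed metric average exceeds mean + k * stddev.

    metric_by_host maps host -> the metric averaged over a window (e.g. ECN
    marks per second over ~10 minutes), mirroring the lambda + 2*delta rule.
    """
    values = list(metric_by_host.values())
    threshold = statistics.mean(values) + k * statistics.pstdev(values)
    return [host for host, value in metric_by_host.items() if value > threshold]

# Example: one NIC averaging ~20K ECN marks per second stands out from its peers.
ecn = {"h1": 6.0, "h2": 8.0, "h3": 5.0, "h4": 7.0,
       "h5": 9.0, "h6": 6.0, "h7": 7.0, "h8": 20000.0}
print(outlier_hosts(ecn))  # ['h8']
```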

5.2 Enhancing Procedure-aware Diagnosis

To address the limitations of basic correlating diagnosis, Aegis further customizes CCL for degradation diagnosis. The customized CCL records:

  • Duration of $C_i$ on $G_j$ in iteration $I_k$: $TD_{i,j,k}$
  • Average duration of $C_i$ in iteration $I_k$: $\overline{TD}_{i,k}$
  • Network throughput for the last $L$ work requests of $C_i$ on $G_j$ in iteration $I_k$: $N_{i,j,k}$ (where $L = 5$)
  • Average network throughput of $C_i$ in iteration $I_k$: $\overline{N}_{i,k}$

Figure 9: Customizing CCL for performance diagnosis.

  • Computation Degradation (Figure 9a): If the duration of a collective operation $C_i$ on a specific GPU $G_j$ ($TD_{i,j,k}$) is significantly shorter than the average ($TD_{i,j,k} < \alpha \overline{TD}_{i,k}$, with $\alpha = 0.8$), it indicates that $G_j$ entered the collective late while its peers were already waiting there; since the collective finishes at roughly the same time on every GPU, the short in-collective duration points to computation degradation on $G_j$ (its preceding computation was slow).
  • Communication Degradation (Figure 9b): If the network throughput for $C_i$ on $G_j$ ($N_{i,j,k}$) is significantly higher than the average ($N_{i,j,k} > \beta \overline{N}_{i,k}$, with $\beta = 1.5$), it suggests communication degradation. This identifies the GPU group $\mathbb{G}$ suffering degradation, and RootDiag() principles are then applied to find the exact source/destination.
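A minimal sketch of the two threshold rules above, assuming the per-iteration statistics for one collective are available; it follows the stated $\alpha$/$\beta$ comparisons and is purely illustrative.

```python
from typing import Dict, List

def degradation_suspects(
    td: Dict[int, float],    # TD_{i,j,k}: duration of collective C_i on GPU j
    net: Dict[int, float],   # N_{i,j,k}: throughput of C_i's last L work requests on GPU j
    alpha: float = 0.8,
    beta: float = 1.5,
) -> Dict[str, List[int]]:
    """Apply the alpha/beta rules to one collective in one iteration."""
    td_avg = sum(td.values()) / len(td)      # \overline{TD}_{i,k}
    net_avg = sum(net.values()) / len(net)   # \overline{N}_{i,k}
    return {
        # Markedly shorter time inside the collective -> the GPU arrived late,
        # i.e. its preceding computation was slow (computation degradation).
        "computation": [g for g, v in td.items() if v < alpha * td_avg],
        # Throughput well above the group average triggers the communication rule.
        "communication": [g for g, v in net.items() if v > beta * net_avg],
    }

# Example: GPU 3 spends far less time inside the collective than its peers.
print(degradation_suspects({0: 10.0, 1: 10.5, 2: 9.8, 3: 4.0},
                           {0: 40.0, 1: 41.0, 2: 39.0, 3: 40.5}))
```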

6 Solving Problems Before Delivery

The observation that 73% of training tasks failed within the first 10 minutes (Figure 10) indicated issues present before training even properly started. This led to the Check Before Delivery (CBD) procedure.

Figure 11: Evolution of idle time in production. 该图像是图表,展示了生产环境中空闲时间的演变(图11)。图中通过柱状图表示空闲时间(小时),折线图表示任务规模(GPU数),显示了Aegis两个阶段实施前后空闲时间和任务规模的变化趋势。

Figure 10: Durations of training tasks in production.

Motivations for CBD:

  • Frequent component updates: Regular updates to training frameworks, CCL, container networks, NIC drivers, and switches can introduce new failures.
  • Post-usage failures: Hosts might develop faults after their last use but before being reallocated, leading to failures in new tasks. Diagnosing these is hard once a host is delivered to a customer.

CBD Procedure: CBD is performed right before resources are handed over to customers. It ensures the environment is fully set up, catching issues that earlier physical host checks might miss (e.g., container network misconfigurations). CBD needs to be efficient to avoid impacting user experience. The procedure is summarized in Table 1:

| Phase | Tasks | Time |
| --- | --- | --- |
| Configuration check (in parallel) | Host configuration check; GPU configuration check; NIC configuration check | <1 min |
| Single-host test (in parallel) | GPU kernels test; NVLink test; HBM test; PCIe test; CPU execution test; Dataset/Model/Checkpoint load test | 3 min |
| Multi-host test (in parallel) | Collective communication test; Comput./Comm. overlap test | 6 min |
Table 1: CBD task list (Manual transcription from the paper.)

  • The total CBD execution takes less than 10 minutes.
  • A lightweight CBD version (under 1 minute) is available for PaaS mode, focusing on parallelized configuration checks and quick, critical local host tests.
  • If a significant number of machines fail CBD, recent updates are rolled back. CBD has intercepted 1% of hosts as problematic before delivery, preventing training task failures. It is a mandatory procedure (see the sketch below).
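As a rough illustration of how the phases in Table 1 might be sequenced, the sketch below runs each phase's checks in parallel under a time budget and stops at the first failing phase. The check names follow Table 1, but the harness and the per-check timeout handling are assumptions, not the CBD implementation.

```python
import concurrent.futures as cf
from typing import Callable, Dict, List, Tuple

# Phase -> (time budget in seconds, checks run in parallel); names follow Table 1,
# while the check functions themselves are hypothetical stand-ins.
CBD_PHASES: List[Tuple[str, int, List[str]]] = [
    ("configuration check", 60, ["host_config", "gpu_config", "nic_config"]),
    ("single-host test", 180, ["gpu_kernels", "nvlink", "hbm", "pcie", "cpu", "ckpt_load"]),
    ("multi-host test", 360, ["collective_comm", "compute_comm_overlap"]),
]

def run_cbd(host: str, run_check: Callable[[str, str], bool]) -> Dict[str, bool]:
    """Run the phases in order; each phase's checks run in parallel with a budget."""
    results: Dict[str, bool] = {}
    for _phase, budget, checks in CBD_PHASES:
        with cf.ThreadPoolExecutor(max_workers=len(checks)) as pool:
            futures = {pool.submit(run_check, host, check): check for check in checks}
            for future, check in futures.items():
                try:
                    results[check] = future.result(timeout=budget)
                except cf.TimeoutError:
                    results[check] = False    # a timed-out check fails the host
        if not all(results[check] for check in checks):
            break                             # stop early; the host is held back
    return results

# Example with a trivial check that always passes.
print(run_cbd("h42", lambda host, check: True))
```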

Experimental Setup

The evaluation of Aegis was conducted in a real-world production environment at Alibaba Cloud.

Context

  • Deployment Environment: Aegis has been online since September 2023, serving dozens of large-scale training clusters for over a year.
  • Target Workload: Statistics are drawn from an inner model training team at Alibaba Cloud, which is involved in training one of the top-tier LLMs (Large Language Models) globally. This provides a realistic and demanding testbed.
  • Scale of Operations: The training task scale increased by more than 40x during the 16 months of observation.

Evaluation Metrics

The effectiveness of Aegis is quantified using several key metrics:

  1. Idle Time Wasted by Diagnosis:

    • Conceptual Definition: This metric measures the total duration, typically in hours, during which computational resources (primarily GPUs) are available but remain unused because the system is undergoing fault diagnosis or waiting for resolution of a detected issue. High idle time signifies inefficient resource utilization and increased operational costs.
    • Mathematical Formula & Symbol Explanation: Not explicitly given, but implicitly defined as the cumulative time GPUs are not performing useful work due to diagnosis: $\text{Idle Time} = \sum_{i=1}^{N} (\text{time GPU}_i \text{ is available but idle due to diagnosis})$, where:
      • $N$: total number of GPUs in the cluster.
      • The sum is taken over the entire observation period.
  2. Training Task Restart Count:

    • Conceptual Definition: This metric quantifies the number of times a training task (representing a model training job) has to be terminated and re-initiated from a previous checkpoint or from scratch due to failures. A high restart count indicates poor system reliability and significant delays in model development and deployment.
    • Mathematical Formula & Symbol Explanation: Not explicitly given; simply a count: $\text{Restart Count} = \text{total number of training task restarts}$.
  3. Performance Degradation Percentage ($P_{deg}$):

    • Conceptual Definition: This metric quantifies the extent to which training tasks operate below their expected performance level. It measures the excess time spent in training iterations beyond a defined standard iteration time, relative to the total training time. Lower performance degradation indicates more efficient and stable training.
    • Mathematical Formula: $P_{deg} = \frac{\sum_{k: T_k > T_S} (T_k - T_S)}{\sum_{\text{all}} T_k}$
    • Symbol Explanation:
      • $T_k$: the measured iteration time for the $k$-th training iteration, obtained from the model training log.
      • $T_S$: the standard iteration time, calculated as $T_S = 1.2 \times \overline{T_k}$.
      • $\overline{T_k}$: the average iteration time (presumably of a stable period or expected baseline). The paper defines $T_S$ relative to $\overline{T_k}$ with a 20% slack (1.2x) to account for normal variation.
      • $\sum_{k: T_k > T_S} (\dots)$: summation over all iterations $k$ whose measured iteration time $T_k$ exceeds the standard iteration time $T_S$; this captures only the degraded portions.
      • $\sum_{\text{all}} T_k$: summation of all measured iteration times for the entire training task.
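A small sketch of this computation, using the plain mean of the observed iteration times as $\overline{T_k}$ (a simplifying assumption, since the baseline period is not specified here):

```python
from typing import Sequence

def performance_degradation(iter_times: Sequence[float], slack: float = 1.2) -> float:
    """P_deg = sum of (T_k - T_S) over degraded iterations / sum of all T_k."""
    t_s = slack * (sum(iter_times) / len(iter_times))   # T_S = 1.2 * mean(T_k)
    excess = sum(t - t_s for t in iter_times if t > t_s)
    return excess / sum(iter_times)

# Example: five slow (2.5 s) iterations among mostly ~1.0 s iterations.
times = [1.0] * 95 + [2.5] * 5
print(round(performance_degradation(times), 3))         # ~0.056
```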

Baselines

The paper's evaluation focuses on measuring the impact of Aegis's deployment over time, rather than comparing it against specific existing research systems in a controlled benchmark. The baselines are essentially the "before Aegis" or "previous state of the production system" at Alibaba Cloud.

  • Implicit Baseline 1: The state of the production cluster prior to the deployment of Aegis Phase-1.

  • Implicit Baseline 2: The state of the production cluster after Aegis Phase-1 but before Aegis Phase-2 and CBD, demonstrating the incremental improvements.

    This approach is typical for papers describing real-world production system deployments, where the primary goal is to show the practical impact and value added by the new system within their specific operational context.

Results & Analysis

The results demonstrate the significant positive impact of Aegis's phased deployment on training stability and efficiency in a production LLM training environment.

Core Results

The quantitative improvements achieved by Aegis are substantial:

  • Idle Time Reduction: More than 97% decrease in idle time wasted by diagnosis.

  • Training Task Restart Reduction: 84% decrease in training task restart count.

  • Performance Degradation Reduction: 71% decrease in performance degradation.

    These figures highlight Aegis's effectiveness in increasing GPU utilization, reducing operational costs, and improving the overall developer experience for AI model training.

Data Presentation (Figures)

Figure 1: Failures in a representative cluster. Analysis: This bar chart illustrates the frequency of hardware, software, and network failures over ten weeks in a production cluster. Hardware failures consistently dominate, with 100-230 critical failures per week. This underscores the primary challenge Aegis was built to address: the inherent unreliability of high-performance hardware in large-scale AI training environments and the need for robust diagnosis.

Figure 2: Types of failures encountered in production. Analysis: This pie chart details the breakdown of failure types. GPU-related reasons (e.g., execution error, driver error, memory error, CUDA error, NVLINK error, ECC error) account for a significant 45.6% of all failures. NVLINK failures are 9.2% and PCIe errors are 10.4%. This further emphasizes the dominance of host-side hardware issues and the complexity introduced by specialized intra-host interconnects. It justifies Aegis's Phase-1 strategy of prioritizing host-side critical failures.

Figure 10: Durations of training tasks in production. Analysis: This bar chart shows that 73% of training tasks failed within the first 10 minutes. Given that the initialization phase typically takes 5-20 minutes, this indicates a high rate of failures occurring before actual productive training begins. This finding directly motivated the development and deployment of Check Before Delivery (CBD) to proactively catch issues.

Figure 11: Evolution of idle time in production. Analysis: This figure shows the monthly accumulated idle time (bars) and the training task scale (line) from May 2023 to August 2024.

  • Pre-Aegis Phase-1: High idle time is evident before September 2023.
  • Aegis Phase-1 (Online in Sep 2023): In September and October 2023, despite a doubling of training scale, idle time decreased by 71% in the first month.
  • Scaling Impact (Nov 2023): A 4x boost in training scale in November led to a temporary increase in idle time, highlighting how scaling introduces new corner case issues.
  • Aegis Phase-2 (Online in Jun 2024): The deployment of Phase-2 resulted in a further 91% reduction in idle time in July and August 2024, attributed to more failures being resolved by runtime diagnosis without resorting to offline diagnosis.

Figure 12: Evolution of restart counts in production. Analysis: This chart tracks training task restart counts over the same period.

  • Pre-CBD: High restart counts are observed in November 2023, coinciding with the training scale increase and indicating numerous initialization phase failures.
  • CBD Deployment (Dec 2023): CBD was deployed in December 2023, leading to a 44.8% decrease in restart counts in the following month. Continuous optimization of CBD eventually reduced restart counts by 84.6%.
  • August 2024 Increase: An increase in restarts in August 2024 is attributed to planned experiments and fine-tuning activities by the model training team, rather than systemic failures. This indicates that CBD is effective in mitigating unplanned restarts.

Figure 13: Runtime diagnosis percentage. Analysis: This figure shows the percentage of failure cases diagnosed at runtime (vs. offline).

  • Pre-Aegis Phase-2: Before June 2024, the runtime diagnosis percentage was lower, indicating a reliance on offline diagnosis for many cases.
  • Aegis Phase-2 (Online in Jun 2024): With Phase-2 deployment, the runtime diagnosis percentage gradually converges to nearly 100%. This is a critical achievement as it means most training task failures can be automatically recovered without human interference or resource-intensive offline procedures, drastically improving GPU utilization and user experience.

Figure 14: Performance degradation percentage. Analysis: This bar chart displays the performance degradation percentage over time.

  • Pre-Degradation Diagnosis: Before June 2024, performance degradation values were consistently higher.
  • Degradation Diagnosis Deployment (June 2024): With the deployment of the performance degradation diagnosis features in June 2024, the performance degradation percentage significantly decreased by 71% in the subsequent months. This highlights the effectiveness of both basic correlating diagnosis and enhanced procedure-aware diagnosis in identifying and mitigating performance bottlenecks.

Figure 15: Number of failed links during the batch failure. Analysis: This line graph illustrates a specific experience regarding batch link failures caused by contamination during data center construction. It shows a swift increase in failed links when new machines are delivered, followed by a gradual decrease as cleaning efforts progress. This case study from the experience section demonstrates a practical challenge encountered and how Aegis (or the operational response informed by Aegis's insights) helped manage and eventually resolve a large-scale hardware reliability issue.

Ablations / Parameter Sensitivity

The paper implicitly conducts an ablation study through its phased evolutionary approach:

  • Impact of Phase-1: Demonstrated by the initial 71% reduction in idle time after its deployment, showing the value of enhancing existing systems with training-specific logs and host-side critical error diagnosis.

  • Impact of Phase-2: Further 91% saving in idle time (on top of Phase-1) and near 100% runtime diagnosis percentage after its deployment, highlighting the crucial contribution of CCL customization for procedure-aware diagnosis.

  • Impact of CBD: A 44.8% initial and 84.6% overall decrease in restart counts after CBD deployment, showing the benefits of proactive checks before resource delivery.

    Each phase builds upon the previous one, effectively demonstrating the incremental value of each Aegis component. The Z-Score threshold ($\lambda + 2\delta$) for outlier detection and the CCL degradation parameters ($\alpha = 0.8$, $\beta = 1.5$) are mentioned as practical values determined in production.

Conclusion & Personal Thoughts

Conclusion Summary

This paper rigorously details the Evolution of Aegis, a fault diagnosis system specifically designed for AI model training services in a production cloud environment at Alibaba Cloud. Driven by the unique challenges of large-scale distributed AI training, Aegis evolved through strategic phases. Phase-1 enhanced existing diagnosis systems with training-specific error detection and an offline diagnosis fallback. Phase-2 significantly improved runtime diagnosis capabilities by customizing the Collective Communication Library (CCL) to gain procedure-aware insights without modifying customer code. Complementing these, Aegis also developed modules for performance degradation diagnosis and implemented Check Before Delivery (CBD) for proactive issue detection. The system has achieved remarkable results in production: decreasing idle time by 97%, reducing training task restarts by 84%, and mitigating performance degradation by 71%. These outcomes underscore Aegis's critical role in enhancing the reliability, efficiency, and user experience of AI model training in the cloud.

Limitations & Future Work

The authors candidly discuss several limitations and areas for future work:

  • CCL Customization Depth: While CCL customization is easy to deploy and effective for culprit localization, it might not always provide sufficient information to determine the ultimate root cause of failures, especially complex ones that intertwine multiple system layers.
  • CCL Version Management: The need to maintain customized CCL versions for various CUDA, driver, and CCL versions across customer environments poses an ongoing maintenance challenge.
  • Reboot or Repair Determination: Accurately and efficiently determining whether a faulty host requires a simple reboot or a full hardware repair remains an open issue. Their current SOP (Standard Operating Procedure), which includes stress tests and a three-strikes-and-repair rule, is practical but still leaves room for refinement.
  • Complex Failure Cases: Despite Aegis's success, corner cases (e.g., highly specific congestion control bugs, silent packet loss with unusual characteristics) still emerge, requiring continuous adaptation and enhancement of the system.
  • Customer Privacy vs. Diagnostic Depth: The inherent tension between the desire for deeper diagnostic insights (which might require more intrusive data collection) and the need to protect customer privacy is a constant design constraint.

Personal Insights & Critique

This paper offers a highly valuable contribution by presenting a practical, battle-tested diagnosis system from a major cloud provider. Its strengths lie in:

  • Production Focus and Pragmatism: The paper doesn't just propose a theoretical solution; it describes a system that has been continuously evolved and deployed in a large-scale production environment under real-world constraints. This makes the findings and lessons learned extremely relevant.

  • Phased Evolutionary Approach: The iterative development of Aegis (Phase-1, Phase-2, CBD) is a realistic reflection of how complex systems are built and improved. It provides a clear ablation study by showing the incremental benefits of each stage.

  • Non-Intrusive Design Principle: The strong commitment to diagnose faults without modifying customer code is a critical design choice for a public cloud service provider. The innovative use of CCL customization as a bridge is particularly clever, balancing diagnostic power with customer privacy and ease of deployment.

  • Comprehensive Problem Coverage: Addressing task failures, performance degradation, and pre-delivery checks covers the full lifecycle of reliability challenges in AI training.

  • Transparent Lessons Learned: The "Experience and Lessons" section provides invaluable insights into the practical difficulties of operating such a service, including hardware reliability, heterogeneous environments, configuration management, and network issues. These are often overlooked in academic papers but are crucial for real-world impact.

    One potential area for deeper exploration (though perhaps beyond the scope of a single paper) could be:

  • Root Cause Analysis Automation: While Aegis excels at culprit localization, the paper notes that root cause analysis is often done offline after isolation. Future work could explore more automated, AI-driven root cause analysis techniques that leverage the rich diagnostic data collected by Aegis to provide deeper, actionable insights for system engineers.

  • Generalizability of CCL Customization: While the CCL customization is effective, its direct transferability to extremely niche or highly specialized collective communication patterns not covered by mainstream frameworks might require further investigation.

    Overall, Aegis is an impressive feat of system engineering that tackles a complex, growing problem with an elegant and practical solution. Its success story is a testament to the power of iterative design, real-world data, and a deep understanding of operational challenges.
