
Minder: Faulty Machine Detection for Large-scale Distributed Model Training

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Minder is an automated faulty machine detection system for large-scale distributed model training. It accurately identifies faulty metric patterns with an average response time of 3.6 seconds, achieving 0.904 precision and a 0.893 F1-score, and significantly reduces manual diagnosis time and cost.

Abstract

Large-scale distributed model training requires simultaneous training on up to thousands of machines. Faulty machine detection is critical when an unexpected fault occurs in a machine. From our experience, a training task can encounter two faults per day on average, possibly leading to a halt for hours. To address the drawbacks of the time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect faulty distinctive monitoring metric patterns, which could last for a period before the entire training task comes to a halt. Minder has been deployed in our production environment for over one year, monitoring daily distributed training tasks where each involves up to thousands of machines. In our real-world fault detection scenarios, Minder can accurately and efficiently react to faults within 3.6 seconds on average, with a precision of 0.904 and F1-score of 0.893.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Minder: Faulty Machine Detection for Large-scale Distributed Model Training". It focuses on automatically identifying malfunctioning machines within large-scale distributed machine learning training environments.

1.2. Authors

The authors and their affiliations are:

  • Yangtao Deng (Tsinghua University)

  • Xiang Shi (ByteDance)

  • Zhuo Jiang (ByteDance)

  • Xingjian Zhang (Tsinghua University)

  • Lei Zhang (ByteDance)

  • Zhang Zhang (ByteDance)

  • Bo Li (ByteDance)

  • Zuquan Song (ByteDance)

  • Hang Zhu (ByteDance)

  • Gaohong Liu (ByteDance)

  • Fuliang Li (Northeastern University)

  • Shuguang Wang (ByteDance)

  • Haibin Lin (ByteDance)

  • Jianxi Ye (ByteDance)

  • Minlan Yu (Harvard University)

    The paper is a collaboration primarily between Tsinghua University, ByteDance, Northeastern University, and Harvard University, indicating a mix of academic research and industry application, particularly from a large technology company involved in distributed systems.

1.3. Journal/Conference

The paper was published in the Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI). NSDI is a highly reputable and influential conference in the field of computer systems, particularly for networked systems, operating systems, and distributed systems. Publication at NSDI signifies high-quality, impactful, and often practical research in systems design and implementation.

1.4. Publication Year

2025. The paper was presented at NSDI '25, held April 28-30, 2025.

1.5. Abstract

Large-scale distributed model training, especially for models like Large Language Models (LLMs), involves thousands of machines, making faulty machine detection critical. The authors' experience shows that a training task can encounter an average of two faults per day, potentially causing hours-long halts. To overcome the limitations of time-consuming and labor-intensive manual diagnosis, the paper proposes Minder, an automatic faulty machine detector. The core idea of Minder is to efficiently detect distinctive, faulty monitoring metric patterns that persist for a period before a complete task halt. Minder has been deployed in a production environment for over a year, monitoring daily distributed training tasks with up to thousands of machines. In real-world scenarios, Minder achieves an average fault reaction time of 3.6 seconds, with a precision of 0.904 and an F1-score of 0.893.

Official Link: https://www.usenix.org/conference/nsdi25/presentation/deng
PDF Link: https://www.usenix.org/system/files/nsdi25-deng.pdf
The paper is officially published and available through the USENIX conference proceedings.

2. Executive Summary

2.1. Background & Motivation

The rapid growth in dataset sizes and model parameters, particularly in Large Language Models (LLMs), has necessitated large-scale distributed model training involving thousands of machines and GPUs. While advancements in distributed training, collective communication, and fault tolerance have made this feasible, the immense scale and complexity also significantly increase the potential for faults.

The core problem Minder aims to solve is the inefficient and costly detection of faulty machines in large-scale distributed model training environments.

  • Importance of the problem:

    • High Fault Frequency: In production environments, an accidental hardware or software fault occurs, on average, twice a day. These faults can impact thousands of machines and GPUs.
    • Significant Downtime & Economic Loss: A single fault can force an entire training task to halt for hours or even days, leading to substantial economic losses (e.g., $1700 for 40 minutes on a 128-machine task). Frequent interruptions severely increase operating expenses and time costs for training large models.
    • Drawbacks of Current Solutions: The prevailing method is manual diagnosis, which is:
      • Time-consuming: Engineers are alerted only after a task has entirely stopped, missing performance degradations. Diagnosis often takes over half an hour, sometimes days (as shown in Figure 2 of the paper).
      • Labor-intensive: Multiple teams (training, networking, storage, hardware) are involved.
      • Incomplete/Redundant Scrutiny: Manual log checks are often incomplete or contain redundant information, making fault identification difficult.
      • Lack of Timely Notification: Faults that cause slowdowns but not complete halts go undetected until the task eventually stops.
  • Paper's Entry Point/Innovative Idea: Minder addresses these challenges by recognizing that a machine experiencing a fault will exhibit distinctive, abnormal patterns in its monitoring metrics that differ from healthy machines and persist over time. Instead of relying on manual inspection or generic anomaly detection, Minder automatically and efficiently identifies these patterns by leveraging machine-level similarity, continuity of abnormal behavior, individual learning-based denoising models for each metric, and metric prioritization.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  • Empirical Investigation of Faults: A detailed analysis of real-world fault types encountered in large-scale distributed training and their correlations with various monitoring metrics. This investigation highlights why certain metrics are more sensitive and the inherent challenges in detection (Section 2.3 and 2.4).

  • Novel Design Principles: The introduction of key design ideas for Minder: machine-level similarity (comparing a machine's metrics to its peers), machine-level continuity (detecting persistent abnormal patterns), individual learning-based denoising models (using LSTM-VAE for each metric to handle noise and task-dependent baselines), and metric prioritization (to expedite detection).

  • System Implementation and Deployment: The design, implementation, and deployment of Minder in a production environment for over a year, monitoring daily distributed training tasks involving up to thousands of machines.

  • Thorough Evaluation and Ablation Studies: A comprehensive evaluation of Minder's performance in real-world fault detection scenarios, demonstrating its fast reaction time, high accuracy (precision of 0.904, F1-score of 0.893), and the effectiveness of its design choices through ablation experiments.

  • Practical Lessons and Future Directions: Sharing practical lessons learned from deploying Minder and outlining future research directions to further enhance fault detection capabilities.

    The key findings demonstrate that Minder can accurately and efficiently react to faults within 3.6 seconds on average, significantly reducing the detection time compared to manual methods. It effectively addresses the challenges of diverse fault types, task-dependent normal states, non-one-to-one metric-fault correlations, and noise in monitoring data.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand Minder, a reader should be familiar with the following concepts:

  • Distributed Model Training:

    • Large Language Models (LLMs): AI models with billions or trillions of parameters, requiring massive computational resources. Examples include GPT-4.
    • 3D Parallelism (Data Parallelism, Model Parallelism, Pipeline Parallelism): Techniques to distribute the computational load of training large models across multiple devices (GPUs) and machines.
      • Data Parallelism (DP): The same model is replicated on multiple devices, and each device processes a different subset of the training data. Gradients are then synchronized across devices (e.g., using All-reduce).
      • Model Parallelism (MP) / Tensor Parallelism (TP): The model itself is too large to fit on a single device, so its layers or parts of a layer (tensors) are split across multiple devices. Each device computes a portion of the model.
      • Pipeline Parallelism (PP): Different layers of the model are assigned to different devices, forming a pipeline. Data batches are processed in a pipelined fashion, with each device passing intermediate activations to the next.
    • Homogeneous vs. Heterogeneous Environments: Minder operates in a largely homogeneous environment, where machines have similar hardware (GPUs, NICs) and configurations, which is crucial for its similarity assumption.
  • Machine Monitoring Metrics: Quantitative measurements of a machine's hardware and software components.

    • CPU Usage: Percentage of time the Central Processing Unit is actively working.
    • GPU Usage / Duty Cycle: Percentage of time the Graphics Processing Unit is actively processing tasks.
    • Memory Usage: Amount of RAM currently in use.
    • Disk Usage: Amount of storage space used on a disk.
    • Network Metrics:
      • Throughput (TCP/RDMA): The rate at which data is successfully transmitted over a network connection. RDMA (Remote Direct Memory Access) allows direct memory access between computers without involving the CPU, crucial for high-performance computing.
      • PFC (Priority-based Flow Control) Packet Rates: A mechanism in Ethernet networks to prevent packet loss due to congestion by pausing traffic for specific priority levels. Surges in PFC packets often indicate network congestion.
      • ECN (Explicit Congestion Notification) Packet Rate: A feature in TCP/IP networks that allows routers to mark packets to signal network congestion to endpoints, without dropping them.
      • CNP (Congestion Notification Packet) Rate: Used in RoCEv2 (RDMA over Converged Ethernet) to explicitly signal congestion back to the sender.
    • Hardware-specific Metrics:
      • PCIe (Peripheral Component Interconnect Express) Bandwidth/Usage: High-speed serial computer expansion bus used to connect hardware devices. Degradation indicates issues with data transfer to/from components like GPUs, NICs.
      • NVLink Bandwidth: A high-speed interconnect developed by NVIDIA for GPU-to-GPU and GPU-to-CPU communication, offering much higher bandwidth than PCIe.
      • ECC (Error-Correcting Code) Errors: Errors in memory that can be corrected by ECC memory, but frequent errors can indicate hardware instability.
  • Anomaly Detection:

    • Unsupervised Learning: A type of machine learning where the algorithm learns patterns from unlabeled data. It's suitable for anomaly detection when "normal" data is abundant but "anomalous" data is rare and undefined.
    • Time Series Data: A sequence of data points indexed in time order. Monitoring metrics are typically time series.
    • Denoising: The process of removing noise from data to reveal underlying patterns.
    • Variational Autoencoder (VAE): A type of generative neural network that learns a compressed, meaningful representation (latent space) of its input data. It can be used for denoising by reconstructing input data from its latent representation, effectively filtering out noise.
    • LSTM (Long Short-Term Memory): A type of recurrent neural network (RNN) capable of learning long-term dependencies in sequential data, making it suitable for time series analysis.
    • Z-score: A statistical measure that describes a value's relationship to the mean of a group of values, measured in terms of standard deviations from the mean. It helps identify outliers.
    • Decision Tree: A flowchart-like structure where each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (decision). Used for classification and prioritization.
    • Euclidean Distance: The straight-line distance between two points in Euclidean space. Used to measure similarity/dissimilarity between data points or embeddings.

3.2. Previous Works

The paper categorizes related work into three main areas:

  • Intra-host Diagnosis:

    • Run-time diagnosis: Focuses on detecting anomalies without interrupting running tasks. This often involves leveraging existing hardware counters (e.g., Intel Performance Counter Monitor [8], NVIDIA System Management Interface [14]) or building dependency graphs from historical monitoring data (e.g., BRCA [63], MonitorRank [43], FChain [62]) using Deep Learning (DL) algorithms. Log-based approaches using Natural Language Processing (NLP) are also used (e.g., [19]) but are limited by log content.
    • Offline diagnosis: Requires specific tools to detect bottlenecks when the machine is not running tasks. Examples include Liu et al. [55] and Martinasso et al. [58] for NVLink/PCIe link degradation, and Collie [44] for RDMA subsystem anomalies. These are not suitable for real-time fault identification during training.
    • Minder's approach falls under run-time intra-host diagnosis, but with an emphasis on detecting faulty machines rather than specific component bottlenecks, and importantly, it operates continuously during live training.
  • Inter-host Diagnosis:

    • Focuses on network failures between machines. NetBouncer [73] localizes failed devices/links using IP-in-IP. SNAP [79] monitors TCP statistics for network diagnosis. Pingmesh [33], Haecki [34], and Fathom [66] monitor end-to-end latencies or RPC performance. These primarily detect network path failures, not intra-host hardware faults affecting training speed.
    • Cloud system diagnosis (e.g., Conan [47], Sage [26], Traceark [81]) focuses on microservices and contextual data patterns in cloud environments. These often assume different dependencies and workload characteristics than the homogeneous and highly synchronized nature of distributed model training.
    • Minder differentiates by focusing on machine-level faults, which can then manifest as inter-host network issues (e.g., PFC surges), but the root cause is often intra-host. The assumption of similar workloads across machines in distributed training makes Minder's similarity-based approach applicable, unlike general cloud systems.
  • Algorithms for Anomaly Detection and Diagnosis:

    • Statistics-based methods: Use statistical measures like Euclidean distance, Pearson Correlation [25], Kendall's tau [42], Spearman Correlation [80], or Mahalanobis Distance (MD) [30, 46, 57] to quantify similarity and identify deviations. These often require manual threshold setting. Minder uses Z-score for prioritization and Euclidean distance for similarity but builds on denoised data.
      • Mahalanobis Distance (MD): A measure of the distance between a point and a distribution, accounting for correlations between variables (a short code sketch appears at the end of this subsection): $D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$. Where:
        • $x$ is the data point (vector of observations).
        • $\mu$ is the mean vector of the observations.
        • $S^{-1}$ is the inverse of the covariance matrix of the observations.
        • $^T$ denotes the transpose.
    • Supervised algorithms: Widely used for anomaly detection when labeled data is available (e.g., EGADS [45] for time series, random forest [64] for incident routing [27]).
      • Minder explicitly states that supervised learning is inappropriate because "normal" states are task-dependent, and labeling is impractical (Challenge 2). The goal is to identify which machine is faulty, not just classify normal/abnormal.
    • Unsupervised learning: Identifies outliers from monitoring data. This includes clustering (e.g., [50, 51, 56]) and methods using VAE (e.g., [52, 70, 71, 75, 78]).
      • Minder leverages unsupervised learning, specifically LSTM-VAE, for denoising and representation learning, which aligns with this category but applies it in a novel way for distributed training fault detection.
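To make the Mahalanobis Distance formula above concrete (MD is also the baseline Minder is compared against in Section 6.1.2), here is a minimal numpy sketch; the choice of per-machine feature vectors (e.g., per-metric statistics) is an assumption, not the paper's exact feature set.

```python
import numpy as np

def mahalanobis(x: np.ndarray, samples: np.ndarray) -> float:
    """Mahalanobis distance of one machine's feature vector x (shape (d,))
    to the distribution of all machines' feature vectors (shape (n, d))."""
    mu = samples.mean(axis=0)                                 # mean vector
    s_inv = np.linalg.pinv(np.cov(samples, rowvar=False))     # S^{-1}
    delta = x - mu
    return float(np.sqrt(delta @ s_inv @ delta))
```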

3.3. Technological Evolution

The evolution of distributed systems, particularly for AI training, has driven the need for advanced fault detection. Early distributed systems relied on simpler health checks and manual debugging. As models grew into the billions and trillions of parameters (e.g., GPT-4, PaLM), training required thousands of GPUs and complex parallelism strategies (DP, TP, PP). This increased scale introduced:

  1. Higher fault frequency: More components mean more points of failure.

  2. Cascading failures: A single fault can quickly propagate, halting the entire task.

  3. Increased cost of downtime: Idle high-end GPUs are extremely expensive.

  4. Complex interdependencies: Pinpointing root causes becomes harder with complex software stacks and hardware interactions.

    This evolution renders traditional manual diagnosis, which was already slow and expensive, completely unsustainable. Simple threshold-based anomaly detection often fails due to the task-dependent nature of "normal" behavior and the presence of noise. Supervised learning is hampered by the lack of labeled fault data and the dynamic nature of "abnormal" patterns.

Minder fits into this technological timeline by proposing an automated, real-time, and robust solution tailored to the specific challenges of modern large-scale distributed training. It moves beyond simple statistical methods and generic anomaly detection by incorporating domain-specific insights like machine-level similarity and continuity within the framework of unsupervised learning (LSTM-VAE) and strategic metric prioritization.

3.4. Differentiation Analysis

Minder's core differences and innovations compared to existing methods:

  • Automatic vs. Manual: Unlike the labor-intensive manual diagnosis, Minder fully automates the detection process, reducing reaction time from hours/days to seconds.
  • Real-time vs. Post-mortem: Minder detects faults at runtime, often before the entire task halts, addressing the issue of delayed notifications in manual methods.
  • Unsupervised & Task-Agnostic vs. Supervised/Threshold-based:
    • Traditional supervised methods struggle with the task-dependent nature of "normal" metric values and the scarcity of labeled fault data.
    • Simple threshold-based methods are brittle to varying workloads.
    • Minder uses unsupervised learning (LSTM-VAE) to learn normal patterns for a given task without requiring pre-labeled fault data. Its similarity-based approach compares machines within the same task, making it adaptive to diverse workloads.
  • Handles Noise & Jitters: Minder explicitly incorporates LSTM-VAE for denoising and continuity checks to filter out transient noises, a common pitfall for statistical methods like Mahalanobis Distance (as shown in its lower accuracy).
  • Multi-metric & Prioritized Approach vs. Monolithic/Single-metric:
    • Acknowledging that no single metric indicates all fault types and that correlations are complex, Minder uses individual models for each metric. This avoids the "mutual interference" that a single, integrated model (like CON or INT in evaluation) might face.
    • Metric prioritization further optimizes detection speed by checking the most sensitive metrics first, a strategic application of domain knowledge.
  • Focus on Machine-level Faults in Homogeneous Distributed Training: Minder leverages the inherent similarity of metric patterns across machines in 3D parallel distributed training. This strong assumption allows it to effectively detect outliers, a characteristic not present in heterogeneous cloud microservice environments or general network diagnosis tools.

4. Methodology

The Minder framework is designed as an automatic, responsive, and accurate system to detect faulty machines causing unexpected halts in distributed training tasks. It addresses challenges like diverse fault types, task-dependent normal states, complex metric-fault correlations, and noisy monitoring data.

The following figure (Figure 5 from the original paper) shows the system architecture of Minder:

[Figure: Flowchart of Minder, from monitoring data preprocessing to online faulty machine detection, including per-metric model training, monitoring metric prioritization, similarity-based distance checks, and continuity checks.]

4.1. Principles

Minder's design revolves around four key principles to overcome the identified challenges:

  1. Machine-level Similarity (3.1): Distributed training, especially with 3D parallelism (Data Parallelism, Pipeline Parallelism, Tensor Parallelism), typically ensures a balanced load across machines. This leads to similar fluctuations in monitoring metrics among healthy machines. A faulty machine, however, will exhibit distinctive differences in its metric patterns compared to its peers. Minder leverages this by detecting outliers in metric data, regardless of the specific fault type, by measuring "distance" from other machines. This addresses Challenge 1 (various ways a machine can fail) and Challenge 2 (task-dependent normal states) by focusing on relative abnormality within a task rather than absolute thresholds.

  2. Machine-level Continuity (3.2): Transient jitters or short-term noise can lead to false alarms. Minder incorporates the idea that actual faults typically cause abnormal performance that persists for a period (e.g., minutes). By requiring a detected abnormality to be continuous over several time windows, Minder filters out bursty noises and temporary fluctuations, enhancing robustness and addressing Challenge 2 (task-dependent normal states by differentiating true faults from transient blips).

  3. Individual Learning-Based Denoising Models for Each Monitoring Metric (3.3): Monitoring data inherently contains noise (e.g., jitters, sensor inaccuracies). Minder uses Variational Autoencoders (VAEs) with LSTM layers to denoise and reconstruct raw time series data. Importantly, it trains individual models for each monitoring metric. This addresses Challenge 3 (non-one-to-one correlation between fault types and metrics) and Challenge 4 (noisy data). Training separate models avoids mutual interference that could arise if a single model tried to process all metrics simultaneously, especially since different metrics have varying sensitivities to different fault types.

  4. Prioritized Metric Sequence (3.4): To expedite detection, Minder doesn't check all available metrics at once. Instead, it generates a prioritized list of metrics based on their sensitivity to faults. It then uses the models corresponding to the top-prioritized metrics first. This ensures that the most indicative metrics are checked earliest, leading to faster fault identification.

4.2. Core Methodology In-depth (Layer by Layer)

The Minder framework consists of four main components: Preprocessing, Per-metric Model Training, Monitoring Metric Prioritization, and Online Faulty Machine Detection.

4.2.1. Monitoring Data Preprocessing (4.1)

Minder first processes the raw monitoring data collected from each machine to prepare it for analysis.

  1. Aggregation into Time Windows: The continuous stream of monitoring data is grouped into discrete time windows.

  2. Data Alignment: Sampling points across all machines are aligned based on their timestamps. If data samples are missing at certain points, Minder fills these gaps by using data from the nearest available sampling time.

  3. Normalization: To ensure that multi-dimensional monitoring data, which often has different scales and ranges, can be effectively compared and integrated, Min-Max Normalization is applied. This technique scales the data to a predefined range, typically [0, 1].

    The formula for Min-Max Normalization is: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$

    Where:

    • $x'$ is the normalized value of the data point.

    • $x$ is the original value of the data point.

    • $x_{min}$ is the minimum value of the feature (metric) in the dataset.

    • $x_{max}$ is the maximum value of the feature (metric) in the dataset.

      This ensures that all metrics contribute proportionally to similarity calculations, preventing metrics with larger value ranges from dominating the comparison.
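To make the preprocessing step concrete, below is a minimal sketch (not the authors' code) of window aggregation, gap filling, and Min-Max normalization for a single metric; the pandas-based alignment, the forward/backward fill used as a stand-in for nearest-sample filling, and the 60-second window are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def preprocess_metric(raw: pd.DataFrame, window: str = "60s") -> np.ndarray:
    """Sketch of Minder-style preprocessing for one monitoring metric.

    raw: DataFrame indexed by timestamp (second-level samples assumed),
         one column per machine.
    Returns an array of shape (num_machines, num_windows) scaled to [0, 1].
    """
    # Align sampling points across machines; fill missing samples
    # (forward/backward fill as a stand-in for nearest-sample filling).
    aligned = raw.sort_index().ffill().bfill()

    # Group the continuous stream into discrete time windows
    # (per-window mean used here purely for illustration).
    windowed = aligned.resample(window).mean()

    # Min-Max normalization over the whole metric:
    # x' = (x - x_min) / (x_max - x_min).
    x = windowed.to_numpy().T                 # (machines, windows)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / max(x_max - x_min, 1e-9)
```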

4.2.2. Per-metric Model Training (4.2)

As established in Section 3.3, Minder trains individual learning models for each distinct monitoring metric. This is crucial because no single metric guarantees an indication for all fault types, and different metrics have varying sensitivities.

  1. Input Data for Training: For each metric, the preprocessed per-machine data within a time window (e.g., a vector of length $w$, representing $w$ consecutive samples) is used as an input instance. For example, to train a model for CPU Usage, a sequence of CPU Usage samples of length $w$ from each machine is fed into its dedicated model.

  2. Model Choice: LSTM-VAE: Minder specifically uses Variational Autoencoders (VAEs) enhanced with Long Short-Term Memory (LSTM) layers.

    • Variational Autoencoder (VAE): A VAE is a generative model that learns a compressed, lower-dimensional representation (called the latent space or embedding) of input data. It consists of an encoder and a decoder.

      • The encoder maps the input data $x$ to the parameters (mean $\mu$ and standard deviation $\sigma$) of a probability distribution in the latent space, from which a latent variable $z$ is sampled.
      • The decoder then reconstructs the original input data $x'$ from this sampled latent variable $z$.
      • The VAE is trained to minimize the reconstruction error between $x$ and $x'$, while also ensuring that the latent-space distribution is close to a prior distribution (e.g., a standard normal distribution), which encourages continuous and meaningful latent representations.
      • For anomaly detection, VAEs are effective because they learn to reconstruct "normal" data well. Anomalous data, being outside the learned distribution of normal behavior, will be reconstructed poorly or mapped to distinct regions in the latent space. This process inherently performs denoising as the model learns to capture the essential patterns of the data, filtering out transient noise.
    • LSTM Integration: Given that monitoring data is temporal time series, LSTM networks are used within both the encoder and decoder of the VAE. LSTMs are well-suited for processing sequences because they can capture long-term dependencies and context in the data, which traditional neural networks or simpler RNNs might miss. This allows the LSTM-VAE to effectively extract temporal characteristics and patterns from the time series data.

      The following figure (Figure 6 from the original paper) depicts the LSTM-VAE structure for Minder:

      [Figure: Encoder-decoder structure of Minder's LSTM-VAE. The encoder (left) takes the input monitoring data $x_1, x_2, \ldots, x_n$, processes it through LSTM cells, and produces the latent variable $z$; the decoder (right) takes $z$ and outputs the reconstructed data $p(x'_1|z), p(x'_2|z), \ldots, p(x'_n|z)$.]

    The Encoder takes the input monitoring data (e.g., $x_1, x_2, \ldots, x_n$) and processes it through LSTM layers to generate the parameters for the latent variable $z$. The Decoder then takes this latent variable $z$ and uses LSTM layers to reconstruct the data (e.g., $p(x'_1|z), p(x'_2|z), \ldots, p(x'_n|z)$).

    Model parameters, such as hidden_size (e.g., 4), latent_size (e.g., 8), and lstm_layer (e.g., 1), are configured for each model. The output of this training is a set of individual LSTM-VAE models, one for each monitoring metric, capable of denoising and reconstructing time series input into a representative vector (embedding).
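The following is a minimal PyTorch sketch of a per-metric LSTM-VAE of the kind described above. It is not the authors' implementation; only the quoted sizes (hidden_size 4, latent_size 8, one LSTM layer) come from the text, and the decoding scheme and loss weighting are assumptions.

```python
import torch
import torch.nn as nn

class LSTMVAE(nn.Module):
    """Minimal LSTM-VAE sketch: one model is trained per monitoring metric.

    Input: a batch of windows of shape (batch, w, 1), i.e. w consecutive
    samples of a single metric from one machine.
    """
    def __init__(self, input_size=1, hidden_size=4, latent_size=8, lstm_layers=1):
        super().__init__()
        self.encoder = nn.LSTM(input_size, hidden_size, lstm_layers, batch_first=True)
        self.to_mu = nn.Linear(hidden_size, latent_size)
        self.to_logvar = nn.Linear(hidden_size, latent_size)
        self.latent_to_hidden = nn.Linear(latent_size, hidden_size)
        self.decoder = nn.LSTM(hidden_size, hidden_size, lstm_layers, batch_first=True)
        self.output = nn.Linear(hidden_size, input_size)

    def forward(self, x):
        _, (h, _) = self.encoder(x)                  # h: (layers, batch, hidden)
        h_last = h[-1]                               # (batch, hidden)
        mu, logvar = self.to_mu(h_last), self.to_logvar(h_last)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        # Repeat the latent code across the window length and decode.
        dec_in = self.latent_to_hidden(z).unsqueeze(1).repeat(1, x.size(1), 1)
        dec_out, _ = self.decoder(dec_in)
        return self.output(dec_out), mu, logvar      # reconstruction + latent params

def vae_loss(x, recon, mu, logvar):
    # Reconstruction error plus KL divergence to a standard normal prior.
    mse = nn.functional.mse_loss(recon, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + kld
```

A separate instance of this model would be trained for each monitoring metric; at detection time, the denoised reconstruction (or the latent representation) of a machine's window serves as its embedding for the distance check described in Section 4.2.4.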

4.2.3. Monitoring Metric Prioritization (4.3)

To accelerate the detection process, Minder prioritizes metrics based on their sensitivity to faults. This step runs in parallel with the model training.

  1. Step 1: Z-score Calculation for Evaluating Metric Sensitivity: To identify which metrics are most indicative of a fault, Minder uses the Z-score statistic. The Z-score quantifies how many standard deviations an element is from the mean. A higher absolute Z-score suggests that a data point is an outlier and deviates significantly from the typical distribution, making it useful for identifying unusual behavior on a faulty machine.

    The formula for the Z-score of the $i$-th machine for the $j$-th monitoring metric is: $Z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$

    Where:

    • $Z_{ij}$ is the Z-score of the $i$-th machine for the $j$-th monitoring metric.

    • $x_{ij}$ is the sample value of the $i$-th machine for the $j$-th metric at a specific sampling point.

    • $\bar{x}_j$ is the average value of all machines for the $j$-th metric at that sampling point.

    • $s_j$ is the standard deviation of all machines for the $j$-th metric at that sampling point.

      When a fault occurs, the affected machine's metric value $x_{ij}$ will likely deviate significantly from the mean $\bar{x}_j$ of all machines, resulting in a high absolute Z-score. For a given time window of a training task, Minder considers $\max(|Z_{ij}|)$ across all machines for the $j$-th metric to represent the overall dispersion. (A combined sketch of this Z-score computation and the decision-tree prioritization appears after Step 2.)

  2. Step 2: Prioritization of Monitoring Metrics using Decision Tree: Based on the calculated maximum Z-scores for each metric, Minder constructs a decision tree to determine the sensitivity and prioritize metrics.

    • Reasoning for Decision Tree:

      • Resemblance to Rule-based Policies: Decision trees mimic logical, rule-based structures often found in networking monitoring systems (e.g., "if CPU Usage drops to zero, alert").
      • High Expressiveness and Faithfulness: Decision trees are non-parametric and can represent complex decision-making processes effectively.
    • Training Process:

      • Minder collects the maximum Z-score for each metric (from Step 1) for various time windows and training tasks.
      • Each collection of maximum Z-scores (one for each metric) forms an instance.
      • These instances are manually labeled as "normal" or "abnormal" based on whether a faulty machine was present in that time window.
      • A decision tree is then trained using these labeled instances.
    • Prioritization Outcome: The decision tree prioritizes metrics. Metrics closer to the root of the tree are considered more sensitive and informative for identifying faulty machines. As shown in Figure 7 of the paper, PFC-, CPU-, GPU-, and NVLink-related metrics are identified as highly informative. This aligns with empirical observations that CPU and GPU states, and network quality (intra-host NVLink, inter-host PFC), are critical indicators.

      The following figure (Figure 7 from the original paper) shows the top 7 layers of the decision tree for prioritization:

      [Figure: Top layers of the prioritization decision tree. Each node tests the Z-score of a monitoring metric (PFC-, CPU-, and GPU-related metrics among them) and classifies the time window as normal or abnormal.]

    This decision tree visually represents the hierarchy of metric sensitivity, with PFC and CPU metrics appearing at higher levels, indicating their higher importance in fault detection.
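As a hypothetical end-to-end illustration of Steps 1 and 2 (not the authors' code), the sketch below computes the per-metric maximum |Z-score| for labeled time windows, trains a scikit-learn decision tree on them, and reads a metric priority order off the depth at which each metric is first used; the metric names, placeholder labels, and the ranking rule are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

METRICS = ["pfc_rx", "cpu_usage", "gpu_usage", "nvlink_bw", "mem_usage"]  # illustrative

def max_abs_zscore(window: np.ndarray) -> float:
    """window: (num_machines, num_points) samples of one metric in one time window.
    Z_ij is computed per sampling point across machines; the window's score is
    max |Z_ij| over machines and points (an assumption on how points combine)."""
    mean = window.mean(axis=0, keepdims=True)
    std = window.std(axis=0, keepdims=True) + 1e-9
    return float(np.abs((window - mean) / std).max())

# Each training instance: the max |Z| of every metric in one labeled window.
# X, y are placeholders; in practice they come from historical tasks with
# manually labeled normal/abnormal windows.
rng = np.random.default_rng(0)
X = rng.random((200, len(METRICS)))
y = rng.integers(0, 2, size=200)

tree = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X, y)

# Priority heuristic: metrics used closer to the root are treated as more sensitive.
depths: dict[int, int] = {}
def walk(node: int = 0, depth: int = 0) -> None:
    feat = tree.tree_.feature[node]
    if feat >= 0:                                    # internal split node
        depths[feat] = min(depth, depths.get(feat, depth))
        walk(tree.tree_.children_left[node], depth + 1)
        walk(tree.tree_.children_right[node], depth + 1)
walk()
priority = [METRICS[i] for i in sorted(depths, key=depths.get)]
print(priority)                                      # metrics ordered by first-use depth
```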

4.2.4. Online Faulty Machine Detection (4.4)

This is the run-time component of Minder that continuously monitors training tasks.

  1. Iterative Metric Check: Minder follows the prioritized list generated by the decision tree. It selects the highest-priority metric first.

  2. Denoising and Embedding: The monitoring data for the selected metric from each machine is fed into its corresponding LSTM-VAE model. This model denoises the data and outputs a reconstructed embedding (a low-dimensional vector representing the denoised data pattern).

  3. Step 1: Similarity-based Distance Check per Time Window: Minder compares the embeddings from all machines for the current metric to identify outliers.

    • Embedding Reconstruction: For each machine, its $1 \times w$ vector of monitoring data for the selected metric is passed through its LSTM-VAE model to obtain a reconstructed embedding.

    • Pairwise Euclidean Distances: The Euclidean distance is calculated between the embeddings of every pair of machines. Euclidean distance is a common metric for measuring the dissimilarity between two data points in a multi-dimensional space.

      The Euclidean distance between two embeddings, $E_A = (e_{A1}, e_{A2}, \ldots, e_{An})$ and $E_B = (e_{B1}, e_{B2}, \ldots, e_{Bn})$, is given by: $d(E_A, E_B) = \sqrt{\sum_{k=1}^{n} (e_{Ak} - e_{Bk})^2}$

      Where:

      • $d(E_A, E_B)$ is the Euclidean distance between embeddings $E_A$ and $E_B$.
      • $e_{Ak}$ and $e_{Bk}$ are the $k$-th components of embeddings $E_A$ and $E_B$, respectively.
      • $n$ is the dimensionality of the embeddings.
    • Dissimilarity Score Calculation: For each machine, Minder sums its Euclidean distances to all other machines. This sum represents the machine's overall dissimilarity from the rest of the cluster. A machine with a fault will have a significantly higher sum of distances.

    • Normal Score Calculation: To normalize these dissimilarity sums, especially since magnitude can shift with the number of machines, Minder calculates a normal score for each machine's sum of distances. (The paper does not provide an explicit formula for normal score but states its purpose is normalization).

    • Candidate Identification: The machine with the maximum normal score is considered the most likely faulty candidate. If this maximum normal score exceeds a predefined similarity threshold, the machine is identified as a faulty machine candidate for that specific time window.

  4. Step 2: Continuity Check for Consecutive Time Windows: To avoid false alarms from temporary jitters, Minder applies a continuity check.

    • If a machine is identified as a faulty machine candidate in a time window, Minder then checks if the same machine has been consistently identified as a candidate in a series of consecutive time windows.
    • If the same machine continuously exceeds the similarity threshold for a number of consecutive times that surpasses a continuity threshold (e.g., 4 minutes), it is then confirmed as a truly faulty machine. This threshold is empirically chosen to filter out short-term noise while capturing actual persistent performance degradation (as shown in Figure 4). (A combined sketch of the distance and continuity checks appears after this list.)
  5. Iteration: If no faulty machine is detected using the current metric, Minder moves to the next metric in the prioritized list and repeats steps 3 and 4. This process continues until a faulty machine is identified or all prioritized metrics have been checked. If no anomaly is detected after checking all metrics, Minder concludes that no fault is currently present.
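Putting Steps 1 and 2 together, here is a minimal sketch of the per-window distance check and the cross-window continuity check. The "normal score" normalization is not spelled out above, so a z-score over the per-machine distance sums is used as a stand-in, and both thresholds are placeholders.

```python
import numpy as np
from collections import defaultdict

SIM_THRESHOLD = 3.0        # placeholder similarity (outlier) threshold
CONTINUITY_WINDOWS = 4     # e.g. four consecutive one-minute windows

def candidate_for_window(embeddings: np.ndarray) -> tuple[int, float]:
    """embeddings: (num_machines, dim) per-machine LSTM-VAE embeddings for one
    metric in one time window. Returns the most dissimilar machine and its score."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    pairwise = np.sqrt((diff ** 2).sum(axis=-1))        # Euclidean distances (m, m)
    dissimilarity = pairwise.sum(axis=1)                # sum of distances to all peers
    # "Normal score": z-score of the sums, a stand-in for the unspecified
    # normalization over the number of machines.
    score = (dissimilarity - dissimilarity.mean()) / (dissimilarity.std() + 1e-9)
    i = int(score.argmax())
    return i, float(score[i])

streaks: dict[int, int] = defaultdict(int)              # machine -> consecutive hits

def check_window(embeddings: np.ndarray):
    """Step 1 (similarity-based distance check) + Step 2 (continuity check)."""
    i, score = candidate_for_window(embeddings)
    if score > SIM_THRESHOLD:
        streaks[i] += 1
        for other in list(streaks):
            if other != i:
                streaks[other] = 0                      # streak broken for others
        if streaks[i] >= CONTINUITY_WINDOWS:
            return i                                    # confirmed faulty machine
    else:
        streaks.clear()                                 # transient jitter, reset
    return None
```

In Minder's loop, this check would be run for one metric at a time in priority order, moving to the next metric only if no machine is confirmed for the current one.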

5. Experimental Setup

5.1. Datasets

Minder's evaluation dataset comprises 150 run-time fault instances collected from online distributed training tasks over a period of nine months. These tasks utilized 3D parallelism (Data Parallelism, Pipeline Parallelism, and Tensor Parallelism), indicating their large-scale and complex nature.

  • Scale and Scope: The tasks involved diverse scales, ranging from 4 to over 1500 machines, with up to 10,000 NVIDIA Ampere GPUs. This covers a wide spectrum of machine scales, including those depicted in Figure 1 of the paper. Approximately 30% of these tasks involved a minimum of 600 machines.

  • Fault Types: The dataset encompasses all fault types listed in Table 1 (see below for transcription). The dominant fault types observed were:

    • ECC error: 25.7%
    • CUDA execution error: 15%
    • GPU execution error: 10%
    • PCIe downgrading: 8.6%
  • Focus: The dataset primarily focuses on faults occurring in an individual machine, as these constitute 99% of all faulty cases in their production environment. The paper also includes a brief evaluation of concurrent faulty machines.

  • Monitoring Granularity: Monitoring metrics (listed in Appendix B, Table 2) were collected at a second-level granularity.

  • Data Split: Data from the first three months of the dataset was used for training the LSTM-VAE models, while the remaining six months of data were used for evaluation.

  • Ground Truth: Faulty machines in the dataset were manually confirmed through offline log-checking, NCCL tests, or hardware tests.

    The following are the results from Table 1 of the original paper:

    | Category (share of faults) | Fault type | Frequency | Metric 1 | Metric 2 | Metric 3 | Metric 4 | Metric 5 | Metric 6 |
    | Intra-host hardware faults (55.8%) | ECC error | 38.9% | 80.0% | 65.7% | 8.6% | 45.7% | 11.4% | 57.1% |
    | Intra-host hardware faults (55.8%) | PCIe downgrading | 6.6% | 0.0% | 8.3% | 100% | 33.3% | 8.3% | 0.0% |
    | Intra-host hardware faults (55.8%) | NIC dropout | 5.7% | 100% | 100% | 0.0% | 0.0% | 100% | 100% |
    | Intra-host hardware faults (55.8%) | GPU card drop | 2.0% | 75.0% | 70.0% | 5.0% | 50% | 20.0% | 55.0% |
    | Intra-host hardware faults (55.8%) | NVLink error | 1.7% | 83.3% | 50.0% | 16.7% | 50% | 0.0% | 66.7% |
    | Intra-host software faults (28.0%) | CUDA execution error | 0.9% | 25.0% | 25.0% | 0.0% | 25.0% | 25.0% | 25.0% |
    | Intra-host software faults (28.0%) | GPU execution error | 14.6% | 61.9% | 57.1% | 19.0% | 33.3% | 14.3% | 61.9% |
    | Intra-host software faults (28.0%) | HDFS error | 7.7% | 50.0% | 71.4% | 14.3% | 42.9% | 21.4% | 42.8% |
    | Inter-host network faults (6.0%) | Machine unreachable | 5.7% | 57.1% | 57.1% | 0.0% | 14.3% | 0% | 14.3% |
    | Others (10.3%) | (unspecified) | 6.0% | 47.4% | 63.2% | 0.0% | 53.6% | 26.3% | 15.8% |

(Note: The labels of the six metric columns are missing in the provided table image, but based on the text, they would likely correspond to CPU, GPU, PCIe, memory, disk, and PFC/network-related metrics.)

5.2. Evaluation Metrics

For evaluating the performance of Minder, standard metrics from classification tasks are used: Precision, Recall, and F1-score. These metrics are defined based on True Positives (TP), False Negatives (FN), True Negatives (TN), and False Positives (FP).

  • True Positives (TP):

    • Conceptual Definition: Instances where Minder correctly identifies a machine as faulty when it is, in fact, faulty.
    • Symbol Explanation: TP represents the count of correctly detected faulty machines.
  • False Negatives (FN):

    • Conceptual Definition: Instances where Minder fails to detect a faulty machine that is, in fact, faulty, or misses a fault.
    • Symbol Explanation: FN represents the count of faulty machines that Minder failed to detect.
  • True Negatives (TN):

    • Conceptual Definition: Instances where Minder correctly approves a machine as normal when it is, in fact, running normally (i.e., no fault is detected, and no fault exists).
    • Symbol Explanation: TN represents the count of correctly identified normal machines.
  • False Positives (FP):

    • Conceptual Definition: Instances where Minder incorrectly identifies a machine as faulty when there is actually no fault (i.e., a false alarm).

    • Symbol Explanation: FP represents the count of machines incorrectly flagged as faulty.

      Based on these, the following composite metrics are calculated:

  • Precision:

    • Conceptual Definition: Measures the accuracy of positive predictions. It answers: "Of all machines Minder identified as faulty, how many were actually faulty?" High precision means fewer false alarms.
    • Mathematical Formula: $Precision = \frac{TP}{TP + FP}$
    • Symbol Explanation:
      • TP: True Positives (correctly identified faulty machines).
      • FP: False Positives (incorrectly identified faulty machines).
  • Recall:

    • Conceptual Definition: Measures Minder's ability to find all the actual faulty machines. It answers: "Of all the machines that were actually faulty, how many did Minder correctly identify?" High recall means fewer missed faults.
    • Mathematical Formula: $Recall = \frac{TP}{TP + FN}$
    • Symbol Explanation:
      • TP: True Positives (correctly identified faulty machines).
      • FN: False Negatives (faulty machines that Minder missed).
  • F1-score:

    • Conceptual Definition: The harmonic mean of Precision and Recall. It provides a single score that balances both Precision and Recall, which is useful when there is an uneven class distribution (e.g., faults are rare). A high F1-score indicates that Minder has both low false alarms and low missed detections.
    • Mathematical Formula: $F1\text{-}score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
    • Symbol Explanation:
      • Precision: As defined above.
      • Recall: As defined above.
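For completeness, a tiny helper that computes the three scores from raw detection counts; the example counts below are made up purely to land near Minder's reported numbers.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative counts only (chosen to land near the reported 0.904 / 0.883 / 0.893).
print(precision_recall_f1(tp=132, fp=14, fn=18))   # roughly (0.904, 0.880, 0.892)
```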

5.3. Baselines

The paper compares Minder against several baselines and variations in its evaluation:

  • Mahalanobis Distance (MD):

    • Description: MD is a widely used statistical method for identifying outliers. It measures the distance between a point and a distribution, taking into account the correlations between variables. It typically involves calculating features like mean, variance, skewness, and kurtosis, potentially applying Principal Component Analysis (PCA), and then computing pairwise distances.
    • Representativeness: MD is a strong baseline because it's a common and robust multivariate outlier detection technique, suitable for comparing a machine's multi-dimensional metric data against the cluster's distribution. The paper uses it to highlight the importance of Minder's denoising and pattern extraction capabilities.
  • Variations of Minder's Model and Design Choices (Ablation Studies): The paper also conducts ablation studies by comparing Minder against its own variations to validate specific design choices:

    • RAW (Raw Data): Calculates Euclidean Distances directly on the preprocessed raw monitoring data, without using VAE for denoising and reconstruction. This baseline highlights the importance of denoising.
    • CON (Concatenated Embeddings): Concatenates the individual embeddings produced by each metric's LSTM-VAE model into a single, larger embedding, and then calculates distances on this combined embedding. This tests the hypothesis that individual models are better than a simple concatenation.
    • INT (Integrated Model): Trains a single LSTM-VAE model using all monitoring metrics simultaneously as input, rather than training individual models per metric. This tests the hypothesis that individual models are better than a single integrated model due to potential mutual interference between metrics.
    • Minder without Continuity: Compares Minder's performance when the continuity check (Section 3.2, 4.4) is removed, and an alert is triggered immediately upon detecting a candidate in a single time window. This validates the role of the continuity principle in reducing false alarms.
    • Different Distance Measures: Compares Minder using Euclidean Distance with Manhattan Distance (MhtD) and Chebyshev Distance (ChD) to understand the impact of the distance metric on performance.

6. Results & Analysis

6.1. Overall Performance

6.1.1. Total Data Processing Time

Minder demonstrates significant efficiency in fault detection. On average, a call to Minder takes 3.6 seconds to issue an alert. This time includes both data pulling from Data APIs (fetching 15-minute monitoring data) and processing (preprocessing, LSTM-VAE inference, distance checks). This is a dramatic improvement over manual diagnosis. The paper states that this reduces the time by 99% (or 500x shorter) compared to the manual process, which often takes hours or even days (as indicated in Figure 2 of the paper, showing manual diagnosis times often exceeding 30 minutes and sometimes days).

The following figure (Figure 8 from the original paper) shows the total data processing time for a call of Minder:

[Figure: Total time, data pulling time, and processing time versus task index. The x-axis is the task index and the y-axis is time in seconds; the blue line is total time, the yellow line is data pulling time, and the brown line is processing time.]

The figure illustrates that the Total Time (blue line) for Minder's operation is consistently low, with the Processing Time (brown line) being a small fraction of the total, indicating efficient algorithmic execution. The Data Pulling Time (yellow line) often dominates the overall latency, highlighting the overhead of data retrieval from the database.

6.1.2. Comparison with Baseline Algorithm (MD)

Minder is compared against Mahalanobis Distance (MD), a common outlier detection method.

The following are the results from Figure 9 of the original paper:

[Figure: Precision, recall, and F1-score of Minder versus MD; Minder reaches 0.904/0.883/0.893 against MD's 0.788/0.767/0.777.]

Metric Minder MD
Precision 0.904 0.788
Recall 0.883 0.767
F1-score 0.893 0.777

Analysis:

  • Minder significantly outperforms MD across all metrics: Precision (0.904 vs. 0.788), Recall (0.883 vs. 0.767), and F1-score (0.893 vs. 0.777).
  • The higher Precision for Minder (0.904) indicates that it generates significantly fewer false alarms compared to MD.
  • The higher Recall for Minder (0.883) shows it is better at detecting actual faults, missing fewer.
  • The F1-score of 0.893 for Minder demonstrates a strong balance between Precision and Recall, making it a robust detector.
  • Reason for Superiority: MD detects outliers statistically, which can be susceptible to jitters and noise in raw monitoring data. Minder's use of LSTM-VAE for denoising and extracting more representative data patterns before distance calculation allows it to achieve higher accuracy by focusing on persistent, meaningful deviations.

6.1.3. Performance Breakdown with Fault Types

The accuracy of Minder varies slightly depending on the specific fault type.

The following are the results from Figure 10 of the original paper:

[Figure: Performance breakdown by fault type.] (Note: The labels on the x-axis for individual fault types are not clearly readable in the provided image. Based on the paper's text, these include ECC error, CUDA execution error, GPU card drop, machine unreachable, NVLink error, HDFS error, PCIe downgrading, GPU execution error, and AOC errors.)

Analysis:

  • Minder is highly effective in detecting faults related to CPU, GPU states, and networking performances, such as ECC error, CUDA execution error, GPU card drop, machine unreachable, NVLink error, and HDFS error. These faults often manifest clearly in the metrics that Minder monitors.
  • Lower Recall for Specific Faults:
    • GPU execution error and PCIe downgrading show lower Recall. This is attributed to instances where concurrent faulty GPUs or PCIe links within a machine lead to rapid fault propagation across multiple machines (due to 3D parallelism). At second-level granularity, these rapid, widespread impacts are harder for Minder to isolate to a single faulty machine.
    • AOC errors (Active Optical Cable errors) are partially missed due to the current absence of specific optical cable-related counters in the monitoring metrics. The paper notes that these errors are rare and often detected by other dedicated switch monitoring systems.
  • Overall Accuracy: Despite these specific challenges, Minder maintains a high overall accuracy, suggesting that the dominant fault types are well-covered.
  • Nature of False Positives: Many of Minder's "errors" (false alarms) were not entirely incorrect but detected short-term metric fluctuations or performance jitters that were not the root cause of a task halt, but still indicated temporary performance degradation. This suggests Minder is sensitive to even minor deviations.

6.1.4. Performance Breakdown by Fault Occurrences of a Task's Lifetime

The paper also analyzes Minder's performance based on how many faults a task experiences throughout its lifetime (which can span months).

The following are the results from Figure 11 of the original paper:

[Figure: Bar chart of precision, recall, and F1-score (0 to 1) for tasks grouped by fault occurrences over their lifetime ([1,2], (2,5], (5,8], (8,11], (11,∞)), together with the overall scores.]

Fault occurrences over task lifetime   Precision   Recall   F1-score
[1,2]                                  0.92        0.90     0.91
(2,5]                                  0.90        0.88     0.89
(5,8]                                  0.89        0.87     0.88
(8,11]                                 0.88        0.85     0.86
(11,∞)                                 0.91        0.89     0.90
Overall                                0.904       0.883    0.893

Analysis:

  • The results show that Minder's accuracy (Precision, Recall, F-score) is largely independent of the frequency of fault occurrences throughout a task's lifetime.
  • Performance remains consistently high across tasks experiencing varying numbers of faults (e.g., [1,2] faults vs. (11,∞) faults).
  • Reason: This is because faults are treated as random, independent events, and in the production environment, a faulty machine is promptly replaced. Therefore, the history of fault occurrences for a task does not inherently affect Minder's ability to detect a new fault. The slight dip for the (8,11] group is attributed to the limited number of tasks in that specific category, which can introduce statistical variability.

6.2. Analysis of Monitoring Metric Selection

To validate the proper selection of monitoring metrics (as discussed in Section 4.3), an experiment was conducted by varying the number of metrics used for training and detection.

The following are the results from Figure 12 of the original paper:

[Figure: Bar chart comparing precision, recall, and F1-score for Minder's metric selection versus fewer and more metrics. Minder reaches 0.904 precision, above 0.866 with fewer metrics and 0.883 with more metrics.]

Metric Selection Precision Recall F1-score
Minder (Optimal) 0.904 0.883 0.893
Fewer Metrics 0.866 0.841 0.853
More Metrics 0.883 0.905 0.894

Analysis:

  • Minder's Optimal Selection: The current optimal selection of metrics used by Minder achieves the best Precision (0.904) and a high F1-score (0.893), indicating a strong balance and minimal false alarms.
  • "Fewer Metrics": Using fewer metrics results in lower Precision, Recall, and F1-score. This suggests that excluding key metrics undermines the outlier detection capacity, as crucial fault indicators might be missing.
  • "More Metrics": Including more metrics (e.g., additional GPU metrics like GPU Temperature, GPU Clocks, GPU Memory Bandwidth Usage, GPU FP Engine Activity) leads to a higher Recall (0.905) but a lower Precision (0.883) compared to Minder's optimal selection.
    • Interpretation: While more metrics might catch more true faults (higher Recall), they also introduce more mutual interference. Different metrics can indicate different patterns, potentially obfuscating the detection process and leading to more false positives (lower Precision).
  • Conclusion: This ablation study validates Minder's strategy of carefully selecting and prioritizing metrics. The top metrics identified through prioritization are sufficient to cover potential malfunction points, avoiding the pitfalls of both too few (missing indicators) and too many (mutual interference) metrics.

6.3. Analysis of Model Selection

The paper compares Minder's choice of LSTM-VAE and its individual per-metric model approach against other statistical methods and VAE variations.

The following are the results from Figure 13 of the original paper:

[Figure: Bar chart comparing Minder with RAW, CON, and INT on precision, recall, and F1-score; Minder outperforms the alternatives on all three, most visibly on precision and F1-score.]

Model Selection Precision Recall F1-score
Minder (LSTM-VAE) 0.904 0.883 0.893
RAW 0.758 0.741 0.749
CON 0.801 0.789 0.795
INT 0.793 0.778 0.785

Analysis:

  • Minder's Superiority: Minder (using individual LSTM-VAE models) clearly outperforms all other variants in terms of Precision, Recall, and F1-score.
  • RAW (Raw Data): Shows significantly worse performance across all metrics (e.g., F1-score of 0.749). This highlights the crucial role of denoising and reconstruction provided by the VAE models. Raw data contains too much noise and jitters, leading to poor outlier detection.
  • CON (Concatenated Embeddings) & INT (Integrated Model): Both CON and INT perform worse than Minder (F1-scores of 0.795 and 0.785, respectively).
    • Reasoning: These approaches either concatenate embeddings from individual models (CON) or train a single VAE model with all metrics as input (INT). The degradation in performance for CON and INT supports Minder's design choice of individual models. When multiple metrics are treated with equal significance, or integrated into one model, they can introduce mutual interference. Different metrics have varying sensitivities to faults, and their time series may not fluctuate in the same manner. A monolithic model can get "misdirected or confused" by the array of metrics, leading to less accurate detection.
  • LSTM-VAE Effectiveness: The paper also notes that the Mean Squared Error (MSE) between the input and reconstructed data of the LSTM-VAE is less than 0.0001, demonstrating its high effectiveness in reconstructing (and thus denoising) the data.
  • Conclusion: This analysis strongly validates the choice of LSTM-VAE for denoising and the strategy of training individual models for each monitoring metric as critical components for Minder's superior performance.

6.4. Analysis of Continuity and Threshold

To verify the importance of the continuity principle (Section 3.2), Minder's performance is compared with and without its application.

The following are the results from Figure 14 of the original paper:

[Figure: Precision, recall, and F1-score of Minder with and without the continuity check: 0.904/0.883/0.893 with continuity versus 0.757/0.777/0.767 without.]

Model Precision Recall F1-score
Minder (with Continuity) 0.904 0.883 0.893
Without Continuity 0.757 0.777 0.767

Analysis:

  • Impact of Continuity: Minder with continuity significantly outperforms Minder without continuity across all metrics (F1-score of 0.893 vs. 0.767).
  • Reduction in False Alarms: The most notable difference is in Precision (0.904 vs. 0.757). This indicates that the continuity check is highly effective in reducing false alarms that would otherwise be triggered by instant bursts, temporary counter noises, or short-term jitters in the monitoring data. Without continuity, the system would immediately alert on any detected deviation, leading to many spurious notifications.
  • Fault Persistence: The continuity principle is based on the observation that actual faults typically lead to a deteriorated performance that persists for a period (as shown in Figure 4 of the paper, where most abnormal patterns last over five minutes). By requiring a machine to show dissimilarity for a continuous duration, Minder effectively filters out transient, non-fault-related events.
  • Threshold Selection: The continuity threshold is set to four minutes. This value is chosen empirically based on the observed duration of real-world faults. A shorter threshold would reintroduce more false alarms, while a longer one might delay detection or miss some legitimate, but shorter-lived, faults.

6.5. Choice of Distance Measures

The paper investigates the impact of different distance measures on Minder's performance by comparing Euclidean Distance with Manhattan Distance (MhtD) and Chebyshev Distance (ChD).

The following are the results from Figure 15 of the original paper:


Distance Measure Precision Recall F1-score
Euclidean Distance 0.904 0.883 0.893
Manhattan Distance 0.901 0.878 0.889
Chebyshev Distance 0.871 0.855 0.863

Analysis:

  • Robustness to Distance Measure: Minder achieves similar high performance with both Euclidean Distance (0.893 F1-score) and Manhattan Distance (0.889 F1-score). This suggests that the embeddings produced by the LSTM-VAE are already highly representative and that faulty machines are distinctly clear outliers in this latent space, regardless of the specific distance metric used.
  • Euclidean vs. Manhattan: Both Euclidean and Manhattan distances perform very well. Manhattan Distance (sum of absolute differences along each dimension) implies that the spatial distribution of the embeddings provides a solid representation of machine state.
  • Chebyshev Distance: Chebyshev Distance (the largest difference along any single dimension) yields slightly worse Precision (0.871) and F1-score (0.863). This suggests that relying on a single maximum spatial difference might be insufficient for accurately comparing dissimilarity, as faults might manifest across multiple dimensions but not necessarily with an extreme value in just one. Minder's approach of summing distances (either Euclidean or Manhattan) considers the overall deviation from normal machines, intensifying the dissimilarity for true outliers.
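For reference, a minimal illustration of the three distance measures compared above, computed on a pair of made-up embeddings with scipy:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, chebyshev

a = np.array([0.10, 0.82, 0.33])   # made-up embeddings of two machines
b = np.array([0.55, 0.20, 0.31])

print(euclidean(a, b))   # square root of summed squared differences
print(cityblock(a, b))   # Manhattan: sum of absolute differences
print(chebyshev(a, b))   # Chebyshev: largest difference along any one dimension
```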

6.6. Performance with Multiple Concurrent Faulty Machines

The ability of Minder to detect multiple concurrent faulty machines is explored; its success depends on the ratio of faulty machines in the cluster and on the monitoring data granularity.

  • Challenges at Scale: In production environments with 3D parallelism, multiple faulty machines can rapidly impact many DP and PP groups, causing fast negative propagation across the entire cluster.
    • Coarse Granularity Limitation: The standard second-level monitoring granularity becomes a limitation here. Because an iteration lasts only tens of milliseconds, the fault propagates cluster-wide before the next second-level sample is taken, so the distinct pattern of each individual faulty machine is smoothed away by the widespread group effect.
  • Real-world Concurrent Faults: Concurrent faulty machine instances in their production environment are rare (less than 1% per month), primarily due to switch reboots or AOC errors. When a switch reboots (e.g., 32 out of 600 machines going offline), Minder struggles to distinguish individual faulty outliers due to the high fault ratio and rapid, widespread impact. In such cases, dedicated switch monitoring systems are more effective.
  • Millisecond-level Monitoring Experiment: To demonstrate Minder's potential, an experiment was conducted by injecting PCIe downgrading into two of four machines simultaneously, using customized millisecond-level monitoring.
    • The following figure (Figure 16 from the original paper) shows the millisecond-level NIC throughput for all machines after injection of PCIe downgrading on two NICs:

      Figure 16 (placeholder): NIC throughput over time for normal NICs and the NICs with degraded PCIe links, with one Reduce-Scatter step marked.

    • Results: With millisecond-level data from the NICs, Minder successfully detected the two NICs connected to the faulty PCIe links. These NICs showed the largest outlier distances during Reduce-Scatter operations.

    • Explanation: Normal NICs show high throughput at the start of a Reduce-Scatter step and then drop to zero while waiting for slower NICs, whereas the NICs with degraded PCIe links exhibit steady, low throughput. This distinct pattern is clearly visible at millisecond-level granularity; a synthetic sketch of this separation appears after this list.

  • Conclusion: This experiment suggests that with finer-grained monitoring data, Minder is capable of detecting multiple concurrent faults, as the rapid propagation and individual straggler behavior become observable. The current limitation for widespread multi-fault detection is primarily the standard second-level granularity.
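
To illustrate why the degraded NICs stand out at millisecond granularity, the following sketch builds synthetic throughput traces (burst-then-idle for healthy NICs, steady-and-low for degraded ones) and applies the same summed-distance scoring as above. All numbers are made up for illustration and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
ms = 200                                             # millisecond samples in one step
healthy = np.zeros((6, ms))
healthy[:, :60] = 90 + rng.normal(0, 5, (6, 60))     # burst early, then idle near 0 Gbps
degraded = 25 + rng.normal(0, 2, (2, ms))            # steady, low throughput throughout
traces = np.vstack([healthy, degraded])              # NICs 0-5 healthy, 6-7 degraded

# Same summed-distance scoring as before, applied directly to the raw traces.
diff = traces[:, None, :] - traces[None, :, :]
scores = np.sqrt((diff ** 2).sum(axis=-1)).sum(axis=1)
print(np.argsort(scores)[-2:])                       # the two degraded NICs rank highest
```

At second-level granularity both trace shapes would average out to similar per-second byte counts, which is why the finer sampling is what makes the two stragglers separable.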

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Minder, an automatic system for detecting faulty machines in large-scale distributed model training tasks. Minder tackles the challenges of frequent faults, lengthy manual diagnosis, and the complexity of monitoring diverse fault types in dynamic environments. Its core design leverages machine-level similarity (comparing a machine's metrics to its peers), machine-level continuity (ensuring persistent abnormal patterns), individual learning-based denoising models (using LSTM-VAE for each monitoring metric), and metric prioritization to achieve efficient and accurate detection.

Deployed in a production environment for over a year, Minder monitors daily tasks involving thousands of machines. Evaluation results demonstrate its effectiveness: it reacts to faults within 3.6 seconds on average, with a precision of 0.904 and an F1-score of 0.893, and it reduces manual debugging time by over 99%, validating its design choices and practical utility.

7.2. Limitations & Future Work

The authors highlight several limitations and propose future research directions:

  • Limitations in Detecting Certain Faults:

    • AOC errors and rapidly propagating concurrent faults (especially when the fault ratio is high, e.g., an entire switch reboot) are hard to capture effectively with current second-level monitoring granularity and available metrics.
    • The ground-truth label marks the root-cause machine, whereas a machine flagged by Minder may only be exhibiting temporary performance fluctuations that impact training without being the root cause.
  • Future Work Directions:

    • Finer Data Granularity: Exploring the benefits of millisecond-level monitoring (as hinted by the concurrent fault experiment in Section 6.6) to better capture rapid fault propagation and individual straggler behaviors that occur within single training iterations (tens of milliseconds).
    • Expanded Metric Spectrum: Incorporating additional hardware counters (e.g., AOC counters) and in-band traces (e.g., Torch Profiler, Megatron-LM timer, CUDA event timer) to provide more fine-grained information on training and collective communication operations, leading to more comprehensive monitoring.
    • Robustness to New/Rare Fault Types: Validating that Minder detects new or rare fault types in practice; by design it should, as long as a fault manifests as a discernible dissimilarity in the monitoring data.
    • Detection of Concurrent Faults: Further enhancing Minder's ability to detect multiple concurrent faulty machines by potentially refining its similarity comparison or adapting to higher fault ratios.
    • Applicability to Other Workloads: Investigating Minder's effectiveness in other distributed ML workloads like large-scale inference and fine-tuning, assuming these workloads also exhibit inter-machine metric similarity and fault continuity.
    • Root Cause Analysis: Currently, Minder identifies faulty machines but does not pinpoint the exact root cause. Future work aims to design fine-grained run-time monitoring for automated root cause identification, reducing the need for extra manual labor for network jitter and straggler analysis.

7.3. Personal Insights & Critique

  • Inspirations:

    • Leveraging Domain-Specific Homogeneity: The paper's core insight into machine-level similarity in 3D parallel distributed training is highly inspiring. By understanding the inherent homogeneity of these specific workloads, Minder can effectively use an unsupervised approach, avoiding the challenges of labeled data and task-dependent thresholds. This highlights the power of combining deep domain knowledge with machine learning.
    • Practicality and Impact: The deployment in a production environment for over a year, with significant time savings, demonstrates strong practical value. It showcases how a well-designed automated system can alleviate a critical operational bottleneck in AI infrastructure.
    • Robustness through Multi-faceted Design: The combination of denoising (LSTM-VAE), continuity checks, and metric prioritization creates a robust system. Each component addresses a specific challenge (noise, false positives, efficiency), leading to a highly effective overall solution.
  • Potential Issues, Unverified Assumptions, or Areas for Improvement:

    • Assumption of Homogeneity: Minder's reliance on machine-level similarity assumes a relatively homogeneous training environment and balanced workloads. If future distributed training paradigms become highly heterogeneous (e.g., using machines with vastly different hardware, or highly imbalanced task distributions), this core assumption might be weakened, impacting Minder's effectiveness.
    • Root Cause Analysis Gap: While Minder excels at identifying which machine is faulty, the lack of automated root cause analysis means human engineers are still needed for the next diagnostic step. This could become a new bottleneck if the frequency of faults increases further or if the root causes become more obscure.
    • Empirical Thresholds: The continuity threshold of 4 minutes is empirically determined. While effective, an adaptive or dynamic thresholding mechanism, potentially learned from historical data or adjusted based on workload characteristics, might offer greater robustness and flexibility.
    • Granularity vs. Overhead Trade-off: The discussion on millisecond-level monitoring reveals a clear path for detecting faster, more complex multi-machine faults. However, the overhead of millisecond-level monitoring (data collection, storage, processing) for thousands of machines is significant and remains a practical challenge for widespread deployment.
    • "Normal Score" Definition: The paper mentions calculating a "normal score" for each machine's sum of distances but does not provide a specific formula. While the concept is clear, the lack of explicit mathematical definition might pose a slight challenge for full reproducibility without access to implementation details.
    • Generalizability to Diverse Faults: While Minder is designed to be general, its performance on truly novel or unforeseen fault types (beyond the observed 150 instances) remains to be fully tested. The LSTM-VAE will learn the "normal" manifold, but if a new fault creates a pattern that still lies close to this manifold in the latent space, it might be missed.
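
To make the "normal score" critique concrete, the sketch below shows one possible, entirely hypothetical formulation as a robust z-score over the summed distances. The paper does not specify this formula; the function name, the MAD-based normalization, and the toy inputs are all assumptions.

```python
import numpy as np

def robust_normal_score(distance_sums: np.ndarray) -> np.ndarray:
    """Hypothetical normal score: robust z-score of each machine's summed distance."""
    med = np.median(distance_sums)
    mad = np.median(np.abs(distance_sums - med)) + 1e-9   # guard against zero spread
    return (distance_sums - med) / (1.4826 * mad)          # ~std units under normality

sums = np.array([10.1, 9.8, 10.3, 10.0, 42.7, 9.9])        # made-up summed distances
print(robust_normal_score(sums).argmax())                   # -> 4, the suspicious machine
```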
