Minder: Faulty Machine Detection for Large-scale Distributed Model Training
TL;DR Summary
Minder is an automated faulty machine detection system for large-scale distributed model training. It accurately identifies fault patterns with an average response time of 3.6 seconds, achieving 0.904 precision and 0.893 F1-score, significantly reducing manual diagnosis time and cost.
Abstract
Large-scale distributed model training requires simultaneous training on up to thousands of machines. Faulty machine detection is critical when an unexpected fault occurs in a machine. From our experience, a training task can encounter two faults per day on average, possibly leading to a halt for hours. To address the drawbacks of the time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect faulty distinctive monitoring metric patterns, which could last for a period before the entire training task comes to a halt. Minder has been deployed in our production environment for over one year, monitoring daily distributed training tasks where each involves up to thousands of machines. In our real-world fault detection scenarios, Minder can accurately and efficiently react to faults within 3.6 seconds on average, with a precision of 0.904 and F1-score of 0.893.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Minder: Faulty Machine Detection for Large-scale Distributed Model Training". It focuses on automatically identifying malfunctioning machines within large-scale distributed machine learning training environments.
1.2. Authors
The authors and their affiliations are:
- Yangtao Deng (Tsinghua University)
- Xiang Shi (ByteDance)
- Zhuo Jiang (ByteDance)
- Xingjian Zhang (Tsinghua University)
- Lei Zhang (ByteDance)
- Zhang Zhang (ByteDance)
- Bo Li (ByteDance)
- Zuquan Song (ByteDance)
- Hang Zhu (ByteDance)
- Gaohong Liu (ByteDance)
- Fuliang Li (Northeastern University)
- Shuguang Wang (ByteDance)
- Haibin Lin (ByteDance)
- Jianxi Ye (ByteDance)
- Minlan Yu (Harvard University)
The paper is a collaboration primarily between Tsinghua University, ByteDance, Northeastern University, and Harvard University, indicating a mix of academic research and industry application, particularly from a large technology company involved in distributed systems.
1.3. Journal/Conference
The paper was published in the Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI).
NSDI is a highly reputable and influential conference in the field of computer systems, particularly for networked systems, operating systems, and distributed systems. Publication at NSDI signifies high-quality, impactful, and often practical research in systems design and implementation.
1.4. Publication Year
April 28-30, 2025.
1.5. Abstract
Large-scale distributed model training, especially for models like Large Language Models (LLMs), involves thousands of machines, making faulty machine detection critical. The authors' experience shows that a training task can encounter an average of two faults per day, potentially causing hours-long halts. To overcome the limitations of time-consuming and labor-intensive manual diagnosis, the paper proposes Minder, an automatic faulty machine detector. The core idea of Minder is to efficiently detect distinctive, faulty monitoring metric patterns that persist for a period before a complete task halt. Minder has been deployed in a production environment for over a year, monitoring daily distributed training tasks with up to thousands of machines. In real-world scenarios, Minder achieves an average fault reaction time of 3.6 seconds, with a precision of 0.904 and an F1-score of 0.893.
1.6. Original Source Link
https://www.usenix.org/conference/nsdi25/presentation/deng PDF Link: https://www.usenix.org/system/files/nsdi25-deng.pdf The paper is officially published and available through the USENIX conference proceedings.
2. Executive Summary
2.1. Background & Motivation
The rapid growth in dataset sizes and model parameters, particularly in Large Language Models (LLMs), has necessitated large-scale distributed model training involving thousands of machines and GPUs. While advancements in distributed training, collective communication, and fault tolerance have made this feasible, the immense scale and complexity also significantly increase the potential for faults.
The core problem Minder aims to solve is the inefficient and costly detection of faulty machines in large-scale distributed model training environments.
- Importance of the problem:
- High Fault Frequency: In production environments, an accidental hardware or software fault occurs, on average, twice a day. These faults can impact thousands of machines and GPUs.
- Significant Downtime & Economic Loss: A single fault can force an entire training task to halt for hours or even days, leading to substantial economic losses (e.g., $1700 for 40 minutes on a 128-machine task). Frequent interruptions severely increase operating expenses and time costs for training large models.
- Drawbacks of Current Solutions: The prevailing method is manual diagnosis, which is:
- Time-consuming: Engineers are alerted only after a task has entirely stopped, missing performance degradations. Diagnosis often takes over half an hour, sometimes days (as shown in Figure 2 of the paper).
- Labor-intensive: Multiple teams (training, networking, storage, hardware) are involved.
- Incomplete/Redundant Scrutiny: Manual log checks are often incomplete or contain redundant information, making fault identification difficult.
- Lack of Timely Notification: Faults that cause slowdowns but not complete halts go undetected until the task eventually stops.
- Paper's Entry Point/Innovative Idea: Minder addresses these challenges by recognizing that a machine experiencing a fault will exhibit distinctive, abnormal patterns in its monitoring metrics that differ from healthy machines and persist over time. Instead of relying on manual inspection or generic anomaly detection, Minder automatically and efficiently identifies these patterns by leveraging machine-level similarity, continuity of abnormal behavior, individual learning-based denoising models for each metric, and metric prioritization.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- Empirical Investigation of Faults: A detailed analysis of real-world fault types encountered in large-scale distributed training and their correlations with various monitoring metrics. This investigation highlights why certain metrics are more sensitive and the inherent challenges in detection (Sections 2.3 and 2.4).
- Novel Design Principles: The introduction of key design ideas for Minder: machine-level similarity (comparing a machine's metrics to its peers), machine-level continuity (detecting persistent abnormal patterns), individual learning-based denoising models (using an LSTM-VAE for each metric to handle noise and task-dependent baselines), and metric prioritization (to expedite detection).
- System Implementation and Deployment: The design, implementation, and deployment of Minder in a production environment for over a year, monitoring daily distributed training tasks involving up to thousands of machines.
- Thorough Evaluation and Ablation Studies: A comprehensive evaluation of Minder's performance in real-world fault detection scenarios, demonstrating its fast reaction time, high accuracy (precision of 0.904, F1-score of 0.893), and the effectiveness of its design choices through ablation experiments.
- Practical Lessons and Future Directions: Sharing practical lessons learned from deploying Minder and outlining future research directions to further enhance fault detection capabilities.

The key findings demonstrate that Minder can accurately and efficiently react to faults within 3.6 seconds on average, significantly reducing detection time compared to manual methods. It effectively addresses the challenges of diverse fault types, task-dependent normal states, non-one-to-one metric-fault correlations, and noise in monitoring data.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Minder, a reader should be familiar with the following concepts:
-
Distributed Model Training:
- Large Language Models (LLMs): AI models with billions or trillions of parameters, requiring massive computational resources. Examples include GPT-4.
- 3D Parallelism (Data Parallelism, Model Parallelism, Pipeline Parallelism): Techniques to distribute the computational load of training large models across multiple devices (GPUs) and machines.
  - Data Parallelism (DP): The same model is replicated on multiple devices, and each device processes a different subset of the training data. Gradients are then synchronized across devices (e.g., using All-reduce).
  - Model Parallelism (MP) / Tensor Parallelism (TP): The model itself is too large to fit on a single device, so its layers or parts of a layer (tensors) are split across multiple devices. Each device computes a portion of the model.
  - Pipeline Parallelism (PP): Different layers of the model are assigned to different devices, forming a pipeline. Data batches are processed in a pipelined fashion, with each device passing intermediate activations to the next.
- Homogeneous vs. Heterogeneous Environments: Minder operates in a largely homogeneous environment, where machines have similar hardware (GPUs, NICs) and configurations, which is crucial for its similarity assumption.
-
Machine Monitoring Metrics: Quantitative measurements of a machine's hardware and software components.
- CPU Usage: Percentage of time the Central Processing Unit is actively working.
- GPU Usage / Duty Cycle: Percentage of time the Graphics Processing Unit is actively processing tasks.
- Memory Usage: Amount of RAM currently in use.
- Disk Usage: Amount of storage space used on a disk.
- Network Metrics:
  - Throughput (TCP/RDMA): The rate at which data is successfully transmitted over a network connection. RDMA (Remote Direct Memory Access) allows direct memory access between computers without involving the CPU, crucial for high-performance computing.
  - PFC (Priority-based Flow Control) Packet Rates: A mechanism in Ethernet networks to prevent packet loss due to congestion by pausing traffic for specific priority levels. Surges in PFC packets often indicate network congestion.
  - ECN (Explicit Congestion Notification) Packet Rate: A feature in TCP/IP networks that allows routers to mark packets to signal network congestion to endpoints, without dropping them.
  - CNP (Congestion Notification Packet) Rate: Used in RoCEv2 (RDMA over Converged Ethernet) to explicitly signal congestion back to the sender.
- Hardware-specific Metrics:
- PCIe (Peripheral Component Interconnect Express) Bandwidth/Usage: High-speed serial computer expansion bus used to connect hardware devices. Degradation indicates issues with data transfer to/from components like GPUs, NICs.
- NVLink Bandwidth: A high-speed interconnect developed by NVIDIA for GPU-to-GPU and GPU-to-CPU communication, offering much higher bandwidth than PCIe.
  - ECC (Error-Correcting Code) Errors: Errors in memory that can be corrected by ECC memory, but frequent errors can indicate hardware instability.
-
Anomaly Detection:
- Unsupervised Learning: A type of machine learning where the algorithm learns patterns from unlabeled data. It's suitable for anomaly detection when "normal" data is abundant but "anomalous" data is rare and undefined.
- Time Series Data: A sequence of data points indexed in time order. Monitoring metrics are typically time series.
- Denoising: The process of removing noise from data to reveal underlying patterns.
- Variational Autoencoder (VAE): A type of generative neural network that learns a compressed, meaningful representation (latent space) of its input data. It can be used for denoising by reconstructing input data from its latent representation, effectively filtering out noise.
- LSTM (Long Short-Term Memory): A type of recurrent neural network (RNN) capable of learning long-term dependencies in sequential data, making it suitable for time series analysis.
- Z-score: A statistical measure that describes a value's relationship to the mean of a group of values, measured in terms of standard deviations from the mean. It helps identify outliers.
- Decision Tree: A flowchart-like structure where each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (decision). Used for classification and prioritization.
- Euclidean Distance: The straight-line distance between two points in Euclidean space. Used to measure similarity/dissimilarity between data points or embeddings.
3.2. Previous Works
The paper categorizes related work into three main areas:
-
Intra-host Diagnosis:
- Run-time diagnosis: Focuses on detecting anomalies without interrupting running tasks. This often involves leveraging existing hardware counters (e.g., Intel Performance Counter Monitor [8], NVIDIA System Management Interface [14]) or building dependency graphs from historical monitoring data (e.g., BRCA [63], MonitorRank [43], FChain [62]) using Deep Learning (DL) algorithms. Log-based approaches using Natural Language Processing (NLP) are also used (e.g., [19]) but are limited by log content.
- Offline diagnosis: Requires specific tools to detect bottlenecks when the machine is not running tasks. Examples include Liu et al. [55] and Martinasso et al. [58] for NVLink/PCIe link degradation, and Collie [44] for RDMA subsystem anomalies. These are not suitable for real-time fault identification during training. Minder's approach falls under run-time intra-host diagnosis, but with an emphasis on detecting faulty machines rather than specific component bottlenecks, and, importantly, it operates continuously during live training.
-
Inter-host Diagnosis:
- Focuses on network failures between machines. NetBouncer [73] localizes failed devices/links using IP-in-IP. SNAP [79] monitors TCP statistics for network diagnosis. Pingmesh [33], Haecki [34], and Fathom [66] monitor end-to-end latencies or RPC performance. These primarily detect network path failures, not intra-host hardware faults affecting training speed.
- Cloud system diagnosis (e.g., Conan [47], Sage [26], Traceark [81]) focuses on microservices and contextual data patterns in cloud environments. These often assume different dependencies and workload characteristics than the homogeneous and highly synchronized nature of distributed model training. Minder differentiates by focusing on machine-level faults, which can then manifest as inter-host network issues (e.g., PFC surges), but the root cause is often intra-host. The assumption of similar workloads across machines in distributed training makes Minder's similarity-based approach applicable, unlike general cloud systems.
-
Algorithms for Anomaly Detection and Diagnosis:
- Statistics-based methods: Use statistical measures like Euclidean distance, Pearson Correlation [25], Kendall's tau [42], Spearman Correlation [80], or Mahalanobis Distance (MD) [30, 46, 57] to quantify similarity and identify deviations. These often require manual threshold setting. Minder uses the Z-score for prioritization and Euclidean distance for similarity, but builds on denoised data.
  - Mahalanobis Distance (MD): A measure of the distance between a point and a distribution, accounting for correlations between variables:
    $$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$
    Where:
    - $x$ is the data point (vector of observations).
    - $\mu$ is the mean vector of the observations.
    - $\Sigma^{-1}$ is the inverse of the covariance matrix of the observations.
    - $(\cdot)^T$ denotes the transpose.
    (A small numerical sketch of MD appears at the end of this subsection.)
- Supervised algorithms: Widely used for anomaly detection when labeled data is available (e.g., EGADS [45] for time series, random forest [64] for incident routing [27]). Minder explicitly states that supervised learning is inappropriate because "normal" states are task-dependent, and labeling is impractical (Challenge 2). The goal is to identify which machine is faulty, not just classify normal/abnormal.
- Unsupervised learning: Identifies outliers from monitoring data. This includes clustering (e.g., [50, 51, 56]) and methods using VAE (e.g., [52, 70, 71, 75, 78]). Minder leverages unsupervised learning, specifically LSTM-VAE, for denoising and representation learning, which aligns with this category but applies it in a novel way for distributed training fault detection.
3.3. Technological Evolution
The evolution of distributed systems, particularly for AI training, has driven the need for advanced fault detection. Early distributed systems relied on simpler health checks and manual debugging. As models grew into the billions and trillions of parameters (e.g., GPT-4, PaLM), training required thousands of GPUs and complex parallelism strategies (DP, TP, PP). This increased scale introduced:
-
Higher fault frequency: More components mean more points of failure.
-
Cascading failures: A single fault can quickly propagate, halting the entire task.
-
Increased cost of downtime: Idle high-end GPUs are extremely expensive.
-
Complex interdependencies: Pinpointing root causes becomes harder with complex software stacks and hardware interactions.
This evolution renders traditional manual diagnosis, which was already slow and expensive, completely unsustainable. Simple threshold-based anomaly detection often fails due to the task-dependent nature of "normal" behavior and the presence of noise. Supervised learning is hampered by the lack of labeled fault data and the dynamic nature of "abnormal" patterns.
Minder fits into this technological timeline by proposing an automated, real-time, and robust solution tailored to the specific challenges of modern large-scale distributed training. It moves beyond simple statistical methods and generic anomaly detection by incorporating domain-specific insights like machine-level similarity and continuity within the framework of unsupervised learning (LSTM-VAE) and strategic metric prioritization.
3.4. Differentiation Analysis
Minder's core differences and innovations compared to existing methods:
- Automatic vs. Manual: Unlike the labor-intensive manual diagnosis, Minder fully automates the detection process, reducing reaction time from hours or days to seconds.
- Real-time vs. Post-mortem: Minder detects faults at runtime, often before the entire task halts, addressing the issue of delayed notifications in manual methods.
- Unsupervised & Task-Agnostic vs. Supervised/Threshold-based:
  - Traditional supervised methods struggle with the task-dependent nature of "normal" metric values and the scarcity of labeled fault data.
  - Simple threshold-based methods are brittle to varying workloads.
  - Minder uses unsupervised learning (LSTM-VAE) to learn normal patterns for a given task without requiring pre-labeled fault data. Its similarity-based approach compares machines within the same task, making it adaptive to diverse workloads.
- Handles Noise & Jitters: Minder explicitly incorporates LSTM-VAE for denoising and continuity checks to filter out transient noise, a common pitfall for statistical methods like Mahalanobis Distance (as shown by its lower accuracy).
- Multi-metric & Prioritized Approach vs. Monolithic/Single-metric: Acknowledging that no single metric indicates all fault types and that correlations are complex, Minder uses individual models for each metric. This avoids the "mutual interference" that a single, integrated model (like CON or INT in the evaluation) might face. Metric prioritization further optimizes detection speed by checking the most sensitive metrics first, a strategic application of domain knowledge.
- Focus on Machine-level Faults in Homogeneous Distributed Training: Minder leverages the inherent similarity of metric patterns across machines in 3D parallel distributed training. This strong assumption allows it to effectively detect outliers, a characteristic not present in heterogeneous cloud microservice environments or general network diagnosis tools.
4. Methodology
The Minder framework is designed as an automatic, responsive, and accurate system to detect faulty machines causing unexpected halts in distributed training tasks. It addresses challenges like diverse fault types, task-dependent normal states, complex metric-fault correlations, and noisy monitoring data.
The following figure (Figure 5 from the original paper) shows the system architecture of Minder:
The figure is a flow diagram of monitoring data preprocessing and online faulty machine detection, covering per-metric model training, monitoring metric prioritization, the similarity-based distance check, and the continuity check.
4.1. Principles
Minder's design revolves around four key principles to overcome the identified challenges:
- Machine-level Similarity (3.1): Distributed training, especially with 3D parallelism (Data Parallelism, Pipeline Parallelism, Tensor Parallelism), typically ensures a balanced load across machines. This leads to similar fluctuations in monitoring metrics among healthy machines. A faulty machine, however, will exhibit distinctive differences in its metric patterns compared to its peers. Minder leverages this by detecting outliers in metric data, regardless of the specific fault type, by measuring "distance" from other machines. This addresses Challenge 1 (various ways a machine can fail) and Challenge 2 (task-dependent normal states) by focusing on relative abnormality within a task rather than absolute thresholds.
- Machine-level Continuity (3.2): Transient jitters or short-term noise can lead to false alarms. Minder incorporates the idea that actual faults typically cause abnormal performance that persists for a period (e.g., minutes). By requiring a detected abnormality to be continuous over several time windows, Minder filters out bursty noise and temporary fluctuations, enhancing robustness and further addressing Challenge 2 (differentiating true faults from transient blips).
- Individual Learning-Based Denoising Models for Each Monitoring Metric (3.3): Monitoring data inherently contains noise (e.g., jitters, sensor inaccuracies). Minder uses Variational Autoencoders (VAEs) with LSTM layers to denoise and reconstruct raw time series data. Importantly, it trains individual models for each monitoring metric. This addresses Challenge 3 (non-one-to-one correlation between fault types and metrics) and Challenge 4 (noisy data). Training separate models avoids the mutual interference that could arise if a single model tried to process all metrics simultaneously, especially since different metrics have varying sensitivities to different fault types.
- Prioritized Metric Sequence (3.4): To expedite detection, Minder does not check all available metrics at once. Instead, it generates a prioritized list of metrics based on their sensitivity to faults and uses the models corresponding to the top-prioritized metrics first. This ensures that the most indicative metrics are checked earliest, leading to faster fault identification.
4.2. Core Methodology In-depth (Layer by Layer)
The Minder framework consists of four main components: Preprocessing, Per-metric Model Training, Monitoring Metric Prioritization, and Online Faulty Machine Detection.
4.2.1. Monitoring Data Preprocessing (4.1)
Minder first processes the raw monitoring data collected from each machine to prepare it for analysis.
- Aggregation into Time Windows: The continuous stream of monitoring data is grouped into discrete time windows.
- Data Alignment: Sampling points across all machines are aligned based on their timestamps. If data samples are missing at certain points, Minder fills these gaps by using data from the nearest available sampling time.
- Normalization: To ensure that multi-dimensional monitoring data, which often has different scales and ranges, can be effectively compared and integrated, Min-Max Normalization is applied. This technique scales the data to a predefined range, typically [0, 1]. The formula for Min-Max Normalization is:
  $$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
  Where:
  - $x'$ is the normalized value of the data point.
  - $x$ is the original value of the data point.
  - $x_{\min}$ is the minimum value of the feature (metric) in the dataset.
  - $x_{\max}$ is the maximum value of the feature (metric) in the dataset.
  This ensures that all metrics contribute proportionally to similarity calculations, preventing metrics with larger value ranges from dominating the comparison. A minimal sketch of this preprocessing step follows below.
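A minimal sketch of the preprocessing, assuming the per-metric samples arrive as a timestamp-indexed pandas DataFrame with one column per machine; the 1-second grid and 60-sample window length are illustrative choices, not values from the paper.

```python
import numpy as np
import pandas as pd

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """raw: timestamp-indexed samples of one metric, one column per machine (illustrative layout)."""
    # Align all machines on a common 1-second grid; fill gaps with the nearest sample.
    aligned = raw.resample("1s").nearest()
    # Min-max normalize each machine's series to [0, 1].
    span = (aligned.max() - aligned.min()).replace(0, 1.0)
    return (aligned - aligned.min()) / span

def to_windows(df: pd.DataFrame, samples_per_window: int = 60) -> np.ndarray:
    """Group the aligned stream into fixed-length windows: (num_windows, window_len, num_machines)."""
    usable = len(df) - len(df) % samples_per_window
    return df.to_numpy()[:usable].reshape(-1, samples_per_window, df.shape[1])
```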
4.2.2. Per-metric Model Training (4.2)
As established in Section 3.3, Minder trains individual learning models for each distinct monitoring metric. This is crucial because no single metric guarantees an indication for all fault types, and different metrics have varying sensitivities.
- Input Data for Training: For each metric, the preprocessed per-machine data within a time window (e.g., a vector of length $w$, representing $w$ consecutive samples) is used as an input instance. For example, to train a model for CPU Usage, a sequence of CPU Usage samples of length $w$ from each machine is fed into its dedicated model.
- Model Choice: LSTM-VAE: Minder specifically uses Variational Autoencoders (VAEs) enhanced with Long Short-Term Memory (LSTM) layers.
  - Variational Autoencoder (VAE): A VAE is a generative model that learns a compressed, lower-dimensional representation (called the latent space or embedding) of input data. It consists of an encoder and a decoder.
    - The encoder maps the input data $x$ to the parameters (mean $\mu$ and standard deviation $\sigma$) of a probability distribution in the latent space, from which a latent variable $z$ is sampled.
    - The decoder then reconstructs the original input data as $x'$ from this sampled latent variable $z$.
    - The VAE is trained to minimize the reconstruction error between $x$ and $x'$, while also ensuring that the latent space distribution stays close to a prior distribution (e.g., a standard normal distribution), which encourages continuous and meaningful latent representations.
    - For anomaly detection, VAEs are effective because they learn to reconstruct "normal" data well. Anomalous data, being outside the learned distribution of normal behavior, will be reconstructed poorly or mapped to distinct regions in the latent space. This process inherently performs denoising, as the model learns to capture the essential patterns of the data and filters out transient noise.
  - LSTM Integration: Given that monitoring data is temporal time series, LSTM networks are used within both the encoder and decoder of the VAE. LSTMs are well suited for processing sequences because they can capture long-term dependencies and context in the data, which traditional neural networks or simpler RNNs might miss. This allows the LSTM-VAE to effectively extract temporal characteristics and patterns from the time series data.

  The following figure (Figure 6 from the original paper) depicts the LSTM-VAE structure for Minder:

  The figure shows Minder's encoder and decoder: the encoder (left) receives the input monitoring data $x_1, \dots, x_n$, processes it through LSTM units, and produces the latent variable $z$; the decoder (right) takes $z$ and outputs the reconstructed data $x'$.

  The Encoder takes the input monitoring data $x_1, \dots, x_n$ and processes it through LSTM layers to generate the parameters of the latent variable $z$. The Decoder then takes this latent variable and uses LSTM layers to reconstruct the data (e.g., $p(x'_1|z), p(x'_2|z), \dots, p(x'_n|z)$). Model parameters, such as hidden_size (e.g., 4), latent_size (e.g., 8), and lstm_layer (e.g., 1), are configured for each model. The output of this training is a set of individual LSTM-VAE models, one for each monitoring metric, capable of denoising and reconstructing time series input into a representative vector (embedding). A minimal model sketch follows below.
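The following is a minimal PyTorch sketch of such an LSTM-VAE for a single metric. It only mirrors the structure described above (an LSTM encoder producing the latent mean and log-variance, a reparameterized latent z, and an LSTM decoder reconstructing the window); the sequence length and loss weighting are assumptions, and only hidden_size, latent_size, and lstm_layers roughly follow the example values mentioned in the text.

```python
import torch
import torch.nn as nn

class LSTMVAE(nn.Module):
    """Minimal LSTM-VAE sketch for one monitoring metric (dimensions are illustrative)."""

    def __init__(self, seq_len=60, hidden_size=4, latent_size=8, lstm_layers=1):
        super().__init__()
        self.seq_len = seq_len
        self.encoder = nn.LSTM(1, hidden_size, lstm_layers, batch_first=True)
        self.to_mu = nn.Linear(hidden_size, latent_size)
        self.to_logvar = nn.Linear(hidden_size, latent_size)
        self.from_z = nn.Linear(latent_size, hidden_size)
        self.decoder = nn.LSTM(hidden_size, hidden_size, lstm_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x):                      # x: (batch, seq_len, 1)
        _, (h, _) = self.encoder(x)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        dec_in = self.from_z(z).unsqueeze(1).repeat(1, self.seq_len, 1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out), mu, logvar   # reconstruction x', and latent statistics

def vae_loss(x, x_hat, mu, logvar):
    recon = nn.functional.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

At detection time, the latent mean mu (or a sampled z) can serve as the per-machine embedding that is later compared across machines.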
4.2.3. Monitoring Metric Prioritization (4.3)
To accelerate the detection process, Minder prioritizes metrics based on their sensitivity to faults. This step runs in parallel with the model training.
- Step 1: Z-score Calculation for Evaluating Metric Sensitivity: To identify which metrics are most indicative of a fault, Minder uses the Z-score statistic. The Z-score quantifies how many standard deviations an element is from the mean. A higher absolute Z-score suggests that a data point is an outlier and deviates significantly from the typical distribution, making it useful for identifying unusual behavior on a faulty machine. The Z-score of the $i$-th machine for the $j$-th monitoring metric is:
  $$Z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$
  Where:
  - $Z_{ij}$ is the Z-score of the $i$-th machine for the $j$-th monitoring metric.
  - $x_{ij}$ is the sample value of the $i$-th machine for the $j$-th metric at a specific sampling point.
  - $\mu_j$ is the average value of all machines for the $j$-th metric at that sampling point.
  - $\sigma_j$ is the standard deviation of all machines for the $j$-th metric at that sampling point.
  When a fault occurs, the affected machine's metric value will likely deviate significantly from the mean of all machines, resulting in a high absolute Z-score. For a given time window of a training task, Minder considers the maximum $|Z_{ij}|$ across all machines for the $j$-th metric to represent the overall dispersion.
- Step 2: Prioritization of Monitoring Metrics using a Decision Tree: Based on the calculated maximum Z-scores for each metric, Minder constructs a decision tree to determine metric sensitivity and prioritize the metrics.
  - Reasoning for the Decision Tree:
    - Resemblance to Rule-based Policies: Decision trees mimic the logical, rule-based structures often found in network monitoring systems (e.g., "if CPU Usage drops to zero, alert").
    - High Expressiveness and Faithfulness: Decision trees are non-parametric and can represent complex decision-making processes effectively.
  - Training Process:
    - Minder collects the maximum Z-score for each metric (from Step 1) over various time windows and training tasks.
    - Each collection of maximum Z-scores (one per metric) forms an instance.
    - These instances are manually labeled as "normal" or "abnormal" based on whether a faulty machine was present in that time window.
    - A decision tree is then trained using these labeled instances.
  - Prioritization Outcome: The decision tree prioritizes metrics. Metrics closer to the root of the tree are considered more sensitive and informative for identifying faulty machines. As shown in Figure 7 of the paper, metrics like PFC, CPU, GPU, and NVLink-related metrics are identified as highly informative. This aligns with empirical observations that CPU and GPU states, and network quality (intra-host NVLink, inter-host PFC), are critical indicators.

  The following figure (Figure 7 from the original paper) shows the top 7 layers of the decision tree for prioritization:

  The figure shows decision tree nodes that evaluate the health of different monitoring metrics via their Z-scores, including PFC, CPU, and GPU usage, with windows labeled as normal or abnormal.

  This decision tree visually represents the hierarchy of metric sensitivity, with PFC and CPU metrics appearing at higher levels, indicating their higher importance in fault detection. A small sketch of the Z-score computation and tree-based prioritization follows below.
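A small sketch of Steps 1 and 2 under stated assumptions: per-window maximum |Z-score| features feed a scikit-learn DecisionTreeClassifier, and feature importances stand in for the "closer to the root" ordering. The toy data, tree depth, and use of feature_importances_ are illustrative choices, not the paper's exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def max_abs_zscores(window: np.ndarray) -> np.ndarray:
    """window: (num_machines, num_metrics). Returns the per-metric max |Z-score| across machines."""
    mu = window.mean(axis=0)
    sigma = window.std(axis=0) + 1e-9
    return np.abs((window - mu) / sigma).max(axis=0)

# Toy data standing in for labeled production windows: 200 windows, 16 machines, 6 metrics.
rng = np.random.default_rng(0)
windows = rng.normal(0.5, 0.05, size=(200, 16, 6))
labels = rng.integers(0, 2, size=200)            # 1 = "a faulty machine was present"
windows[labels == 1, 0, 0] += 0.5                # abnormal windows deviate on metric 0

X = np.stack([max_abs_zscores(w) for w in windows])
tree = DecisionTreeClassifier(max_depth=7).fit(X, labels)

# Metrics nearer the root (higher importance) are checked first at detection time.
priority = np.argsort(tree.feature_importances_)[::-1]
print(priority)                                  # metric 0 should rank first in this toy setup
```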
4.2.4. Online Faulty Machine Detection (4.4)
This is the run-time component of Minder that continuously monitors training tasks.
- Iterative Metric Check: Minder follows the prioritized list generated by the decision tree and selects the highest-priority metric first.
- Denoising and Embedding: The monitoring data for the selected metric from each machine is fed into its corresponding LSTM-VAE model. This model denoises the data and outputs a reconstructed embedding (a low-dimensional vector representing the denoised data pattern).
- Step 1: Similarity-based Distance Check per Time Window: Minder compares the embeddings from all machines for the current metric to identify outliers.
  - Embedding Reconstruction: For each machine, its vector of monitoring data for the selected metric is passed through its LSTM-VAE model to obtain a reconstructed embedding.
  - Pairwise Euclidean Distances: The Euclidean distance is calculated between the embeddings of every pair of machines. Euclidean distance is a common metric for measuring the dissimilarity between two data points in a multi-dimensional space. The Euclidean distance between two embeddings $E_A$ and $E_B$ is given by:
    $$d(E_A, E_B) = \sqrt{\sum_{k=1}^{n} (e_{Ak} - e_{Bk})^2}$$
    Where:
    - $d(E_A, E_B)$ is the Euclidean distance between embeddings $E_A$ and $E_B$.
    - $e_{Ak}$ and $e_{Bk}$ are the $k$-th components of embeddings $E_A$ and $E_B$, respectively.
    - $n$ is the dimensionality of the embeddings.
  - Dissimilarity Score Calculation: For each machine, Minder sums its Euclidean distances to all other machines. This sum represents the machine's overall dissimilarity from the rest of the cluster. A machine with a fault will have a significantly higher sum of distances.
  - Normal Score Calculation: To normalize these dissimilarity sums, especially since their magnitude can shift with the number of machines, Minder calculates a normal score for each machine's sum of distances. (The paper does not provide an explicit formula for the normal score but states that its purpose is normalization.)
  - Candidate Identification: The machine with the maximum normal score is considered the most likely faulty candidate. If this maximum normal score exceeds a predefined similarity threshold, the machine is identified as a faulty machine candidate for that specific time window.
- Step 2: Continuity Check for Consecutive Time Windows: To avoid false alarms from temporary jitters, Minder applies a continuity check.
  - If a machine is identified as a faulty machine candidate in a time window, Minder then checks whether the same machine has been consistently identified as a candidate in a series of consecutive time windows.
  - If the same machine continuously exceeds the similarity threshold for a number of consecutive times that surpasses a continuity threshold (e.g., 4 minutes), it is confirmed as a truly faulty machine. This threshold is empirically chosen to filter out short-term noise while capturing actual persistent performance degradation (as shown in Figure 4).
- Iteration: If no faulty machine is detected using the current metric, Minder moves to the next metric in the prioritized list and repeats the distance and continuity checks. This process continues until a faulty machine is identified or all prioritized metrics have been checked. If no anomaly is detected after checking all metrics, Minder concludes that no fault is currently present. A minimal sketch of this detection loop follows below.
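A minimal sketch of the online detection loop under stated assumptions: the normal score is approximated with a simple standardization (the paper does not give its exact formula), `models[m].embed` is a hypothetical helper returning one embedding per machine, and the threshold values are illustrative (a continuity of 4 corresponds to roughly four minutes only if each window covers about one minute).

```python
import numpy as np

def dissimilarity_scores(embeddings: np.ndarray) -> np.ndarray:
    """embeddings: (num_machines, dim). Returns each machine's summed Euclidean distance
    to all other machines, standardized as a stand-in for the paper's normal score."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(-1))                 # pairwise Euclidean distances
    sums = dist.sum(axis=1)
    return (sums - sums.mean()) / (sums.std() + 1e-9)

def detect(windows_by_metric, models, priority, sim_thresh=3.0, continuity=4):
    """windows_by_metric[m]: windows over time, each (num_machines, seq_len, 1).
    models[m]: trained LSTM-VAE for metric m. All thresholds are illustrative."""
    for m in priority:                                   # most sensitive metrics first
        streak, candidate = 0, None
        for window in windows_by_metric[m]:
            emb = models[m].embed(window)                # assumed helper: (machines, dim)
            scores = dissimilarity_scores(emb)
            top = int(np.argmax(scores))
            if scores[top] > sim_thresh and top == candidate:
                streak += 1                              # same machine stays an outlier
            else:
                candidate, streak = (top, 1) if scores[top] > sim_thresh else (None, 0)
            if streak >= continuity:                     # persistent across consecutive windows
                return m, candidate                      # metric that flagged it, faulty machine
    return None                                          # no fault found on any metric
```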
5. Experimental Setup
5.1. Datasets
Minder's evaluation dataset comprises 150 run-time fault instances collected from online distributed training tasks over a period of nine months. These tasks utilized 3D parallelism (Data Parallelism, Pipeline Parallelism, and Tensor Parallelism), indicating their large-scale and complex nature.
- Scale and Scope: The tasks involved diverse scales, ranging from 4 to over 1,500 machines, with up to 10,000 NVIDIA Ampere GPUs. This covers a wide spectrum of machine scales, including those depicted in Figure 1 of the paper. Approximately 30% of these tasks involved a minimum of 600 machines.
- Fault Types: The dataset encompasses all fault types listed in Table 1 (see below for a transcription). The dominant fault types observed were:
  - ECC error: 25.7%
  - CUDA execution error: 15%
  - GPU execution error: 10%
  - PCIe downgrading: 8.6%
- Focus: The dataset primarily covers faults occurring in an individual machine, as these constitute 99% of all faulty cases in their production environment. The paper also includes a brief evaluation of concurrent faulty machines.
- Monitoring Granularity: Monitoring metrics (listed in Appendix B, Table 2) were collected at a second-level granularity.
- Data Split: Data from the first three months of the dataset was used for training the LSTM-VAE models, while the remaining six months of data were used for evaluation.
- Ground Truth: Faulty machines in the dataset were manually confirmed through offline log-checking, nccl-tests, or hardware tests.

The following are the results from Table 1 of the original paper:
| Category | Fault type | Frequency | M1 | M2 | M3 | M4 | M5 | M6 |
|---|---|---|---|---|---|---|---|---|
| Intra-host hardware faults (55.8%) | ECC error | 38.9% | 80.0% | 65.7% | 8.6% | 45.7% | 11.4% | 57.1% |
| | PCIe downgrading | 6.6% | 0.0% | 8.3% | 100% | 33.3% | 8.3% | 0.0% |
| | NIC dropout | 5.7% | 100% | 100% | 0.0% | 0.0% | 100% | 100% |
| | GPU card drop | 2.0% | 75.0% | 70.0% | 5.0% | 50% | 20.0% | 55.0% |
| | NVLink error | 1.7% | 83.3% | 50.0% | 16.7% | 50% | 0.0% | 66.7% |
| Intra-host software faults (28.0%) | CUDA execution error | 0.9% | 25.0% | 25.0% | 0.0% | 25.0% | 25.0% | 25.0% |
| | GPU execution error | 14.6% | 61.9% | 57.1% | 19.0% | 33.3% | 14.3% | 61.9% |
| | HDFS error | 7.7% | 50.0% | 71.4% | 14.3% | 42.9% | 21.4% | 42.8% |
| Inter-host network faults (6.0%) | Machine unreachable | 5.7% | 57.1% | 57.1% | 0.0% | 14.3% | 0% | 14.3% |
| Others (10.3%) | Others | 6.0% | 47.4% | 63.2% | 0.0% | 53.6% | 26.3% | 15.8% |
(Note: The six metric columns, labeled M1-M6 here, are unlabeled in the source transcription; based on the paper's text, they likely correspond to CPU, GPU, PCIe, memory, disk, and PFC/network-related metrics.)
5.2. Evaluation Metrics
For evaluating the performance of Minder, standard metrics from classification tasks are used: Precision, Recall, and F1-score. These metrics are defined based on True Positives (TP), False Negatives (FN), True Negatives (TN), and False Positives (FP).
- True Positives (TP):
  - Conceptual Definition: Instances where Minder correctly identifies a machine as faulty when it is, in fact, faulty.
  - Symbol Explanation: TP represents the count of correctly detected faulty machines.
- False Negatives (FN):
  - Conceptual Definition: Instances where Minder fails to detect a machine that is, in fact, faulty, i.e., a missed fault.
  - Symbol Explanation: FN represents the count of faulty machines that Minder failed to detect.
- True Negatives (TN):
  - Conceptual Definition: Instances where Minder correctly approves a machine as normal when it is, in fact, running normally (i.e., no fault is detected, and no fault exists).
  - Symbol Explanation: TN represents the count of correctly identified normal machines.
- False Positives (FP):
  - Conceptual Definition: Instances where Minder incorrectly identifies a machine as faulty when there is actually no fault (i.e., a false alarm).
  - Symbol Explanation: FP represents the count of machines incorrectly flagged as faulty.

Based on these, the following composite metrics are calculated:

- Precision:
  - Conceptual Definition: Measures the accuracy of positive predictions. It answers: "Of all machines Minder identified as faulty, how many were actually faulty?" High precision means fewer false alarms.
  - Mathematical Formula: $$\text{Precision} = \frac{TP}{TP + FP}$$
  - Symbol Explanation: TP: True Positives (correctly identified faulty machines). FP: False Positives (incorrectly identified faulty machines).
- Recall:
  - Conceptual Definition: Measures Minder's ability to find all the actual faulty machines. It answers: "Of all the machines that were actually faulty, how many did Minder correctly identify?" High recall means fewer missed faults.
  - Mathematical Formula: $$\text{Recall} = \frac{TP}{TP + FN}$$
  - Symbol Explanation: TP: True Positives (correctly identified faulty machines). FN: False Negatives (faulty machines that Minder missed).
- F1-score:
  - Conceptual Definition: The harmonic mean of Precision and Recall. It provides a single score that balances both, which is useful when there is an uneven class distribution (e.g., faults are rare). A high F1-score indicates that Minder has both low false alarms and low missed detections.
  - Mathematical Formula: $$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
  - Symbol Explanation: Precision and Recall as defined above.

A tiny numerical example follows below.
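A tiny worked example of these formulas; the counts are hypothetical and not from the paper.

```python
def prf1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 47 correct detections, 5 false alarms, 6 missed faults.
print(prf1(tp=47, fp=5, fn=6))   # -> roughly (0.904, 0.887, 0.895)
```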
5.3. Baselines
The paper compares Minder against several baselines and variations in its evaluation:
- Mahalanobis Distance (MD):
  - Description: MD is a widely used statistical method for identifying outliers. It measures the distance between a point and a distribution, taking into account the correlations between variables. It typically involves calculating features like mean, variance, skewness, and kurtosis, potentially applying Principal Component Analysis (PCA), and then computing pairwise distances.
  - Representativeness: MD is a strong baseline because it is a common and robust multivariate outlier detection technique, suitable for comparing a machine's multi-dimensional metric data against the cluster's distribution. The paper uses it to highlight the importance of Minder's denoising and pattern extraction capabilities.
- Variations of Minder's Model and Design Choices (Ablation Studies): The paper also conducts ablation studies by comparing Minder against its own variations to validate specific design choices:
  - RAW (Raw Data): Calculates Euclidean distances directly on the preprocessed raw monitoring data, without using a VAE for denoising and reconstruction. This baseline highlights the importance of denoising.
  - CON (Concatenated Embeddings): Concatenates the individual embeddings produced by each metric's LSTM-VAE model into a single, larger embedding, and then calculates distances on this combined embedding. This tests the hypothesis that individual models are better than a simple concatenation.
  - INT (Integrated Model): Trains a single LSTM-VAE model using all monitoring metrics simultaneously as input, rather than training individual models per metric. This tests the hypothesis that individual models are better than a single integrated model due to potential mutual interference between metrics.
  - Minder without Continuity: Compares Minder's performance when the continuity check (Sections 3.2 and 4.4) is removed and an alert is triggered immediately upon detecting a candidate in a single time window. This validates the role of the continuity principle in reducing false alarms.
  - Different Distance Measures: Compares Minder using Euclidean Distance with Manhattan Distance (MhtD) and Chebyshev Distance (ChD) to understand the impact of the distance metric on performance.
6. Results & Analysis
6.1. Overall Performance
6.1.1. Total Data Processing Time
Minder demonstrates significant efficiency in fault detection. On average, a call to Minder takes 3.6 seconds to issue an alert. This time includes both data pulling from Data APIs (fetching 15-minute monitoring data) and processing (preprocessing, LSTM-VAE inference, distance checks). This is a dramatic improvement over manual diagnosis. The paper states that this reduces the time by 99% (or 500x shorter) compared to the manual process, which often takes hours or even days (as indicated in Figure 2 of the paper, showing manual diagnosis times often exceeding 30 minutes and sometimes days).
The following figure (Figure 8 from the original paper) shows the total data processing time for a call of Minder:
The figure plots task index against total time, data pulling time, and processing time (in seconds): the blue line is the total time, the yellow line the data pulling time, and the brown line the processing time.
The figure illustrates that the Total Time (blue line) for Minder's operation is consistently low, with the Processing Time (brown line) being a small fraction of the total, indicating efficient algorithmic execution. The Data Pulling Time (yellow line) often dominates the overall latency, highlighting the overhead of data retrieval from the database.
6.1.2. Comparison with Baseline Algorithm (MD)
Minder is compared against Mahalanobis Distance (MD), a common outlier detection method.
The following are the results from Figure 9 of the original paper:
The figure is a table comparing Minder and MD on precision, recall, and F1-score: Minder reaches 0.904, 0.883, and 0.893, clearly outperforming MD's 0.788, 0.767, and 0.777.
| Metric | Minder | MD |
|---|---|---|
| Precision | 0.904 | 0.788 |
| Recall | 0.883 | 0.767 |
| F1-score | 0.893 | 0.777 |
Analysis:
- Minder significantly outperforms MD across all metrics: Precision (0.904 vs. 0.788), Recall (0.883 vs. 0.767), and F1-score (0.893 vs. 0.777).
- The higher Precision for Minder (0.904) indicates that it generates significantly fewer false alarms than MD.
- The higher Recall for Minder (0.883) shows that it is better at detecting actual faults, missing fewer.
- The F1-score of 0.893 demonstrates a strong balance between Precision and Recall, making Minder a robust detector.
- Reason for Superiority: MD detects outliers statistically, which makes it susceptible to jitters and noise in the raw monitoring data. Minder's use of LSTM-VAE to denoise and extract more representative data patterns before the distance calculation allows it to achieve higher accuracy by focusing on persistent, meaningful deviations.
6.1.3. Performance Breakdown with Fault Types
The accuracy of Minder varies slightly depending on the specific fault type.
The following are the results from Figure 10 of the original paper:
(Note: The labels on the x-axis for individual fault types are not clearly readable in the provided image. Based on the paper's text, these include ECC error, CUDA execution error, GPU card drop, machine unreachable, NVLink error, HDFS error, PCIe downgrading, GPU execution error, and AOC errors.)
Analysis:
- Minder is highly effective in detecting faults related to CPU and GPU states and networking performance, such as ECC error, CUDA execution error, GPU card drop, machine unreachable, NVLink error, and HDFS error. These faults manifest clearly in the metrics that Minder monitors.
- Lower Recall for Specific Faults: GPU execution error and PCIe downgrading show lower Recall. This is attributed to instances where concurrent faulty GPUs or PCIe links within a machine lead to rapid fault propagation across multiple machines (due to 3D parallelism). At second-level granularity, these rapid, widespread impacts are harder for Minder to isolate to a single faulty machine. AOC errors (Active Optical Cable errors) are partially missed due to the current absence of specific optical-cable-related counters in the monitoring metrics. The paper notes that these errors are rare and often detected by other dedicated switch monitoring systems.
- Overall Accuracy: Despite these specific challenges, Minder maintains a high overall accuracy, suggesting that the dominant fault types are well covered.
- Nature of False Positives: Many of Minder's "errors" (false alarms) were not entirely incorrect but detected short-term metric fluctuations or performance jitters that were not the root cause of a task halt, yet still indicated temporary performance degradation. This suggests Minder is sensitive to even minor deviations.
6.1.4. Performance Breakdown by Fault Occurrences of a Task's Lifetime
The paper also analyzes Minder's performance based on how many faults a task experiences throughout its lifetime (which can span months).
The following are the results from Figure 11 of the original paper:
The figure is a bar chart of Precision, Recall, and F-score for tasks grouped by the number of fault occurrences over their lifetime ([1,2], (2,5], (5,8], (8,11], (11,∞)), together with the overall scores.
| Fault occurrences over a task's lifetime | Precision | Recall | F-score |
|---|---|---|---|
| [1, 2] | 0.92 | 0.90 | 0.91 |
| (2, 5] | 0.90 | 0.88 | 0.89 |
| (5, 8] | 0.89 | 0.87 | 0.88 |
| (8, 11] | 0.88 | 0.85 | 0.86 |
| (11, ∞) | 0.91 | 0.89 | 0.90 |
| Overall | 0.904 | 0.883 | 0.893 |
Analysis:
- The results show that
Minder's accuracy (Precision, Recall, F-score) is largely independent of the frequency of fault occurrences throughout a task's lifetime. - Performance remains consistently high across tasks experiencing varying numbers of faults (e.g., [1,2] faults vs. (11,∞) faults).
- Reason: This is because faults are treated as random, independent events, and in the production environment, a faulty machine is promptly replaced. Therefore, the history of fault occurrences for a task does not inherently affect
Minder's ability to detect a new fault. The slight dip for the (8,11] group is attributed to the limited number of tasks in that specific category, which can introduce statistical variability.
6.2. Analysis of Monitoring Metric Selection
To validate the proper selection of monitoring metrics (as discussed in Section 4.3), an experiment was conducted by varying the number of metrics used for training and detection.
The following are the results from Figure 12 of the original paper:
The figure is a bar chart comparing Precision, Recall, and F1-score for Minder's metric selection against fewer and more metrics; Minder's precision of 0.904 is notably higher than 0.866 with fewer metrics and 0.883 with more metrics.
| Metric Selection | Precision | Recall | F1-score |
|---|---|---|---|
| Minder (Optimal) | 0.904 | 0.883 | 0.893 |
| Fewer Metrics | 0.866 | 0.841 | 0.853 |
| More Metrics | 0.883 | 0.905 | 0.894 |
Analysis:
- Minder's Optimal Selection: The current selection of metrics used by Minder achieves the best Precision (0.904) and a high F1-score (0.893), indicating a strong balance and minimal false alarms.
- "Fewer Metrics": Using fewer metrics results in lower Precision, Recall, and F1-score. This suggests that excluding key metrics undermines the outlier detection capacity, as crucial fault indicators may be missing.
- "More Metrics": Including more metrics (e.g., additional GPU metrics like GPU Temperature, GPU Clocks, GPU Memory Bandwidth Usage, and GPU FP Engine Activity) leads to a higher Recall (0.905) but a lower Precision (0.883) compared to Minder's optimal selection.
  - Interpretation: While more metrics might catch more true faults (higher Recall), they also introduce more mutual interference. Different metrics can indicate different patterns, potentially obfuscating the detection process and leading to more false positives (lower Precision).
- Conclusion: This ablation study validates Minder's strategy of carefully selecting and prioritizing metrics. The top metrics identified through prioritization are sufficient to cover potential malfunction points, avoiding the pitfalls of both too few (missing indicators) and too many (mutual interference) metrics.
6.3. Analysis of Model Selection
The paper compares Minder's choice of LSTM-VAE and its individual per-metric model approach against other statistical methods and VAE variations.
The following are the results from Figure 13 of the original paper:
The figure is a bar chart comparing Minder with the RAW, CON, and INT variants on Precision, Recall, and F1-score; Minder outperforms the other methods on all metrics, especially precision and F1-score.
| Model Selection | Precision | Recall | F1-score |
|---|---|---|---|
| Minder (LSTM-VAE) | 0.904 | 0.883 | 0.893 |
| RAW | 0.758 | 0.741 | 0.749 |
| CON | 0.801 | 0.789 | 0.795 |
| INT | 0.793 | 0.778 | 0.785 |
Analysis:
- Minder's Superiority: Minder (using individual LSTM-VAE models) clearly outperforms all other variants in Precision, Recall, and F1-score.
- RAW (Raw Data): Shows significantly worse performance across all metrics (e.g., an F1-score of 0.749). This highlights the crucial role of the denoising and reconstruction provided by the VAE models: raw data contains too much noise and jitter, leading to poor outlier detection.
- CON (Concatenated Embeddings) & INT (Integrated Model): Both CON and INT perform worse than Minder (F1-scores of 0.795 and 0.785, respectively).
  - Reasoning: These approaches either concatenate embeddings from individual models (CON) or train a single VAE model with all metrics as input (INT). The degradation in performance for CON and INT supports Minder's design choice of individual models. When multiple metrics are treated with equal significance, or integrated into one model, they can introduce mutual interference. Different metrics have varying sensitivities to faults, and their time series may not fluctuate in the same manner, so a monolithic model can get "misdirected or confused" by the array of metrics, leading to less accurate detection.
- LSTM-VAE Effectiveness: The paper also notes that the Mean Squared Error (MSE) between the input and the reconstructed data of the LSTM-VAE is less than 0.0001, demonstrating its high effectiveness in reconstructing (and thus denoising) the data.
- Conclusion: This analysis strongly validates the choice of LSTM-VAE for denoising and the strategy of training individual models for each monitoring metric as critical components of Minder's superior performance.
6.4. Analysis of Continuity and Threshold
To verify the importance of the continuity principle (Section 3.2), Minder's performance is compared with and without its application.
The following are the results from Figure 14 of the original paper:
The figure compares Minder with and without the continuity check: Minder achieves precision 0.904, recall 0.883, and F1-score 0.893, versus 0.757, 0.777, and 0.767 without continuity, showing a clear advantage.
| Model | Precision | Recall | F1-score |
|---|---|---|---|
| Minder (with Continuity) | 0.904 | 0.883 | 0.893 |
| Without Continuity | 0.757 | 0.777 | 0.767 |
Analysis:
- Impact of Continuity: Minder with continuity significantly outperforms Minder without continuity across all metrics (F1-score of 0.893 vs. 0.767).
- Reduction in False Alarms: The most notable difference is in Precision (0.904 vs. 0.757). This indicates that the continuity check is highly effective in reducing false alarms that would otherwise be triggered by instant bursts, temporary counter noise, or short-term jitters in the monitoring data. Without continuity, the system would immediately alert on any detected deviation, leading to many spurious notifications.
- Fault Persistence: The continuity principle is based on the observation that actual faults typically lead to deteriorated performance that persists for a period (as shown in Figure 4 of the paper, where most abnormal patterns last over five minutes). By requiring a machine to show dissimilarity for a continuous duration, Minder effectively filters out transient, non-fault-related events.
- Threshold Selection: The continuity threshold is set to four minutes. This value is chosen empirically based on the observed duration of real-world faults. A shorter threshold would reintroduce more false alarms, while a longer one might delay detection or miss some legitimate but shorter-lived faults.
6.5. Choice of Distance Measures
The paper investigates the impact of different distance measures on Minder's performance by comparing Euclidean Distance with Manhattan Distance (MhtD) and Chebyshev Distance (ChD).
The following are the results from Figure 15 of the original paper:
| Distance Measure | Precision | Recall | F1-score |
|---|---|---|---|
| Euclidean Distance | 0.904 | 0.883 | 0.893 |
| Manhattan Distance | 0.901 | 0.878 | 0.889 |
| Chebyshev Distance | 0.871 | 0.855 | 0.863 |
Analysis:
- Robustness to the Distance Measure: Minder achieves similarly high performance with both Euclidean Distance (0.893 F1-score) and Manhattan Distance (0.889 F1-score). This suggests that the embeddings produced by the LSTM-VAE are already highly representative and that faulty machines are distinctly clear outliers in this latent space, regardless of the specific distance metric used.
- Euclidean vs. Manhattan: Both distances perform very well. The strong result with Manhattan Distance (the sum of absolute differences along each dimension) implies that the spatial distribution of the embeddings provides a solid representation of machine state.
- Chebyshev Distance: Chebyshev Distance (the largest difference along any single dimension) yields slightly worse Precision (0.871) and F1-score (0.863). This suggests that relying on a single maximum per-dimension difference is insufficient for accurately comparing dissimilarity, as faults may manifest across multiple dimensions without an extreme value in any single one. Minder's approach of summing distances (either Euclidean or Manhattan) considers the overall deviation from normal machines, intensifying the dissimilarity of true outliers. A short snippet contrasting the three measures follows below.
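A short snippet contrasting the three measures on a pair of illustrative embeddings; it only illustrates why Chebyshev can understate a deviation that is spread across many dimensions.

```python
import numpy as np

def euclidean(a, b):  return float(np.sqrt(((a - b) ** 2).sum()))
def manhattan(a, b):  return float(np.abs(a - b).sum())
def chebyshev(a, b):  return float(np.abs(a - b).max())

# Two illustrative 8-dimensional embeddings: a moderate shift spread over every
# dimension is clearly visible to Euclidean/Manhattan, but Chebyshev only sees
# the single largest per-dimension gap.
a = np.zeros(8)
b = np.full(8, 0.2)
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))   # ~0.566, 1.6, 0.2
```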
6.6. Performance with Multiple Concurrent Faulty Machines
The ability of Minder to detect multiple concurrent faulty machines is explored, noting its dependence on the faulty machine scale ratio and monitoring data granularity.
- Challenges at Scale: In production environments with 3D parallelism, multiple faulty machines can rapidly impact many DP and PP groups, causing fast negative propagation across the entire cluster.
  - Coarse Granularity Limitation: The standard second-level monitoring granularity becomes a limitation here. Rapid propagation (e.g., within an iteration lasting tens of milliseconds) can be annihilated or overlooked, as the distinct pattern of individual faulty machines is obscured by the widespread group effect before the next second-level sample.
- Real-world Concurrent Faults: Concurrent faulty machine instances in their production environment are rare (less than 1% per month), primarily caused by switch reboots or AOC errors. When a switch reboots (e.g., 32 out of 600 machines going offline), Minder struggles to distinguish individual faulty outliers due to the high fault ratio and the rapid, widespread impact. In such cases, dedicated switch monitoring systems are more effective.
- Millisecond-level Monitoring Experiment: To demonstrate Minder's potential, an experiment was conducted by injecting PCIe downgrading into two of four machines simultaneously, using customized millisecond-level monitoring:
The following figure (Figure 16 from the original paper) shows the millisecond-level NIC throughput for all machines after injection of PCIe downgrading on two NICs:
Figure description (translated): NIC throughput over time for the normal NICs and the abnormal NICs, with the span of one Reduce-Scatter step marked.
- Results: With millisecond-level data from the NICs, Minder successfully detected the two NICs connected to the faulty PCIe links. These NICs showed the largest outlier distances during Reduce-Scatter operations.
- Explanation: Normal NICs show high throughput at the start of a Reduce-Scatter step and then drop to zero while waiting for slower NICs. The NICs with degraded PCIe links, however, exhibit steady and low throughput. This distinct pattern is clearly visible at millisecond-level granularity.
- Conclusion: This experiment suggests that, with finer-grained monitoring data, Minder is capable of detecting multiple concurrent faults, as the rapid propagation and individual straggler behavior become observable. The current limitation for widespread multi-fault detection is primarily the standard second-level granularity (a small synthetic sketch of this pattern separation follows).
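As a rough illustration of why the two throughput patterns separate cleanly at millisecond granularity, the following sketch builds synthetic traces (bursty-then-idle for normal NICs, steady-low for degraded ones) and applies the same sum-of-distances scoring to the raw traces. The trace shapes, Gbps ranges, and NIC names are invented for the example and are not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)
MS = 200  # one synthetic window of millisecond samples (invented length)

def normal_nic() -> np.ndarray:
    """Bursty pattern: high throughput early in a Reduce-Scatter step, then near zero."""
    burst = rng.uniform(80, 100, size=60)    # Gbps while actively sending
    idle = rng.uniform(0, 2, size=MS - 60)   # near zero while waiting on stragglers
    return np.concatenate([burst, idle])

def degraded_nic() -> np.ndarray:
    """Steady, low throughput caused by a downgraded PCIe link."""
    return rng.uniform(15, 25, size=MS)

traces = {f"nic{i}": normal_nic() for i in range(6)}
traces["nic6"], traces["nic7"] = degraded_nic(), degraded_nic()

# Sum-of-distances scoring on the raw millisecond traces: the degraded NICs
# sit far from the majority pattern, so their sums are the largest.
names = list(traces)
mat = np.stack([traces[n] for n in names])
dists = np.linalg.norm(mat[:, None, :] - mat[None, :, :], axis=2)
scores = dists.sum(axis=1)
top_two = sorted(zip(scores, names), reverse=True)[:2]
print([name for _, name in top_two])  # expected: the two degraded NICs
```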
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Minder, an automatic system for detecting faulty machines in large-scale distributed model training tasks. Minder tackles the challenges of frequent faults, lengthy manual diagnosis, and the complexity of monitoring diverse fault types in dynamic environments. Its core design leverages machine-level similarity (comparing a machine's metrics to its peers), machine-level continuity (ensuring persistent abnormal patterns), individual learning-based denoising models (using LSTM-VAE for each monitoring metric), and metric prioritization to achieve efficient and accurate detection.
Deployed in a production environment for over a year, Minder monitors daily tasks involving thousands of machines. Evaluation results demonstrate its effectiveness, reacting to faults within 3.6 seconds on average, with a precision of 0.904 and an F1-score of 0.893. This significantly reduces manual debugging time by over 99%, validating its design choices and practical utility.
7.2. Limitations & Future Work
The authors highlight several limitations and propose future research directions:
- Limitations in Detecting Certain Faults:
  - AOC errors and rapidly propagating concurrent faults (especially when the fault ratio is high, e.g., an entire switch reboot) are hard to capture effectively with the current second-level monitoring granularity and the available metrics.
  - While the labeled machine is the root cause, Minder-detected machines might have temporary performance fluctuations that are not the root cause but still impact training.
- Future Work Directions:
  - Finer Data Granularity: Exploring the benefits of millisecond-level monitoring (as hinted by the concurrent fault experiment in Section 6.6) to better capture rapid fault propagation and individual straggler behaviors that occur within single training iterations (tens of milliseconds).
  - Expanded Metric Spectrum: Incorporating additional hardware counters (e.g., AOC counters) and in-band traces (e.g., Torch Profiler, the Megatron-LM timer, CUDA event timers) to provide more fine-grained information on training and collective communication operations, leading to more comprehensive monitoring.
  - Robustness to New/Rare Fault Types: Minder is inherently designed to detect new or rare fault types as long as they manifest as discernible dissimilarities in the monitoring data.
  - Detection of Concurrent Faults: Further enhancing Minder's ability to detect multiple concurrent faulty machines, potentially by refining its similarity comparison or adapting to higher fault ratios.
  - Applicability to Other Workloads: Investigating Minder's effectiveness in other distributed ML workloads such as large-scale inference and fine-tuning, assuming these workloads also exhibit inter-machine metric similarity and fault continuity.
  - Root Cause Analysis: Currently, Minder identifies faulty machines but does not pinpoint the exact root cause. Future work aims to design fine-grained run-time monitoring for automated root cause identification, reducing the extra manual labor needed for network jitter and straggler analysis.
7.3. Personal Insights & Critique
- Inspirations:
  - Leveraging Domain-Specific Homogeneity: The paper's core insight into machine-level similarity in 3D parallel distributed training is highly inspiring. By exploiting the inherent homogeneity of these workloads, Minder can use an unsupervised approach, avoiding the challenges of labeled data and task-dependent thresholds. This highlights the power of combining deep domain knowledge with machine learning.
  - Practicality and Impact: The deployment in a production environment for over a year, with significant time savings, demonstrates strong practical value. It showcases how a well-designed automated system can alleviate a critical operational bottleneck in AI infrastructure.
  - Robustness through Multi-faceted Design: The combination of denoising (LSTM-VAE), continuity checks, and metric prioritization creates a robust system. Each component addresses a specific challenge (noise, false positives, efficiency), leading to a highly effective overall solution.
- Potential Issues, Unverified Assumptions, or Areas for Improvement:
  - Assumption of Homogeneity: Minder's reliance on machine-level similarity assumes a relatively homogeneous training environment and balanced workloads. If future distributed training paradigms become highly heterogeneous (e.g., machines with vastly different hardware, or highly imbalanced task distributions), this core assumption might be weakened, reducing Minder's effectiveness.
  - Root Cause Analysis Gap: While Minder excels at identifying which machine is faulty, the lack of automated root cause analysis means human engineers are still needed for the next diagnostic step. This could become a new bottleneck if faults grow more frequent or their root causes more obscure.
  - Empirical Thresholds: The continuity threshold of four minutes is empirically determined. While effective, an adaptive or dynamic thresholding mechanism, potentially learned from historical data or adjusted based on workload characteristics, might offer greater robustness and flexibility.
  - Granularity vs. Overhead Trade-off: The discussion of millisecond-level monitoring reveals a clear path toward detecting faster, more complex multi-machine faults. However, the overhead of millisecond-level monitoring (data collection, storage, processing) for thousands of machines is significant and remains a practical challenge for widespread deployment.
  - "Normal Score" Definition: The paper mentions calculating a "normal score" for each machine's sum of distances but does not provide a specific formula. While the concept is clear, the lack of an explicit mathematical definition might pose a slight challenge for full reproducibility without access to implementation details (a purely hypothetical formulation is sketched after this list).
  - Generalizability to Diverse Faults: While Minder is designed to be general, its performance on truly novel or unforeseen fault types (beyond the observed 150 instances) remains to be fully tested. The LSTM-VAE learns the "normal" manifold, but if a new fault produces a pattern that still lies close to this manifold in the latent space, it might be missed.
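Since the paper does not give the "normal score" formula, the following is a purely hypothetical illustration of one way such a score over the per-machine sums of distances could be defined, here as a robust (median/MAD) z-score. Nothing in this sketch is taken from the paper.

```python
import numpy as np

def hypothetical_normal_score(distance_sums: np.ndarray) -> np.ndarray:
    """Hypothetical 'normal score': a robust z-score of each machine's sum of
    distances, using the median and MAD so a single extreme outlier does not
    distort the baseline. Illustrative only; not the paper's definition."""
    med = np.median(distance_sums)
    mad = np.median(np.abs(distance_sums - med)) + 1e-9  # guard against zero MAD
    return (distance_sums - med) / (1.4826 * mad)

sums = np.array([10.2, 9.8, 10.5, 10.1, 37.4, 9.9])
print(hypothetical_normal_score(sums).round(2))  # the fifth machine stands out
```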