Towards LLM-Based Failure Localization in Production-Scale Networks

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

BiAn leverages LLMs to analyze network monitoring data, providing ranked error device diagnoses with explanations, reducing root cause analysis time and improving accuracy in production-scale cloud networks.

Abstract

Towards LLM-Based Failure Localization in Production-Scale Networks

Chenxu Wang¹,², Xumiao Zhang², Runwei Lu²,³, Xianshang Lin², Xuan Zeng², Xinlei Zhang², Zhe An², Gongwei Wu², Jiaqi Gao², Chen Tian¹, Guihai Chen¹, Guyue Liu⁴, Yuhong Liao², Tao Lin², Dennis Cai², Ennan Zhai²
¹Nanjing University ²Alibaba Cloud ³New York University Shanghai ⁴Peking University

Abstract: Root causing and failure localization are critical to maintain reliability in cloud network operations. When an incident is reported, network operators must review massive volumes of monitoring data and identify the root cause (i.e., error device) as fast as possible, making it extremely challenging even for experienced operators. Large language models (LLMs) have shown great potential in text understanding and reasoning. In this paper, we present BiAn, an LLM-based framework designed to assist operators in efficient incident investigation. BiAn processes monitoring data and generates error device rankings with detailed explanations. To date, BiAn has been deployed in our network infrastructure for 10 months, and it has successfully assisted operators in reducing the time to root causing by 20.5% on average (55.2% for high-risk incidents).

In-depth Reading

English Analysis

Bibliographic Information

  • Title: Towards LLM-Based Failure Localization in Production-Scale Networks
  • Authors: Chenxu Wang, Xumiao Zhang, Runwei Lu, Xianshang Lin, Xuan Zeng, Xinlei Zhang, Zhe An, Gongwei Wu, Jiaqi Gao, Chen Tian, Guihai Chen, Guyue Liu, Yuhong Liao, Tao Lin, Dennis Cai, Ennan Zhai
  • Affiliations: Nanjing University, Alibaba Cloud, New York University Shanghai, Peking University
  • Journal/Conference: ACM SIGCOMM 2025 Conference (SIGCOMM '25)
    • Comment on Venue Reputation: SIGCOMM is widely regarded as the premier conference in the field of computer networking. Publication at SIGCOMM signifies highly impactful and rigorously peer-reviewed research, making this paper a significant contribution to the networking community.
  • Publication Year: 2025 (scheduled)
  • Abstract: Root causing and failure localization are crucial for maintaining reliability in large-scale cloud networks. Network operators currently face the challenge of sifting through vast volumes of monitoring data to identify the root cause (i.e., error device) quickly during incidents. Large Language Models (LLMs) demonstrate strong potential in text understanding and reasoning for such tasks. This paper introduces BiAn, an LLM-based framework designed to assist operators in efficient incident investigation. BiAn processes monitoring data to generate ranked lists of error devices with detailed explanations. Deployed in Alibaba Cloud's network infrastructure for 10 months, BiAn has successfully reduced the time to root causing by 20.5% overall (55.2% for high-risk incidents). Extensive evaluations over 17 months of real cases confirm BiAn's accuracy and speed, improving accuracy by 9.2% compared to a baseline approach.
  • Original Source Link: https://ennanzhai.github.io/pub/sigcomm25-bian.pdf (This link points to the PDF of the paper, indicating it's a pre-print or accepted version for the upcoming SIGCOMM '25 conference).

Executive Summary

Background & Motivation (Why)

The paper addresses the critical problem of failure localization and root cause analysis in production-scale cloud networks. In large and complex network infrastructures, incidents are inevitable and can severely impact reliability, leading to significant revenue loss and reputational damage. When an incident occurs, network operators are tasked with quickly identifying the faulty device (the root cause) by reviewing massive volumes of monitoring data. This process is extremely challenging due to several factors:

  • Massive Data Volume: An incident can trigger thousands of alerts from hundreds of devices, generating gigabytes of log entries (Figure 2).

  • Complex Relationships: Anomalies can propagate through interconnected devices, making it difficult to trace the original error source, especially with spatial (topological) and temporal interdependencies.

  • Time Pressure: Operators must diagnose and resolve incidents rapidly, often under strict service-level agreements (SLAs), making thorough manual review impractical.

  • Network Evolution: Modern networks are highly dynamic, requiring troubleshooting tools to continuously adapt to changes in configurations and topologies.

  • Explainability and Consistency: Any automated system must provide clear, trustworthy explanations so that it assists operators rather than replacing them, and it must minimize the inherent randomness of LLM outputs.

    Traditional automated methods often lack the generalization capabilities for unseen cases or are limited to coarse-grained analysis. Large Language Models (LLMs) have shown superior capabilities in text understanding and reasoning, presenting a promising direction to assist human operators in sifting through textual monitoring data and performing logical deductions. The paper's novel approach is to leverage LLMs to build an intelligent assistant that mimics and augments human operators' incident investigation processes, providing accurate, fast, and explainable failure localization.

Main Contributions / Findings (What)

The primary contributions of this paper are:

  1. BiAn Framework: Introduction of BiAn, a practical and comprehensive LLM-based framework for accurate and fast root causing and failure localization in production cloud network operations.
  2. Hierarchical Reasoning: A novel hierarchical reasoning mechanism that breaks down the complex problem of analyzing large-scale, diverse monitoring data into manageable steps: monitor alert summary, single-device anomaly analysis, and joint scoring.
  3. Multi-pipeline Integration: Integration of three distinct pipelines (SOP-based anomaly analysis, network topology, and event timeline) to tackle the spatial and temporal complexity of network incidents, enhancing reasoning accuracy.
  4. Continuous Prompt Updating: A mechanism for continuously updating LLM prompts by extracting knowledge from historical incidents, enabling the system to adapt to evolving network infrastructures and operational procedures.
  5. System Optimizations: Implementation of several optimizations, including fine-tuning smaller LLMs for specific tasks, an early stop mechanism for efficient processing of simpler cases, and parallel execution for latency reduction.
  6. Real-world Deployment and Evaluation: Successful deployment of BiAn in Alibaba Cloud's global network infrastructure for 10 months, demonstrating its practical effectiveness. Extensive performance evaluations over 17 months of real-world incidents show significant improvements:
    • Reduced Time-to-Root-Causing (TTR): BiAn reduced TTR by 20.5% on average, with a notable 55.2% reduction for high-risk incidents.
    • Improved Accuracy: BiAn achieved 95.5% accuracy (identifying the error device as top-1) compared to 86.3% for the baseline Hot Device approach, and 97.1% when the baseline failed to differentiate between top devices.
    • Cost-Efficiency: Demonstrated low training and inference costs (average inference cost of $0.17 to $0.18 per incident).

Prerequisite Knowledge & Related Work

Foundational Concepts

  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and reason with human language. Key capabilities leveraged in BiAn include text understanding, reasoning, and text generation for explanations. LLMs can process prompts (instructions or questions) and generate responses.
  • Network Operations (NetOps): Refers to the management of network infrastructure, including monitoring, maintenance, troubleshooting, and ensuring network reliability and performance.
  • AIOps (Artificial Intelligence for IT Operations): The application of artificial intelligence (especially machine learning and LLMs) to automate and streamline IT operations tasks, including anomaly detection, root cause analysis, and performance optimization. BiAn is an example of AIOps for networking.
  • Root Cause Analysis (RCA): The process of identifying the fundamental reason for an incident or problem. In networking, this means pinpointing the specific device, link, or configuration error that initiated a failure.
  • Failure Localization: The process of identifying the exact location (e.g., a specific device or link) where a failure has occurred within a network. This is a key outcome of RCA.
  • Incident Management (IM): The structured process of responding to and resolving incidents to restore normal service operation as quickly as possible. This involves detection, diagnosis (including RCA and failure localization), mitigation, and resolution.
  • Network Telemetry: Data collected from network devices (e.g., routers, switches) about their operational state, performance metrics, and events. This data is crucial for monitoring and troubleshooting.
  • Standard Operating Procedures (SOPs): Detailed, written instructions to achieve uniformity of the performance of a specific function. In NetOps, SOPs guide operators through incident investigation steps. BiAn's design is informed by these SOPs.
  • Fine-tuning: A technique in machine learning where a pre-trained model (like an LLM) is further trained on a smaller, specific dataset to adapt it to a particular task or domain. This typically improves performance for the specialized task and can make smaller models more efficient.
  • Retrieval Augmented Generation (RAG): An LLM technique where the model retrieves relevant information from a knowledge base (e.g., documents, databases) before generating a response, helping it provide more accurate and context-aware answers and reduce hallucinations. The paper notes RAG is impractical for BiAn's continuous prompt updating due to data labeling challenges.
  • Chain-of-Thought (CoT) Prompting: A technique for LLMs where the model is prompted to generate intermediate reasoning steps before providing a final answer. This improves the model's ability to solve complex reasoning problems and makes its thought process more transparent.
  • Self-consistency: A technique that involves prompting an LLM multiple times for the same question, generating several CoT paths, and then selecting the most consistent answer among them. This often improves accuracy and robustness.
  • Entropy: In information theory, entropy quantifies the uncertainty or randomness in a probability distribution. A lower entropy indicates higher confidence in a prediction (e.g., if one device has a much higher probability of being the error device).
  • Token Limits: LLMs have a maximum number of tokens (words or sub-words) they can process in a single input or output. This limits the amount of raw data an LLM can analyze at once.

Previous Works

The paper discusses various categories of related work:

  • Network Telemetry and Measurements:
    • Many studies focus on collecting and analyzing network data at flow-level or packet-level (e.g., 007 [7], which focuses on pinpointing problems in TCP flows). These tools are helpful for preliminary identification but often cannot pinpoint exact error devices.
    • Other works optimize telemetry systems for efficiency and reliability (e.g., FlowRadar [41], Omnimon [32]).
    • NetPilot [74], CorrOpt [91], SWARM [52] focus on mitigating failures after they are located, complementing BiAn's localization role.
  • Network Failure Localization (Traditional Methods):
    • Statistical/Rule-based: Approaches using statistical techniques or predefined rules to locate link failures [28, 37], analyze packet drops [7, 53, 62], or troubleshoot IPTV/IP networks [6, 48, 76].
    • Probing/Tomography: Tools like NetSonar [83], NetPoirot [8], NetNORAD [38], and Pingmesh [22] identify bottlenecks or locate failures at a network level (e.g., spine, ToR switch). These can serve as data sources for BiAn.
    • Hardware/INT-based: Solutions requiring specific hardware support [26, 42, 66] or in-band network telemetry (INT) [11], often resource-intensive or intrusive, making them hard to deploy in heterogeneous infrastructures.
  • Troubleshooting at Upper Layers:
    • Focuses on software/service issues in cloud environments, often identifying root causes leveraging dependencies between entities [9, 17, 27, 34]. These primarily analyze performance metrics, while BiAn operates on textual alerts.
    • Other works address application-layer bugs unrelated to physical devices [43, 50, 57, 60, 68, 85].
  • Utilizing LLMs in the Network Domain (Prior LLM-NetOps):
    • Monitoring/Coarse-grained RCA: MonitorAssistant [81] generates guidance-oriented reports, Oasis [33] assesses impact scope and summarizes outages, RCACopILoT [16] outputs root cause categories. These often still require significant operator effort to pinpoint the specific error device.
    • Intent Recognition/Summarization: NetAssistant [65] recognizes diagnosis intent, while Ahmed et al. [5] study GPT's ability to produce root causes from incident titles/digests. These typically lack deep visibility into detailed logs.
    • Cloud Software IM: Roy et al. [56] evaluate ReAcT framework for incident analysis, Zhang et al. [86] use in-context learning for general-purpose LLMs, RCAgent [69] uses tool-augmented LLMs for cloud job anomalies, LMPACE [84] estimates confidence for black-box RCA. NetLLM [72] modifies LLM architectures for networking.

Technological Evolution

The field of network troubleshooting has evolved from manual inspection, to rule-based automation, then statistical/machine learning approaches for anomaly detection and coarser-grained RCA. The advent of LLMs represents a significant leap, offering human-like reasoning over unstructured textual data, which is abundant in network monitoring logs. This paper builds on this evolution by applying LLMs to perform detailed, explainable failure localization, bridging the gap between automated detection and human-level investigative reasoning.

Differentiation

BiAn differentiates itself from prior work, particularly LLM-based approaches, by:

  • Deep Visibility: Unlike prior LLM works that rely on coarse summaries or incident titles, BiAn processes detailed monitoring data, allowing for fully informed decisions.
  • Specific Device Pinpointing: While other LLM systems might provide root cause categories or summaries (RCACopILoT, MonitorAssistant, Oasis), BiAn specifically aims to pinpoint the exact error device with ranked probabilities and detailed explanations.
  • Comprehensive Integration: BiAn integrates SOP-based knowledge with network topology and event timeline data, enabling more sophisticated reasoning than single-aspect LLM applications.
  • Adaptability: The continuous prompt updating mechanism allows BiAn to evolve with the dynamic network infrastructure, a crucial feature for production environments that many static ML models lack.
  • Practical Deployment: BiAn is designed for and demonstrated in a real-world, large-scale production network, highlighting its robustness and operational value.

Methodology (Core Technology & Implementation Details)

BiAn is an LLM-powered framework designed to assist network operators in root causing and failure localization during incidents. It mimics human investigative processes by understanding monitor alerts, reasoning about root causes, and identifying error devices.

The overall architecture of BiAn is shown in Figure 4 (image not provided, but described as a schematic diagram including monitoring alert summary, device anomaly analysis, joint scoring, multi-pipeline reasoning stages, and final device ranking and root cause analysis results).

When an incident is reported, BiAn receives monitor alerts for a set of candidate devices, which are pre-selected by an upstream automated monitoring system as the most suspicious. BiAn then processes this data through several stages to predict probabilities for each device being the actual error device, providing a ranked list with explanations.

3.1 Data Preprocessing

Before feeding data to LLMs, BiAn preprocesses the raw monitor alerts. This involves removing empty items (e.g., keys with no values) and unused fields from the logs, which often contain redundancy and noise due to logging formatting.
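
As a minimal illustration of this cleanup step (the field names and the idea of an explicit unused-field list are assumptions for this sketch, not BiAn's actual schema):

    def preprocess_alert(alert: dict, unused_fields: set) -> dict:
        """Drop keys with empty values and fields that downstream agents never use."""
        return {
            key: value
            for key, value in alert.items()
            if key not in unused_fields and value not in (None, "", [], {})
        }

    # Example: preprocess_alert({"device_name": "B2", "debug_blob": "", "vendor_id": 7},
    #                           unused_fields={"vendor_id"})
    # -> {"device_name": "B2"}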

3.2 Hierarchical Reasoning

To manage the large volume and diversity of data and overcome LLM token limits, BiAn employs a hierarchical reasoning approach, processing different types of alerts for different devices separately and breaking down the investigation into three structured steps:

3.2.1 Monitor Alert Summary

  • Purpose: To condense raw monitor alerts into concise summaries, making them manageable for LLMs.
  • Process: BiAn processes alerts from 11 upstream monitoring tools (Figure 5 left). Each type of alert is handled by a dedicated LLM agent.
  • LLM Agent Prompting: These agents are prompted to extract key information and summarize anomalous behaviors for each device. The prompt template (Figure 18) consists of:
    1. Role Definition: Specifies the agent's function (e.g., "You are an expert network monitoring alert summarizer...").
    2. Input Field Description: Describes the fields in the raw log to be analyzed.
    3. Summary Guidelines: Provides instructions on how to summarize (e.g., "Extract key information... be concise...").
    4. Alert Example (One-shot): A concrete example of raw alert data with its expected summarized output, serving as in-context learning.
    5. Response Format: Defines the structure of the output summary (e.g., JSON format).
  • Output: For each device, BiAn generates 11 highly condensed summaries (one for each alert type) which are then forwarded for single-device anomaly analysis. This modular design allows easy extension for new monitoring tools.
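
As a rough sketch of how such a five-part, one-shot summarization prompt could be assembled (the example alert, field names, and wording below are hypothetical, not BiAn's actual template):

    import json

    def build_summary_prompt(alert_type: str, raw_alert: dict) -> str:
        """Assemble a one-shot summarization prompt following the five parts above:
        role definition, input field description, guidelines, example, response format."""
        # Hypothetical one-shot example for this alert type (not taken from the paper).
        example_alert = {"device_name": "B2", "dropped_ping_count": 150}
        example_summary = {"device": "B2", "anomaly": "150 dropped pings within 5 minutes"}
        return "\n".join([
            f"You are an expert summarizer of '{alert_type}' network monitoring alerts.",  # 1. role
            "Input fields: " + ", ".join(raw_alert.keys()),                                # 2. input description
            "Extract the key information and summarize anomalous behaviors concisely.",    # 3. guidelines
            "Example input: " + json.dumps(example_alert),                                 # 4. one-shot example
            "Example output: " + json.dumps(example_summary),
            "Respond in JSON with the keys 'device' and 'anomaly'.",                       # 5. response format
            "Alert to summarize: " + json.dumps(raw_alert),
        ])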

3.2.2 Single-Device Anomaly Analysis

  • Purpose: To identify specific anomaly scenarios exhibited by each suspect device based on its summarized alerts.
  • Key Insight: Operators have defined 7 distinct anomaly scenarios (Figure 5 right) that cover all observable anomalies for an error device. Each scenario relies on a specific combination of alert types. Examples include Flapping, Packet Loss, High Utilization, etc.
  • Process: BiAn employs 7 dedicated LLM agents (one for each anomaly scenario). Each agent analyzes the summarized alerts for a single device to determine if it exhibits its assigned anomaly. The prompt template (Figure 19) for these agents includes:
    1. Role Definition: Specifies the agent's role (e.g., "You are a network device anomaly analyzer...").
    2. Input Description: Describes the format of the summarized alerts.
    3. Analysis Instructions: Specifies how to perform the anomaly analysis based on SOPs and mapping in Figure 5.
    4. One-shot Example: An example input with the expected anomaly analysis.
    5. Response Format: Defines the structure of the output (e.g., "Anomaly found: True/False, Explanation: ...").
  • Output: An anomaly analysis report for each device, indicating which anomaly scenarios it exhibits. An anomaly is not necessarily the root cause, and a device can exhibit multiple anomalies.

3.2.3 Joint Scoring

  • Purpose: To consolidate all anomaly analysis reports from suspect devices and generate failure scores reflecting the likelihood of each device being the error device.
  • Key Insight: The true error device typically exhibits more severe and numerous anomalous behaviors compared to other affected (but not root cause) devices. Operators assign severity levels to different anomaly scenarios.
  • Process: An LLM agent evaluates the compiled anomaly analysis reports and severity levels to assign a failure score to each suspect device. The scores sum up to 1, and the highest-scoring device is initially marked as the error device.
  • Output: An initial ranked list of candidate devices with failure scores.
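
The failure scores themselves come from the joint-scoring LLM agent; the normalization and ranking convention described above (scores summing to 1, highest first) can be sketched as follows, with made-up scores in the example:

    def rank_devices(raw_scores: dict) -> list:
        """Normalize per-device failure scores so they sum to 1 and rank them descending."""
        total = sum(raw_scores.values()) or 1.0
        normalized = {device: score / total for device, score in raw_scores.items()}
        return sorted(normalized.items(), key=lambda item: item[1], reverse=True)

    # rank_devices({"B2": 0.7, "B3": 0.5, "A1": 0.1})
    # -> [('B2', 0.538...), ('B3', 0.385...), ('A1', 0.077...)]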

3.3 Three-Pipeline Integration

To address complex spatial and temporal relationships and improve accuracy beyond the SOP-based Pipeline 1, BiAn incorporates additional information through two more pipelines, creating a two-stage reasoning framework.

3.3.1 Pipeline 2: Network Topology

  • Purpose: To incorporate network interconnections into the reasoning, as anomalies can propagate. A parent node causing anomalies on its children is more suspicious.
  • Process: BiAn extracts topology-related information from logs. It identifies a smaller sub-topology containing only suspect devices and their relevant connections. An LLM agent processes JSON-formatted topology data using an algorithm:
    • Find shortest paths between every two suspect devices.
    • Remove a pair's shortest path if an intermediate suspect device already represents the relationship (e.g., the A-B path is dropped in favor of A-C and C-B when suspect device C lies on the A-B path).
    • Construct a simplified topology.
    • Aggregate non-suspect devices in the same group into a single node for simplification.
  • Output: A simplified, relevant network topology for the suspect devices.
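
A rough reconstruction of the path-simplification idea over a plain adjacency dict (BFS rather than any particular graph library; the grouping of non-suspect devices is omitted, and names are illustrative rather than BiAn's implementation):

    from collections import deque
    from itertools import combinations

    def shortest_path(adj: dict, src: str, dst: str):
        """Breadth-first shortest path over an undirected adjacency dict."""
        prev, seen, queue = {src: None}, {src}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = [node]
                while prev[path[-1]] is not None:
                    path.append(prev[path[-1]])
                return path[::-1]
            for neighbor in adj.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    prev[neighbor] = node
                    queue.append(neighbor)
        return None

    def simplify_topology(adj: dict, suspects: set) -> set:
        """Keep only pairwise shortest paths between suspect devices, dropping a
        pair's path when another suspect already lies on it (that relationship is
        then represented transitively, as described above)."""
        kept_edges = set()
        for a, b in combinations(sorted(suspects), 2):
            path = shortest_path(adj, a, b)
            if path is None:
                continue
            if any(node in suspects for node in path[1:-1]):
                continue  # an intermediate suspect already represents this pair
            kept_edges.update(zip(path, path[1:]))
        return kept_edges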

3.3.2 Pipeline 3: Event Timeline

  • Purpose: To incorporate the temporal sequence of events, as devices with anomalies reported earlier often have higher suspicion.
  • Process: BiAn compiles a timeline of events from all devices, ordered by their start times.
  • Output: A global timeline of anomaly events.

3.3.3 Integrated Root Causing

  • Two-Stage Reasoning: BiAn does not run all pipelines from the start. After Pipeline 1 produces initial scores, it filters out devices with low suspicion using a Top-$\mathcal{P}$ filter (see the sketch after this list).
    • Top-$\mathcal{P}$ Concept: BiAn retains the devices whose cumulative softmax scores (ranked in descending order) reach a threshold $p$. This focuses the second stage on the most suspicious devices.
    • Formula for Softmax: $$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{D} e^{x_j}}$$ Where:
      • $x_i$ is the initial failure score for device $i$.
      • $D$ is the total number of candidate devices.
      • $e^{x_i}$ is the exponential of the score of device $i$.
      • $\sum_{j=1}^{D} e^{x_j}$ is the sum of exponentials over all device scores, normalizing the output to a probability distribution.
  • Second Stage LLM Agent: An enhanced final reasoning LLM agent takes inputs from all three pipelines: device anomaly reports (Pipeline 1), topological relationships (Pipeline 2), and temporal event data (Pipeline 3).
  • Prompt Template (Figure 6): This prompt specifies the task objective, combines information from the three pipelines, and lists principles for root cause analysis (e.g., "Prioritize devices that are upstream or central in the affected topology," "Consider events occurring earlier as potentially causal").
  • Output: Refined failure scores and explanations.
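
A minimal sketch of the Top-$\mathcal{P}$ filtering over softmax scores, assuming the Stage-1 scores arrive as a device-to-score dict (the threshold value is illustrative):

    import math

    def top_p_filter(scores: dict, p: float = 0.9) -> list:
        """Softmax the initial failure scores and keep the highest-ranked devices
        until their cumulative probability reaches the threshold p."""
        exps = {device: math.exp(score) for device, score in scores.items()}
        z = sum(exps.values())
        ranked = sorted(((d, e / z) for d, e in exps.items()),
                        key=lambda item: item[1], reverse=True)
        kept, cumulative = [], 0.0
        for device, prob in ranked:
            kept.append(device)
            cumulative += prob
            if cumulative >= p:
                break
        return kept

    # top_p_filter({"A2": 0.6, "B1": 0.25, "B2": 0.1, "A1": 0.05}, p=0.8)
    # keeps the few devices that together cover 80% of the softmax mass.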

3.3.4 Rank of Ranks for Stability

  • Problem: LLMs can exhibit randomness, leading to inconsistent outputs.
  • Solution: To mitigate this, BiAn uses a Rank of Ranks approach:
    1. The integrated reasoning step (Stage 2) is run $N$ times (empirically set to 3).
    2. For each run, devices are ranked by their failure scores.
    3. The average rank for each device across all $N$ trials is calculated.
    4. The device with the highest average rank is considered the error device.
  • Benefit: This approach smoothes out random fluctuations and improves result stability, particularly when absolute scores are close but relative rankings are consistent.
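
A small sketch of the Rank of Ranks aggregation, assuming each trial returns a device-to-score dict (the trial scores in the example are made up):

    def rank_of_ranks(trials: list) -> list:
        """Average each device's rank across the N scoring trials and order devices
        by that average; the device with the best (lowest) average rank comes first."""
        rank_sums = {}
        for scores in trials:
            ordered = sorted(scores, key=scores.get, reverse=True)
            for rank, device in enumerate(ordered, start=1):
                rank_sums[device] = rank_sums.get(device, 0.0) + rank
        return sorted(rank_sums, key=lambda device: rank_sums[device] / len(trials))

    # Scores fluctuate across three trials, but the relative ordering is consistent:
    # rank_of_ranks([{"B2": 0.6, "B3": 0.4}, {"B2": 0.55, "B3": 0.45}, {"B2": 0.7, "B3": 0.3}])
    # -> ['B2', 'B3']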

3.4 Continuous Prompt Updating

To adapt to the rapid evolution of network infrastructures and SOPs, BiAn continuously updates its prompts using an iterative loop (Figure 7):

3.4.1 Exploration

  • Initial Step: The scoring agent (from joint scoring) performs 5 reasoning attempts for each incident using the current prompt.
  • Diversification: For cases where accuracy is not 100%, reasoning is repeated with a higher temperature $T$. A higher temperature makes LLM outputs more diverse, increasing the chance of obtaining both correct and incorrect reasoning paths.
  • CoT Generation: Zero-shot Chain-of-Thought (CoT) prompting is used to generate intermediate reasoning steps alongside final scores, providing transparency.

3.4.2 Reflection

  • Agent R: An LLM agent (Agent R) analyzes the diverse reasoning trials (both correct and incorrect).
  • Knowledge Extraction: Agent R identifies key factors influencing the results and proposes actionable knowledge that could lead to correct reasoning or prevent incorrect reasoning.

3.4.3 Knowledge Consolidation

  • Agent C: A different LLM agent (Agent C) summarizes the extracted knowledge from all incidents.
  • Prioritization: Agent C counts the frequency of similar knowledge pieces and ranks them. The six most frequently occurring pieces of knowledge are retained for broad applicability.

3.4.4 Prompt Augmentation

  • Integration: The consolidated knowledge is integrated into the task prompt.
  • Buffer Array: A buffer stores knowledge from previous iterations. New knowledge is merged, and frequencies are updated.
  • Refined Prompt Generation: Another LLM uses the old prompt and the knowledge buffer to generate a "refined prompt." This ensures logical consistency and controlled adjustments.
  • Frequency: Prompt updates are performed every few months, aligning with major hardware/software upgrades.
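
The reflection and consolidation steps are performed by LLM agents; only the bookkeeping around them (merging new knowledge into the buffer, updating frequencies, and retaining the most frequent pieces) is mechanical and can be sketched as follows (exact-string counting is a simplification, since Agent C judges similarity semantically):

    from collections import Counter

    def consolidate_knowledge(buffer: Counter, new_pieces: list, keep: int = 6) -> list:
        """Merge this iteration's extracted knowledge into the persistent buffer,
        update frequencies, and retain the most frequent pieces for prompt augmentation."""
        buffer.update(new_pieces)
        return [piece for piece, _ in buffer.most_common(keep)]

    # knowledge_buffer = Counter()
    # retained = consolidate_knowledge(
    #     knowledge_buffer,
    #     ["Prioritize parent nodes in the affected topology",
    #      "Alerts that start earlier usually point to the origin"])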

3.5 System Optimizations

3.5.1 Fine-tuning

  • Strategy: Smaller, specialized LLMs are fine-tuned with domain-specific information for simpler tasks in Pipeline 1 (monitor alert summary and single-device anomaly analysis).
  • Data Generation: Synthetic data is generated based on real-world value distributions. GPT-4 is used to produce ground truth labels, which are then verified with rule-based algorithms and small-scale operator fidelity tests.
  • Benefit: Improves localization accuracy and reduces reasoning latency, as smaller models run faster and are more efficient for specific tasks.

3.5.2 Early Stop

  • Purpose: To save computational resources and reduce average response time for simpler cases.
  • Mechanism: After the first reasoning stage (Pipeline 1), BiAn calculates the entropy of the failure scores.
    • Formula for Entropy: $$H(P) = -\sum_{i=1}^{D} p_i \log_2(p_i)$$ Where:
      • $H(P)$ is the entropy of the probability distribution $P$.
      • $p_i$ is the probability (or normalized score) that device $i$ is the error device.
      • $D$ is the number of candidate devices.
      • A lower entropy value indicates higher confidence (less uncertainty), meaning the scores are skewed towards one device.
  • Decision: If entropy is below a predefined threshold, BiAn stops and outputs the current rank. Otherwise, high entropy (greater uncertainty) triggers the second stage (multi-pipeline integration) for more detailed reasoning.
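
A minimal sketch of the early-stop decision, assuming the Stage-1 failure scores have been normalized into a probability list (the 0.75 default mirrors the trade-off point reported in the evaluation):

    import math

    def should_stop_early(probs: list, threshold: float = 0.75) -> bool:
        """Compute the Shannon entropy of the normalized failure scores and stop
        after Stage 1 when the uncertainty falls below the threshold."""
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        return entropy < threshold

    # One device dominates -> low entropy -> stop early:
    # should_stop_early([0.9, 0.05, 0.03, 0.02])   # True
    # Scores are spread out -> high entropy -> continue to Stage 2:
    # should_stop_early([0.3, 0.28, 0.22, 0.2])    # False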

3.5.3 Parallel Execution

  • Strategy: All LLM agents within the same step (e.g., alert summary agents, anomaly analysis agents) run concurrently due to their independence.
  • Benefit: Maximizes efficiency and reduces overall latency, especially for the Rank of Ranks calculation in Stage 2.
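
A minimal sketch of fanning out the independent agents within one step, assuming each agent is a callable that issues a single LLM API request (names are illustrative):

    from concurrent.futures import ThreadPoolExecutor

    def run_agents_in_parallel(agents: list, shared_input) -> list:
        """Run independent LLM agents concurrently; each call is an I/O-bound API
        request, so a thread pool is enough to overlap their latencies."""
        with ThreadPoolExecutor(max_workers=max(len(agents), 1)) as pool:
            futures = [pool.submit(agent, shared_input) for agent in agents]
            return [future.result() for future in futures]

    # e.g. anomaly_reports = run_agents_in_parallel(anomaly_agents, device_summaries)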

Experimental Setup

Datasets

  • Source: Real incidents from Alibaba Cloud's global network infrastructure.
  • Volume: 357 non-trivial cases collected over a period of 17 months.
  • Characteristics: These incidents represent complex scenarios encountered in a large-scale production cloud environment. The raw data consists of diverse monitor alerts (from 11 tools), network topology information, and event timeline data.
  • Data Sample: The paper describes monitor alerts as having a semi-structured format with human-readable textual fields, including preliminary classifications and statistics. An example alert from the Appendix:
    {
      "event_type": "Device Ping Log",
      "device_name": "B2",
      "dropped_ping_count": 150,
      "timestamp_start": "2024-01-15T10:00:00Z",
      "timestamp_end": "2024-01-15T10:05:00Z",
      "device_role": "Core Router",
      "description": "Significant packet loss detected between B2 and its peer A1."
    }
    
    This example shows that the system processes pre-aggregated data, not raw time series, simplifying the LLM's task.
  • Justification: Using real-world data ensures the practical relevance and robustness of BiAn in a challenging production environment.

Evaluation Metrics

  • Accuracy: The primary metric for failure localization, defined as whether the system identifies the actual error device as the top-1 most suspicious device. The paper also reports top-2 and top-3 accuracies, meaning the error device is within the top 2 or 3 ranked devices.
    • Conceptual Definition: Accuracy measures how often BiAn correctly pinpoints the single, true root cause device. Top-N accuracy is a less strict measure, indicating how often the true root cause is among the top N devices proposed, which is relevant for assisting human operators who might check a few highly ranked candidates.
    • Mathematical Formula (derived from Appendix E): $$\mathrm{Accuracy} = \frac{Y}{X}$$ Where:
      • $Y$ is the number of incidents where BiAn successfully identifies the top-1 error device.
      • $X$ is the total number of incidents evaluated.
      • A short sketch of the computation appears after this list.
    • Secondary Metrics (derived from Appendix E, if needed):
      • False Positive Rate (FPR): $$\mathrm{FPR} = \frac{FP}{D \times X} = \frac{1}{D}\left(1 - \frac{Y}{X}\right) = \frac{1}{D}\,(1 - \mathrm{Accuracy})$$ Where:
        • $FP$ is the number of false positives (non-root-cause devices incorrectly identified as the root cause).
        • $D$ is the initial number of candidate devices.
      • Other metrics, such as True Positive Rate, True Negative Rate, and False Negative Rate, could also be calculated from the confusion matrix, but the paper primarily focuses on overall Accuracy.
  • Time-to-Root-Causing (TTR): The duration from the initiation of an investigation until the identification of the root cause and error device.
    • Conceptual Definition: TTR quantifies the efficiency of the diagnostic phase of incident management. Reducing TTR is crucial for meeting SLAs and minimizing incident impact. It complements Time-to-Mitigation (TTM), which includes the time taken for actual corrective actions.
  • Latency: Measurement of the processing time for various components of BiAn and its end-to-end execution.
    • Conceptual Definition: Measures how quickly BiAn can produce its output, which is critical for providing real-time assistance to operators.
  • Operator Satisfaction Scores: Feedback collected from operators on the helpfulness of BiAn's explanations (on a scale of 0-2).
    • Conceptual Definition: A qualitative measure of the user experience and trustworthiness of the LLM's output, essential for an assistance tool.
  • Training and Inference Costs: Monetary costs associated with fine-tuning models and running LLM inference.
    • Conceptual Definition: Measures the economic viability of deploying BiAn at scale.
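
As referenced above, a small sketch of how the accuracy and FPR figures can be computed from per-incident rankings (the rankings and ground-truth labels here are hypothetical):

    def top_n_accuracy(rankings: list, ground_truth: list, n: int = 1) -> float:
        """Fraction of incidents whose true error device appears in the top-n of the
        predicted ranking; top-1 gives the paper's primary Accuracy = Y / X."""
        hits = sum(truth in ranked[:n] for ranked, truth in zip(rankings, ground_truth))
        return hits / len(ground_truth)

    def false_positive_rate(accuracy: float, num_candidates: int) -> float:
        """FPR = (1 - Accuracy) / D, following the Appendix E relation quoted above."""
        return (1 - accuracy) / num_candidates

    # top_n_accuracy([["B2", "B3"], ["A1", "A2"]], ["B2", "A2"], n=1)  # -> 0.5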

Baselines

  • Hot Device: One of Alibaba Cloud's existing automated failure localization tools.
    • Principle: Identifies the device with the most associated alerts. It assumes the error device, being the source, will exhibit various anomalies, while neighboring devices show only partial or less severe symptoms. This is analogous to 007 [7] which pinpoints problems in TCP flows.
    • Why representative: Represents a common, simple, and currently used automated approach in production.
  • RCACopILoT variant: A variant of the RCACopILoT [16] system.
    • Principle: Outputs coarser-grained root cause categories with explanations after triggering active probing, but does not pinpoint specific error devices.
    • Why representative: Represents a more advanced LLM-based approach but with a different scope (category vs. specific device). This is used in the A/B tests for TTR comparison.

Models Used

  • For performance evaluation:
    • Qwen2.5: The default LLM for BiAn.
    • Qwen2.5-7B-Instruct (fine-tuned): Used for the monitor alert summary and single-device anomaly analysis steps in Pipeline 1.
    • Qwen2.5-72B-Instruct: Used for all other tasks (e.g., joint scoring, integrated root causing).
  • For cross-model evaluation: Llama-3.1 (405B), GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro.

Infrastructure

  • Server: Intel Xeon Platinum 8269CY CPU @ 2.50 GHz, 32 GB RAM, 1 TB SSD. This setup is used for both online deployment and evaluation.

Results & Analysis

6.1 Accuracy

BiAn's accuracy is measured by how often it correctly identifies the true error device as the top-1 ranked device. The paper also reports top-2 and top-3 accuracies.

  • Overall Accuracy:

    • BiAn achieved 95.5% top-1 accuracy (averaged over 10 runs) across 357 cases.
    • The Hot Device baseline achieved 86.3% top-1 accuracy.
    • When Hot Device produced tied top-1 results (indicating indecision), its accuracy dropped to 70.5%, while BiAn maintained 97.1%. This highlights BiAn's ability to differentiate between multiple suspicious devices.
    • BiAn's top-2 and top-3 accuracies were 98.6% and 99.3%, respectively (compared to Hot Device's 88.9% and 93.0%). These high top-N accuracies are valuable for operators, who can check a few top-ranked devices.
  • Localization Accuracy under Different Categorizations: The paper categorizes incidents by risk level, difficulty (resolution time), and incident type (Figure 9).

    Figure 9: Localization accuracy under different incident categorizations (risk level, difficulty, incident type), comparing Hot Device and BiAn.

    • Risk Level (Left):
      • BiAn maintains similar accuracy for low- and medium-risk incidents (above 95%).
      • For high-risk incidents, BiAn's accuracy drops below 90%. This is attributed to the high number of concurrent incidents, which increases noise and complicates information extraction. However, BiAn still significantly outperforms Hot Device, whose accuracy drops more sharply for high-risk incidents.
    • Difficulty (Middle):
      • Easy cases ($t \leq 1$ min) show the highest accuracy for BiAn (95.7%).
      • Medium cases ($1 < t \leq 5$ min) are also handled well (92.9%).
      • Hard cases ($t > 5$ min) show lower accuracy, reflecting their inherent complexity, which also challenges human operators. The Hot Device baseline fails to report the actual error device in all hard cases. Despite this, BiAn's speed makes even trial-and-error more efficient than manual investigation for hard cases.
    • Incident Type (Right):
      • BiAn performs best in resolving line card-related failures, likely because these typically affect only the hosting device with less propagation.
      • Port, device, and network disconnection failures show comparable results.

6.2 Latency

BiAn aims to provide real-time assistance.

  • End-to-End Latency: The entire reasoning process completes within 30 seconds. This ensures BiAn's output is ready by the time operators typically get online after an incident notification.

  • Latency Breakdown (Figure 10):

    Figure 10: End-to-end latency breakdown (per-stage and per-step times, and actual versus theoretical total time).

    • The Reasoning Task (Stage 2) takes the longest, as it integrates data from all pipelines and devices.
    • The end-to-end latency is only 15% greater than the sum of individual components, demonstrating the effectiveness of parallel execution.
  • Latency Reduction from Fine-tuning (Figure 11):

    Figure 11: Latency for fine-tuned components (Monitor Alert Summary and Device Anomaly Analysis).

    • Fine-tuning smaller models for Monitor Alert Summary and Device Anomaly Analysis significantly reduces their latency compared to using larger, general-purpose models. This confirms that specialized, smaller models improve efficiency without compromising performance for less complex, domain-specific tasks.

6.3 Ablation Study

This study evaluates the contribution of different BiAn components.

  • Progressive Design (Figure 12):

    Figure 12: Accuracy at different design phases.

    • Running only Pipeline 1 (SOP-based) achieves 87.2% accuracy, already outperforming Hot Device.
    • Adding the Top-$\mathcal{P}$ filter and a second run of Pipeline 1 on the filtered candidates increases accuracy by 5.9%.
    • Further incorporating topology or timeline data individually brings additional accuracy gains.
    • The full BiAn with all components enabled reaches 95.5% accuracy.
    • Notably, running all three pipelines without the Top-$\mathcal{P}$ filter results in lower accuracy. This is because including all devices in all pipelines significantly inflates the input token count, diluting the LLM's attention. The Top-$\mathcal{P}$ filter is crucial for focusing the LLM on relevant information in Stage 2.
  • Early Stop Mechanism (Figure 13):

    Figure 13: Performance trade-off in the early stop mechanism (Top-1/2/3 accuracy versus processing time).

    • The early stop mechanism, controlled by an entropy threshold, allows BiAn to balance accuracy and latency.
    • A higher threshold (more cases stop early) leads to lower latency but also lower accuracy.
    • At zero threshold (all cases go to Stage 2), BiAn achieves highest accuracy but also highest latency.
    • The curves show a convex relationship: initial early stop adjustments yield substantial accuracy gains with minimal time investment.
    • The optimal trade-off point B (at an entropy threshold of 0.75) reduces overall processing time by 70% with only a 0.5% accuracy loss compared to always running Stage 2. This allows BiAn to handle simpler cases quickly while reserving Stage 2 for more challenging ones.

6.4 Result Stability

The Rank of Ranks technique was introduced to mitigate LLM randomness.

  • Impact of Rank of Ranks Rounds (Figure 14 left):

    Figure 14: Impact of Rank of Ranks rounds. Figure 15: Benefits of fine-tuning on Steps 1 and 2.

    • While average accuracy remains comparable (around 95.5%) across single-run, 3-round, and 5-round executions, the Rank of Ranks approach (3 or 5 rounds) significantly reduces the variance in accuracy.
    • Increasing the number of rounds from 3 to 5 offers diminishing returns. This suggests that $N = 3$ is a good empirical choice for balancing stability and computational cost. The paper notes this is inspired by self-consistency techniques.

6.5 Microbenchmarks

  • Monitor Alert Summary (Figure 15 right - "Benefits of fine-tuning on Steps 1 and 2"):

    • Fine-tuned models achieved an average accuracy of 98.7% for alert summarization.
    • This significantly outperforms the default Qwen model, highlighting the benefit of domain expertise for effective summarization even in "simpler" tasks.
  • Single-device Anomaly Analysis (Figure 15 right - "Benefits of fine-tuning on Steps 1 and 2"):

    • Fine-tuned models achieved 98.6% accuracy for anomaly analysis, beating the default model by over 6%.
    • This confirms that fine-tuning smaller models for domain-specific knowledge in less complex tasks improves accuracy.
  • Prompt Updating (Figure 16):

    Figure 16: Accuracy versus number of iterations in prompt updating (training and test accuracy for 0 to 3 iterations).

    • Using five-fold cross-validation, the prompt updating algorithm showed improvements.
    • For the full dataset, training accuracy increased from 88.7% to 90.0%, and test accuracy from 81.7% to 85.9% after three iterations. The marginal gain was attributed to a high baseline.
    • When excluding cases with already 100% accuracy, the improvements were more pronounced: training accuracy from 54.0% to 62.0%, and testing from 36.7% to 50.0% after three iterations.
    • Beyond three iterations, excessive prompt length started to hinder LLM comprehension, causing performance degradation, indicating a limitation of prompt-based training compared to parameter fine-tuning.

6.6 Training and Inference Costs

  • Cost Efficiency (Figure 17):

    Figure 17: Training and inference costs (CDF of per-incident cost in USD for BiAn Stage 1 versus full BiAn).

    • The average cost for fine-tuning specialized LLMs was $121.63.
    • The inference cost per incident was very low:
      • $0.17 for running only Pipeline 1.
      • $0.18 for running the full BiAn framework.
    • This demonstrates BiAn's cost-efficiency for deployment at scale, where even small per-incident gains can translate into significant savings in a large cloud environment. The additional $0.01 per incident for full BiAn offers substantial accuracy gains.

6.7 Running with Other LLMs

To validate generalizability, BiAn was tested with other state-of-the-art LLMs.

Table 1: Performance of different models.

Model Name           Size   Top 1 (%)   Top 2 (%)   Top 3 (%)
Qwen2.5              72B    95.5        98.6        99.3
Llama-3.1            405B   95.7        98.7        99.3
GPT-4o               -      95.2        98.1        99.3
Claude-3.5-Sonnet    -      93.9        97.4        98.5
Gemini-1.5-Pro       -      93.2        97.9        98.7

  • Consistency: BiAn consistently achieves high top-1 to top-3 accuracies across various leading LLMs (Qwen, Llama, GPT, Claude, Gemini).
  • Robustness: This cross-model evaluation highlights the robustness of the proposed framework, indicating that its design principles are effective regardless of the specific underlying LLM used.

6.8 Real-World Deployment Experience (A/B Tests & Case Studies)

BiAn has been deployed in Alibaba Cloud for 10 months.

6.8.1 A/B Tests during Deployment

  • Time-to-Root-Causing (TTR) Reduction (Figure 8a):

    Figure 8: Comparison of time and satisfaction scores: (a) average failure localization time for solo versus BiAn-assisted investigation across risk levels; (b) operator satisfaction with BiAn's explanations from April to December.

    • BiAn reduced TTR by 20.5% on average.
    • For high-risk incidents, TTR was reduced by a notable 55.2%. This is because LLM inference time does not significantly inflate with incident risk, unlike human manual investigation time.
    • An RCACopILoT variant, which only provides coarse root cause categories, showed larger TTR gaps at higher risk levels, as operators still needed to manually validate and pinpoint the error devices.
  • Explainability and Operator Satisfaction (Figure 8b):

    • Feedback showed explanations were predominantly rated 1 (somewhat helpful) or 2 (very helpful), with few 0s.
    • Even when the top-1 device was incorrect, explanations often provided useful context or intermediate reasoning, assisting operators.
    • Fluctuations in scores were attributed to rising operator expectations as BiAn iteratively improved.

6.8.2 Case Studies

  • Incident 1 (Example from §2.2):
    • Scenario: Five alerts on device B2, four on B3, other neighbors affected. Operators struggled to pinpoint the root cause for 22 minutes.
    • BiAn's Role: BiAn output B2 as most suspicious (four anomalies) and B3 second (three anomalies).
    • Result: Successfully detected anomalies and located the error device in 28 seconds, significantly reducing TTR compared to manual effort.
  • Incident 2 (Topology Integration):
    • Scenario: B1 connected to A1 and A2; A2 also to B2. Stage 1 assigned equal scores to A1 and A2 due to comparable anomalies.
    • BiAn's Role: Stage 2, with topology pipeline, revealed A2 was a parent node to B1.
    • Result: BiAn flagged A2 as the error device with a much higher score (root cause: line card failure in A2). TTR was 3 minutes. This highlights the value of topology information.
  • Incident 3 (Timeline Integration):
    • Scenario: A1, A2, and B1 showed anomalies (disconnection, network changes on A1; traffic drop on A2 and B1). All had priority-1 anomalies, making initial localization difficult.
    • BiAn's Role: Timeline data showed A1's alerts occurred earlier than A2's.
    • Result: BiAn identified A1 as the error device due to disconnection. TTR was 1.5 minutes. This demonstrates the critical role of temporal data.

6.8.3 Operational Experience

  • Selection of Candidate Devices:
    • Initially, a fixed number of six candidate devices were selected by the upstream monitoring system based on associated alerts. This provided >98% coverage of the error device.
    • A later refinement introduced a dynamic selection mechanism that considers the top suspicious devices reported by the various monitoring tools, narrows the set down to about six, and achieves >99% coverage.
  • Iterative Design Process:
    • BiAn evolved through continuous testing and feedback.
    • A key challenge was data retention and format: the three-pipeline design was initially delayed because topology and timeline data were not retained in older logs, requiring accumulation of new data.
  • Onboarding Operators:
    • BiAn was integrated into the NOC's on-call platform with a familiar user interface and output format (ranked devices, explanations).
    • This ensured a comfortable experience for new operators, requiring minimal training due to aligning with existing SOPs and close collaboration during development.

Conclusion & Personal Thoughts

Conclusion Summary

This paper introduces BiAn, an LLM-based framework for failure localization and root cause analysis in production-scale cloud networks. BiAn addresses the challenges of massive data volume, complex interdependencies, time pressure, and dynamic network environments through several key innovations: a hierarchical reasoning structure, multi-pipeline integration (incorporating SOPs, topology, and timeline data), continuous prompt updating, and various system optimizations (fine-tuning, early stop, parallel execution).

The successful 10-month deployment at Alibaba Cloud demonstrates BiAn's practical effectiveness. It significantly reduces the time-to-root-causing (20.5% overall, 55.2% for high-risk incidents) and improves localization accuracy (95.5% top-1 accuracy, 9.2% improvement over baseline). BiAn also proves cost-efficient and robust across different LLM backbones. The framework's explainability and alignment with operator workflows ensure high adoption and trust.

Limitations & Future Work (as identified by the authors)

  1. Multi-device Failures: BiAn currently assumes a single error device per incident. While rare, multiple devices can contribute equally to a failure. BiAn might rank them closely, allowing operators to identify both, but future work could explicitly model multi-device root causes.
  2. Reliance on Upstream Monitors: BiAn's effectiveness depends on the quality of data from upstream monitors. If monitors generate invalid or ambiguous data, BiAn may struggle to provide accurate results, similar to human operators.
  3. Link-based Incidents: BiAn currently focuses on device-based incidents, which account for the majority of cases. It does not yet support link-based incidents, which follow different investigation processes. Future work involves extending BiAn to handle both.
  4. API Latency Fluctuation: External LLM API services can exhibit substantial latency fluctuations, impacting latency-sensitive tasks. Adopting locally deployed, dedicated resources could offer more stable performance.
  5. Explainability and Consistency Validation: While BiAn provides explanations, validating the consistency between LLM scores and explanations (e.g., ensuring a correct ranking isn't supported by flawed reasoning) is an area for future research.
  6. Prompt Length and Obedience: LLM performance can decline with excessive prompt length, and LLM agents sometimes exhibit inconsistent obedience to instructions. Balancing control and flexibility in prompt engineering remains a challenge.
  7. Adaptive Multi-agent Systems: The current structured, hierarchical framework has limited flexibility. Future work could explore more adaptive multi-agent systems where agents can dynamically plan tasks, retry failures, or invoke other agents to handle increasingly complex and unforeseen incidents.

Personal Insights & Critique

This paper presents a compelling and practically relevant application of LLMs in NetOps. The rigor of its design, including the hierarchical reasoning, multi-pipeline integration, and continuous prompt updating mechanisms, effectively addresses the inherent complexities of large-scale network troubleshooting. The emphasis on explainability and operator satisfaction is crucial for real-world adoption, acknowledging that LLMs are assistants, not replacements.

Strengths:

  • Real-world Impact: The deployment in Alibaba Cloud for 10 months and the quantitative improvements in TTR and accuracy provide strong evidence of practical value. This is a significant strength compared to many academic papers that remain in theoretical or lab settings.
  • Comprehensive Approach: BiAn doesn't just throw an LLM at the problem; it integrates LLMs within a carefully designed framework that mirrors human SOPs and incorporates diverse data types (alerts, topology, timeline).
  • Adaptability: The continuous prompt updating mechanism is a novel and essential feature for dynamic production environments, addressing a key limitation of static ML models in NetOps.
  • Cost-Effectiveness: The low training and inference costs demonstrate that LLM-based solutions can be economically viable at scale.

Potential Areas for Further Exploration/Critique:

  • Data Scarcity for Prompt Updating: While the prompt updating mechanism is ingenious, its reliance on a relatively small number of "difficult" cases (where accuracy isn't 100%) for knowledge extraction might limit the diversity of learned knowledge over time. The "performance degradation as the LLM begins to ignore most of the augmented instructions" due to prompt length is a fundamental LLM challenge that needs more robust solutions beyond just summarization. Could more sophisticated knowledge distillation or sparse attention mechanisms be explored?

  • Generalizability Beyond Alibaba Cloud: While the cross-model evaluation is good, the framework's reliance on specific SOPs and monitor alert formats from Alibaba Cloud means direct transferability to other organizations might require significant re-engineering of the initial LLM agents and knowledge base. This is a common challenge for domain-specific AIOps solutions. Sharing anonymized datasets or synthetic representations would greatly aid external research.

  • Black-box Nature of LLM Reasoning: Despite CoT and explanations, LLMs are still largely black boxes. The limitation of "correct rankings come with wrong explanations" or "useful explanations are hidden by incorrect rankings" is a deep problem. Future work could explore formal verification of reasoning steps or neuro-symbolic AI approaches to ensure logical soundness, not just plausible-sounding explanations.

  • The "Six Candidate Devices" Limit: While 99% coverage is excellent, a 1% miss rate can still be significant in critical infrastructure. The reliance on an upstream system to select "the most suspicious devices" implies that BiAn inherits the limitations or biases of this upstream selection. Further research into making this initial selection more robust and LLM-informed could enhance the overall system.

  • Comparison to Human Experts: While A/B tests show TTR reduction, a deeper qualitative analysis of how BiAn's reasoning compares to human expert reasoning (e.g., does it find non-obvious root causes that humans often miss?) could provide more profound insights.

    Overall, BiAn is an impressive demonstration of how LLMs can be engineered into a practical and impactful AIOps solution for complex network management. It sets a strong foundation for future research in LLM-driven network automation and intelligent troubleshooting.
