Towards LLM-Based Failure Localization in Production-Scale Networks
TL;DR Summary
BiAn leverages LLMs to analyze network monitoring data, providing ranked error device diagnoses with explanations, reducing root cause analysis time and improving accuracy in production-scale cloud networks.
Abstract
Towards LLM-Based Failure Localization in Production-Scale Networks

Chenxu Wang (1,2), Xumiao Zhang (2), Runwei Lu (2,3), Xianshang Lin (2), Xuan Zeng (2), Xinlei Zhang (2), Zhe An (2), Gongwei Wu (2), Jiaqi Gao (2), Chen Tian (1), Guihai Chen (1), Guyue Liu (4), Yuhong Liao (2), Tao Lin (2), Dennis Cai (2), Ennan Zhai (2)

(1) Nanjing University, (2) Alibaba Cloud, (3) New York University Shanghai, (4) Peking University

Root causing and failure localization are critical to maintaining reliability in cloud network operations. When an incident is reported, network operators must review massive volumes of monitoring data and identify the root cause (i.e., the error device) as fast as possible, which is extremely challenging even for experienced operators. Large language models (LLMs) have shown great potential in text understanding and reasoning. In this paper, we present BiAn, an LLM-based framework designed to assist operators in efficient incident investigation. BiAn processes monitoring data and generates error device rankings with detailed explanations. To date, BiAn has been deployed in our network infrastructure for 10 months, and it has successfully assisted operators in reducing the time to root causing by 20.5% overall (55.2% for high-risk incidents).
In-depth Reading
Bibliographic Information
- Title: Towards LLM-Based Failure Localization in Production-Scale Networks
- Authors: Chenxu Wang, Xumiao Zhang, Runwei Lu, Xianshang Lin, Xuan Zeng, Xinlei Zhang, Zhe An, Gongwei Wu, Jiaqi Gao, Chen Tian, Guihai Chen, Guyue Liu, Yuhong Liao, Tao Lin, Dennis Cai, Ennan Zhai
- Affiliations: Nanjing University, Alibaba Cloud, New York University Shanghai, Peking University
- Journal/Conference: ACM SIGCOMM 2025 Conference (SIGCOMM '25)
- Comment on Venue Reputation: SIGCOMM is widely regarded as the premier conference in the field of computer networking. Publication at SIGCOMM signifies highly impactful and rigorously peer-reviewed research, making this paper a significant contribution to the networking community.
- Publication Year: 2025 (scheduled)
- Abstract: Root causing and failure localization are crucial for maintaining reliability in large-scale cloud networks. Network operators currently face the challenge of sifting through vast volumes of monitoring data to identify the root cause (i.e., error device) quickly during incidents. Large Language Models (LLMs) demonstrate strong potential in text understanding and reasoning for such tasks. This paper introduces BiAn, an LLM-based framework designed to assist operators in efficient incident investigation. BiAn processes monitoring data to generate ranked lists of error devices with detailed explanations. Deployed in Alibaba Cloud's network infrastructure for 10 months, BiAn has successfully reduced the time to root causing by 20.5% overall (55.2% for high-risk incidents). Extensive evaluations over 17 months of real cases confirm BiAn's accuracy and speed, improving accuracy by 9.2% compared to a baseline approach.
- Original Source Link: https://ennanzhai.github.io/pub/sigcomm25-bian.pdf (PDF of the paper; a pre-print or accepted version for the upcoming SIGCOMM '25 conference).
Executive Summary
Background & Motivation (Why)
The paper addresses the critical problem of failure localization and root cause analysis in production-scale cloud networks. In large and complex network infrastructures, incidents are inevitable and can severely impact reliability, leading to significant revenue loss and reputational damage. When an incident occurs, network operators are tasked with quickly identifying the faulty device (the root cause) by reviewing massive volumes of monitoring data. This process is extremely challenging due to several factors:
- Massive Data Volume: An incident can trigger thousands of alerts from hundreds of devices, generating gigabytes of log entries (Figure 2).
- Complex Relationships: Anomalies can propagate through interconnected devices, making it difficult to trace the original error source, especially with spatial (topological) and temporal interdependencies.
- Time Pressure: Operators must diagnose and resolve incidents rapidly, often under strict service-level agreements (SLAs), making thorough manual review impractical.
- Network Evolution: Modern networks are highly dynamic, requiring troubleshooting tools to continuously adapt to changes in configurations and topologies.
- Explainability and Consistency: Any automated system must provide clear, trustworthy explanations to assist operators, not just replace them, and minimize LLMs' inherent randomness.

Traditional automated methods often lack generalization capabilities for unseen cases or are limited to coarse-grained analysis. Large Language Models (LLMs) have shown superior capabilities in text understanding and reasoning, presenting a promising direction to assist human operators in sifting through textual monitoring data and performing logical deductions. The paper's novel approach is to leverage LLMs to build an intelligent assistant that mimics and augments human operators' incident investigation processes, providing accurate, fast, and explainable failure localization.
Main Contributions / Findings (What)
The primary contributions of this paper are:
- BiAn Framework: Introduction of BiAn, a practical and comprehensive LLM-based framework for accurate and fast root causing and failure localization in production cloud network operations.
- Hierarchical Reasoning: A novel hierarchical reasoning mechanism that breaks down the complex problem of analyzing large-scale, diverse monitoring data into manageable steps: monitor alert summary, single-device anomaly analysis, and joint scoring.
- Multi-pipeline Integration: Integration of three distinct pipelines (SOP-based anomaly analysis, network topology, and event timeline) to tackle the spatial and temporal complexity of network incidents, enhancing reasoning accuracy.
- Continuous Prompt Updating: A mechanism for continuously updating LLM prompts by extracting knowledge from historical incidents, enabling the system to adapt to evolving network infrastructures and operational procedures.
- System Optimizations: Implementation of several optimizations, including fine-tuning smaller LLMs for specific tasks, an early stop mechanism for efficient processing of simpler cases, and parallel execution for latency reduction.
- Real-world Deployment and Evaluation: Successful deployment of BiAn in Alibaba Cloud's global network infrastructure for 10 months, demonstrating its practical effectiveness. Extensive performance evaluations over 17 months of real-world incidents show significant improvements:
  - Reduced Time-to-Root-Causing (TTR): BiAn reduced TTR by 20.5% on average, with a notable 55.2% reduction for high-risk incidents.
  - Improved Accuracy: BiAn achieved 95.5% accuracy (identifying the error device as top-1) compared to 86.3% for the baseline Hot Device approach, and 97.1% when the baseline failed to differentiate between top devices.
  - Cost-Efficiency: Demonstrated low training and inference costs (on average $0.17-$0.18 per incident).
Prerequisite Knowledge & Related Work
Foundational Concepts
- Large Language Models (LLMs): Advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and reason with human language. Key capabilities leveraged in BiAn include text understanding, reasoning, and text generation for explanations. LLMs process prompts (instructions or questions) and generate responses.
- Network Operations (NetOps): The management of network infrastructure, including monitoring, maintenance, troubleshooting, and ensuring network reliability and performance.
- AIOps (Artificial Intelligence for IT Operations): The application of artificial intelligence (especially machine learning and LLMs) to automate and streamline IT operations tasks, including anomaly detection, root cause analysis, and performance optimization. BiAn is an example of AIOps for networking.
- Root Cause Analysis (RCA): The process of identifying the fundamental reason for an incident or problem. In networking, this means pinpointing the specific device, link, or configuration error that initiated a failure.
- Failure Localization: The process of identifying the exact location (e.g., a specific device or link) where a failure has occurred within a network. This is a key outcome of RCA.
- Incident Management (IM): The structured process of responding to and resolving incidents to restore normal service operation as quickly as possible. This involves detection, diagnosis (including RCA and failure localization), mitigation, and resolution.
- Network Telemetry: Data collected from network devices (e.g., routers, switches) about their operational state, performance metrics, and events. This data is crucial for monitoring and troubleshooting.
- Standard Operating Procedures (SOPs): Detailed, written instructions to achieve uniformity in performing a specific function. In NetOps, SOPs guide operators through incident investigation steps. BiAn's design is informed by these SOPs.
- Fine-tuning: A technique in machine learning where a pre-trained model (like an LLM) is further trained on a smaller, specific dataset to adapt it to a particular task or domain. This typically improves performance on the specialized task and can make smaller models more efficient.
- Retrieval Augmented Generation (RAG): An LLM technique where the model retrieves relevant information from a knowledge base (e.g., documents, databases) before generating a response, helping it provide more accurate and context-aware answers and reduce hallucinations. The paper notes RAG is impractical for BiAn's continuous prompt updating due to data labeling challenges.
- Chain-of-Thought (CoT) Prompting: A technique for LLMs where the model is prompted to generate intermediate reasoning steps before providing a final answer. This improves the model's ability to solve complex reasoning problems and makes its thought process more transparent.
- Self-consistency: A technique that prompts an LLM multiple times for the same question, generates several CoT paths, and then selects the most consistent answer among them. This often improves accuracy and robustness.
- Entropy: In information theory, entropy quantifies the uncertainty or randomness in a probability distribution. Lower entropy indicates higher confidence in a prediction (e.g., when one device has a much higher probability of being the error device).
- Token Limits: LLMs have a maximum number of tokens (words or sub-words) they can process in a single input or output. This limits the amount of raw data an LLM can analyze at once.
Previous Works
The paper discusses various categories of related work:
- Network Telemetry and Measurements:
  - Many studies collect and analyze network data at the flow or packet level (e.g., 007 [7], which pinpoints problems in TCP flows). These tools help with preliminary identification but often cannot pinpoint exact error devices.
  - Other works optimize telemetry systems for efficiency and reliability (e.g., FlowRadar [41], Omnimon [32]). NetPilot [74], CorrOpt [91], and SWARM [52] focus on mitigating failures after they are located, complementing BiAn's localization role.
- Network Failure Localization (Traditional Methods):
  - Statistical/Rule-based: Approaches using statistical techniques or predefined rules to locate link failures [28, 37], analyze packet drops [7, 53, 62], or troubleshoot IPTV/IP networks [6, 48, 76].
  - Probing/Tomography: Tools like NetSonar [83], NetPoirot [8], NetNORAD [38], and Pingmesh [22] identify bottlenecks or locate failures at a network level (e.g., spine, ToR switch). These can serve as data sources for BiAn.
  - Hardware/INT-based: Solutions requiring specific hardware support [26, 42, 66] or in-band network telemetry (INT) [11], often resource-intensive or intrusive, making them hard to deploy in heterogeneous infrastructures.
- Troubleshooting at Upper Layers:
  - Focuses on software/service issues in cloud environments, often identifying root causes by leveraging dependencies between entities [9, 17, 27, 34]. These primarily analyze performance metrics, while BiAn operates on textual alerts.
  - Other works address application-layer bugs unrelated to physical devices [43, 50, 57, 60, 68, 85].
- Utilizing LLMs in the Network Domain (Prior LLM-NetOps):
  - Monitoring/Coarse-grained RCA: MonitorAssistant [81] generates guidance-oriented reports, Oasis [33] assesses impact scope and summarizes outages, and RCACopilot [16] outputs root cause categories. These often still require significant operator effort to pinpoint the specific error device.
  - Intent Recognition/Summarization: NetAssistant [65] recognizes diagnosis intent, while Ahmed et al. [5] study GPT's ability to produce root causes from incident titles/digests. These typically lack deep visibility into detailed logs.
  - Cloud Software IM: Roy et al. [56] evaluate the ReAct framework for incident analysis, Zhang et al. [86] use in-context learning for general-purpose LLMs, RCAgent [69] uses tool-augmented LLMs for cloud job anomalies, and LMPACE [84] estimates confidence for black-box RCA. NetLLM [72] modifies LLM architectures for networking.
Technological Evolution
The field of network troubleshooting has evolved from manual inspection, to rule-based automation, then statistical/machine learning approaches for anomaly detection and coarser-grained RCA. The advent of LLMs represents a significant leap, offering human-like reasoning over unstructured textual data, which is abundant in network monitoring logs. This paper builds on this evolution by applying LLMs to perform detailed, explainable failure localization, bridging the gap between automated detection and human-level investigative reasoning.
Differentiation
BiAn differentiates itself from prior work, particularly LLM-based approaches, by:
- Deep Visibility: Unlike prior LLM works that rely on coarse summaries or incident titles, BiAn processes detailed monitoring data, allowing for fully informed decisions.
- Specific Device Pinpointing: While other LLM systems provide root cause categories or summaries (RCACopilot, MonitorAssistant, Oasis), BiAn specifically aims to pinpoint the exact error device with ranked probabilities and detailed explanations.
- Comprehensive Integration: BiAn integrates SOP-based knowledge with network topology and event timeline data, enabling more sophisticated reasoning than single-aspect LLM applications.
- Adaptability: The continuous prompt updating mechanism allows BiAn to evolve with the dynamic network infrastructure, a crucial feature for production environments that many static ML models lack.
- Practical Deployment: BiAn is designed for and demonstrated in a real-world, large-scale production network, highlighting its robustness and operational value.
Methodology (Core Technology & Implementation Details)
BiAn is an LLM-powered framework designed to assist network operators in root causing and failure localization during incidents. It mimics human investigative processes by understanding monitor alerts, reasoning about root causes, and identifying error devices.
The overall architecture of BiAn is shown in Figure 4 (image not provided, but described as a schematic diagram including monitoring alert summary, device anomaly analysis, joint scoring, multi-pipeline reasoning stages, and final device ranking and root cause analysis results).
When an incident is reported, BiAn receives monitor alerts for a set of candidate devices, which are pre-selected by an upstream automated monitoring system as the most suspicious. BiAn then processes this data through several stages to predict probabilities for each device being the actual error device, providing a ranked list with explanations.
3.1 Data Preprocessing
Before feeding data to LLMs, BiAn preprocesses the raw monitor alerts. This involves removing empty items (e.g., keys with no values) and unused fields from the logs, which often contain redundancy and noise due to logging formatting.
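As a concrete illustration, here is a minimal Python sketch of this cleaning step. The alert shape and the `UNUSED_FIELDS` deny-list are illustrative assumptions, not details from the paper:

```python
# Minimal sketch of alert preprocessing; UNUSED_FIELDS is a hypothetical
# deny-list of noisy keys, not named in the paper.
UNUSED_FIELDS = {"log_format_version", "collector_id"}  # hypothetical

def preprocess_alert(alert: dict) -> dict:
    """Drop unused fields and empty items (keys with no values) from a raw alert."""
    return {
        key: value
        for key, value in alert.items()
        if key not in UNUSED_FIELDS
        and value not in (None, "", [], {})  # remove keys with no values
    }

raw = {"event_type": "Device Ping Log", "collector_id": "c-42", "note": ""}
print(preprocess_alert(raw))  # {'event_type': 'Device Ping Log'}
```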
3.2 Hierarchical Reasoning
To manage the large volume and diversity of data and overcome LLM token limits, BiAn employs a hierarchical reasoning approach, processing different types of alerts for different devices separately and breaking down the investigation into three structured steps:
3.2.1 Monitor Alert Summary
- Purpose: To condense raw monitor alerts into concise summaries, making them manageable for LLMs.
- Process: BiAn processes alerts from 11 upstream monitoring tools (Figure 5, left). Each type of alert is handled by a dedicated LLM agent.
- LLM Agent Prompting: These agents are prompted to extract key information and summarize anomalous behaviors for each device. The prompt template (Figure 18) consists of:
  - Role Definition: Specifies the agent's function (e.g., "You are an expert network monitoring alert summarizer...").
  - Input Field Description: Describes the fields in the raw log to be analyzed.
  - Summary Guidelines: Provides instructions on how to summarize (e.g., "Extract key information... be concise...").
  - Alert Example (One-shot): A concrete example of raw alert data with its expected summarized output, serving as in-context learning.
  - Response Format: Defines the structure of the output summary (e.g., JSON format).
- Output: For each device, BiAn generates 11 highly condensed summaries (one for each alert type), which are then forwarded for single-device anomaly analysis. This modular design allows easy extension for new monitoring tools. A minimal sketch of assembling such a prompt appears below.
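The sketch below shows how a five-part prompt of this shape could be assembled. All concrete wording, field names, and the one-shot example are hypothetical; the paper's actual template is in its Figure 18:

```python
# Sketch of the five-part summarizer prompt (role, input fields, guidelines,
# one-shot example, response format). Wording here is hypothetical.
def build_summary_prompt(alert_json: str) -> str:
    parts = [
        "You are an expert network monitoring alert summarizer.",          # role
        "Input fields: device_name, event_type, timestamps, statistics.",  # fields
        "Extract key information and summarize anomalous behaviors; be concise.",
        # One-shot example (in-context learning):
        'Example input: {"event_type": "Device Ping Log", "dropped_ping_count": 150}\n'
        'Example output: {"device": "B2", "summary": "150 dropped pings in 5 min."}',
        'Respond in JSON: {"device": ..., "summary": ...}',                # format
        f"Alert to summarize:\n{alert_json}",
    ]
    return "\n\n".join(parts)
```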
3.2.2 Single-Device Anomaly Analysis
- Purpose: To identify specific anomaly scenarios exhibited by each suspect device based on its summarized alerts.
- Key Insight: Operators have defined 7 distinct anomaly scenarios (Figure 5, right) that cover all observable anomalies for an error device. Each scenario relies on a specific combination of alert types. Examples include Flapping, Packet Loss, High Utilization, etc.
- Process: BiAn employs 7 dedicated LLM agents (one for each anomaly scenario). Each agent analyzes the summarized alerts for a single device to determine whether it exhibits its assigned anomaly. The prompt template (Figure 19) for these agents includes:
  - Role Definition: Specifies the agent's role (e.g., "You are a network device anomaly analyzer...").
  - Input Description: Describes the format of the summarized alerts.
  - Analysis Instructions: Specifies how to perform the anomaly analysis based on SOPs and the mapping in Figure 5.
  - One-shot Example: An example input with the expected anomaly analysis.
  - Response Format: Defines the structure of the output (e.g., "Anomaly found: True/False, Explanation: ...").
- Output: An anomaly analysis report for each device, indicating which anomaly scenarios it exhibits. An anomaly is not necessarily the root cause, and a device can exhibit multiple anomalies.
3.2.3 Joint Scoring
- Purpose: To consolidate all anomaly analysis reports from suspect devices and generate failure scores reflecting the likelihood of each device being the error device.
- Key Insight: The true error device typically exhibits more severe and more numerous anomalous behaviors than other affected (but not root cause) devices. Operators assign severity levels to different anomaly scenarios.
- Process: An LLM agent evaluates the compiled anomaly analysis reports and severity levels to assign a failure score to each suspect device. The scores sum to 1, and the highest-scoring device is initially marked as the error device (a toy illustration of this scoring property follows below).
- Output: An initial ranked list of candidate devices with failure scores.
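To make the scoring properties concrete, here is a toy severity-weighted illustration. In BiAn this step is performed by an LLM agent; only the properties that scores sum to 1 and that more numerous and severer anomalies raise a device's score come from the text. The weights below are invented:

```python
# Toy illustration of joint scoring; BiAn uses an LLM agent here.
# The SEVERITY weights are hypothetical.
SEVERITY = {"Flapping": 3, "Packet Loss": 2, "High Utilization": 1}  # invented

def failure_scores(reports: dict) -> dict:
    raw = {dev: sum(SEVERITY.get(a, 1) for a in anomalies)
           for dev, anomalies in reports.items()}
    total = sum(raw.values()) or 1.0
    return {dev: score / total for dev, score in raw.items()}  # sums to 1

print(failure_scores({"B2": ["Flapping", "Packet Loss"], "B3": ["Packet Loss"]}))
# {'B2': 0.714..., 'B3': 0.285...}  -> B2 is initially marked as the error device
```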
3.3 Three-Pipeline Integration
To address complex spatial and temporal relationships and improve accuracy beyond the SOP-based Pipeline 1, BiAn incorporates additional information through two more pipelines, creating a two-stage reasoning framework.
3.3.1 Pipeline 2: Network Topology
- Purpose: To incorporate network interconnections into the reasoning, as anomalies can propagate. A parent node causing anomalies on its children is more suspicious.
- Process: BiAn extracts topology-related information from logs. It identifies a smaller sub-topology containing only suspect devices and their relevant connections. An LLM agent processes JSON-formatted topology data produced by the following algorithm (see the sketch after this list):
  - Find the shortest paths between every two suspect devices.
  - Remove a shortest path if an intermediate suspect device already represents the relationship (e.g., A-B is dropped in favor of A-C and C-B when suspect device C lies on the A-B path).
  - Construct a simplified topology.
  - Aggregate non-suspect devices in the same group into a single node for simplification.
- Output: A simplified, relevant network topology for the suspect devices.
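A sketch of the sub-topology construction, using networkx. The device names and graph construction are assumptions, and the grouping of non-suspect devices is omitted for brevity:

```python
# Sketch of the simplified-topology algorithm described above (networkx).
from itertools import combinations
import networkx as nx

def simplify_topology(full: nx.Graph, suspects: set) -> nx.Graph:
    sub = nx.Graph()
    sub.add_nodes_from(suspects)
    for a, b in combinations(sorted(suspects), 2):
        path = nx.shortest_path(full, a, b)
        # Drop A-B if another suspect C lies on the A-B path: the A-C and
        # C-B relationships will represent it instead.
        if any(node in suspects for node in path[1:-1]):
            continue
        sub.add_edge(a, b, via=path[1:-1])  # remember intermediate non-suspects
    return sub

g = nx.Graph([("A1", "B1"), ("A2", "B1"), ("A2", "B2")])
print(simplify_topology(g, {"A2", "B1", "B2"}).edges)  # keeps A2-B1 and A2-B2
```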
3.3.2 Pipeline 3: Event Timeline
- Purpose: To incorporate the temporal sequence of events, as devices whose anomalies are reported earlier often warrant higher suspicion.
- Process: BiAn compiles a timeline of events from all devices, ordered by their start times.
- Output: A global timeline of anomaly events.
3.3.3 Integrated Root Causing
- Two-Stage Reasoning: BiAn does not run all pipelines initially. After Pipeline 1 provides initial scores, it filters devices with low suspicion using a Top-$\mathcal{P}$ filter: BiAn retains devices whose cumulative softmax scores (ranked in descending order) reach a threshold $\mathcal{P}$. This focuses the second stage on the most suspicious devices (a small filtering sketch follows below).
  - Formula for Softmax:
  $$p_i = \frac{e^{s_i}}{\sum_{j=1}^{N} e^{s_j}}$$
  where $s_i$ is the initial failure score for device $i$, $N$ is the total number of candidate devices, $e^{s_i}$ is the exponential function applied to the score of device $i$, and $\sum_{j=1}^{N} e^{s_j}$ is the sum of exponentials of all device scores, normalizing the output to a probability distribution.
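A sketch of this Top-$\mathcal{P}$ filter under the softmax formula above; the 0.9 threshold used here is an illustrative assumption, since the paper's setting is not quoted in this summary:

```python
# Sketch of the Top-P filter on softmax-normalized failure scores.
import math

def top_p_filter(scores: dict, p: float = 0.9) -> list:
    z = sum(math.exp(s) for s in scores.values())
    ranked = sorted(scores, key=lambda d: scores[d], reverse=True)
    kept, cumulative = [], 0.0
    for device in ranked:
        kept.append(device)
        cumulative += math.exp(scores[device]) / z  # softmax probability
        if cumulative >= p:  # cumulative mass reached the threshold P
            break
    return kept

print(top_p_filter({"A1": 2.0, "A2": 1.9, "B1": 0.1}))  # ['A1', 'A2']
```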
- Second Stage LLM Agent: An enhanced final reasoning LLM agent takes inputs from all three pipelines: device anomaly reports (Pipeline 1), topological relationships (Pipeline 2), and temporal event data (Pipeline 3).
- Prompt Template (Figure 6): This prompt specifies the task objective, combines information from the three pipelines, and lists principles for root cause analysis (e.g., "Prioritize devices that are upstream or central in the affected topology," "Consider events occurring earlier as potentially causal").
- Output: Refined failure scores and explanations.
3.3.4 Rank of Ranks for Stability
- Problem: LLMs can exhibit randomness, leading to inconsistent outputs.
- Solution: To mitigate this, BiAn uses a Rank of Ranks approach (a minimal sketch follows this list):
  - The integrated reasoning step (Stage 2) is run multiple times (empirically, 3 rounds).
  - For each run, devices are ranked by their failure scores.
  - The average rank of each device across all trials is calculated.
  - The device with the best average rank is considered the error device.
- Benefit: This approach smooths out random fluctuations and improves result stability, particularly when absolute scores are close but relative rankings are consistent.
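A sketch of Rank of Ranks over per-trial score dictionaries; `trials` stands in for the outputs of repeated Stage 2 runs:

```python
# Sketch of Rank of Ranks: rank devices within each trial, average the
# rank positions, and pick the device with the best average rank.
def rank_of_ranks(trials: list) -> list:
    avg = {}
    for scores in trials:  # each trial maps device -> failure score
        ranked = sorted(scores, key=lambda d: scores[d], reverse=True)
        for position, device in enumerate(ranked, start=1):
            avg[device] = avg.get(device, 0.0) + position / len(trials)
    return sorted(avg, key=lambda d: avg[d])  # best average rank first

trials = [{"A2": 0.6, "B1": 0.4}, {"A2": 0.55, "B1": 0.45}, {"B1": 0.6, "A2": 0.4}]
print(rank_of_ranks(trials))  # ['A2', 'B1']: A2 averages rank 1.33
```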
3.4 Continuous Prompt Updating
To adapt to the rapid evolution of network infrastructures and SOPs, BiAn continuously updates its prompts using an iterative loop (Figure 7):
3.4.1 Exploration
- Initial Step: The scoring agent (from joint scoring) performs 5 reasoning attempts for each incident using the current prompt.
- Diversification: For cases where accuracy is not 100%, reasoning is repeated with a higher temperature. A higher temperature makes LLM outputs more diverse, increasing the chance of obtaining both correct and incorrect reasoning paths.
- CoT Generation: Zero-shot Chain-of-Thought (CoT) prompting is used to generate intermediate reasoning steps alongside final scores, providing transparency.
3.4.2 Reflection
- Agent R: An LLM agent (Agent R) analyzes the diverse reasoning trials (both correct and incorrect).
- Knowledge Extraction: Agent R identifies key factors influencing the results and proposes actionable knowledge that could lead to correct reasoning or prevent incorrect reasoning.
3.4.3 Knowledge Consolidation
- Agent C: A different LLM agent (Agent C) summarizes the extracted knowledge from all incidents.
- Prioritization: Agent C counts the frequency of similar knowledge pieces and ranks them. The six most frequently occurring pieces of knowledge are retained for broad applicability.
3.4.4 Prompt Augmentation
- Integration: The consolidated knowledge is integrated into the task prompt.
- Buffer Array: A buffer stores knowledge from previous iterations. New knowledge is merged, and frequencies are updated (see the sketch below).
- Refined Prompt Generation: Another LLM uses the old prompt and the knowledge buffer to generate a "refined prompt." This ensures logical consistency and controlled adjustments.
- Frequency: Prompt updates are performed every few months, aligning with major hardware/software upgrades.
3.5 System Optimizations
3.5.1 Fine-tuning
- Strategy: Smaller, specialized LLMs are fine-tuned with domain-specific information for the simpler tasks in Pipeline 1 (monitor alert summary and single-device anomaly analysis).
- Data Generation: Synthetic data is generated based on real-world value distributions. GPT-4 is used to produce ground truth labels, which are then verified with rule-based algorithms and small-scale operator fidelity tests.
- Benefit: Improves localization accuracy and reduces reasoning latency, as smaller models run faster and are more efficient for specific tasks.
3.5.2 Early Stop
- Purpose: To save computational resources and reduce average response time for simpler cases.
- Mechanism: After the first reasoning stage (Pipeline 1), BiAn calculates the entropy of the failure scores.
  - Formula for Entropy:
  $$H(P) = -\sum_{i=1}^{N} p_i \log p_i$$
  where $H(P)$ is the entropy of the probability distribution $P$, $p_i$ is the probability (or normalized score) that device $i$ is the error device, and $N$ is the number of candidate devices. A lower entropy value indicates higher confidence (less uncertainty), meaning the scores are skewed towards one device.
- Decision: If the entropy is below a predefined threshold, BiAn stops and outputs the current rank. Otherwise, high entropy (greater uncertainty) triggers the second stage (multi-pipeline integration) for more detailed reasoning. A small sketch of this decision follows below.
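A sketch of the early stop decision using the entropy formula above (natural log assumed); the 0.75 threshold is the operating point reported later in Section 6.3:

```python
# Sketch of the entropy-based early stop decision.
import math

def should_stop_early(probs: list, threshold: float = 0.75) -> bool:
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy < threshold  # low entropy: confident, skip Stage 2

print(should_stop_early([0.9, 0.05, 0.05]))        # True: skewed scores
print(should_stop_early([0.25, 0.25, 0.25, 0.25])) # False: trigger Stage 2
```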
3.5.3 Parallel Execution
- Strategy: All LLM agents within the same step (e.g., alert summary agents, anomaly analysis agents) run concurrently due to their independence (see the sketch below).
- Benefit: Maximizes efficiency and reduces overall latency, especially for the Rank of Ranks calculation in Stage 2.
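A sketch of this concurrency pattern; `call_agent` is a hypothetical wrapper around an LLM API call (I/O-bound, so threads suffice):

```python
# Sketch of running independent per-alert-type agents concurrently.
from concurrent.futures import ThreadPoolExecutor

def call_agent(agent_name: str, payload: str) -> str:
    return f"[{agent_name}] summary of {len(payload)} bytes"  # placeholder

def summarize_all(alerts_by_type: dict) -> dict:
    with ThreadPoolExecutor(max_workers=len(alerts_by_type)) as pool:
        futures = {t: pool.submit(call_agent, f"{t}-summarizer", data)
                   for t, data in alerts_by_type.items()}
        return {t: f.result() for t, f in futures.items()}

print(summarize_all({"ping": "raw ping alerts...", "syslog": "raw syslog..."}))
```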
Experimental Setup
Datasets
- Source: Real incidents from Alibaba Cloud's global network infrastructure.
- Volume: 357 non-trivial cases collected over a period of 17 months.
- Characteristics: These incidents represent complex scenarios encountered in a large-scale production cloud environment. The raw data consists of diverse monitor alerts (from 11 tools), network topology information, and event timeline data.
- Data Sample: The paper describes monitor alerts as having a semi-structured format with human-readable textual fields, including preliminary classifications and statistics. An example alert from the Appendix:

```json
{
  "event_type": "Device Ping Log",
  "device_name": "B2",
  "dropped_ping_count": 150,
  "timestamp_start": "2024-01-15T10:00:00Z",
  "timestamp_end": "2024-01-15T10:05:00Z",
  "device_role": "Core Router",
  "description": "Significant packet loss detected between B2 and its peer A1."
}
```

  This example shows that the system processes pre-aggregated data, not raw time series, simplifying the LLM's task.
- Justification: Using real-world data ensures the practical relevance and robustness of BiAn in a challenging production environment.
Evaluation Metrics
- Accuracy: The primary metric for failure localization, defined as whether the system identifies the actual error device as the top-1 most suspicious device. The paper also reports top-2 and top-3 accuracies, meaning the error device is within the top 2 or 3 ranked devices (a small computation sketch follows below).
  - Conceptual Definition: Accuracy measures how often BiAn correctly pinpoints the single, true root cause device. Top-N accuracy is a less strict measure, indicating how often the true root cause is among the top N devices proposed, which is relevant for assisting human operators who might check a few highly ranked candidates.
  - Mathematical Formula (derived from Appendix E):
  $$\text{Accuracy} = \frac{N_{\text{top-1}}}{N_{\text{total}}}$$
  where $N_{\text{top-1}}$ is the number of incidents in which BiAn successfully identifies the error device as top-1, and $N_{\text{total}}$ is the total number of incidents evaluated.
  - Secondary Metrics (derived from Appendix E, if needed):
    - False Positive Rate (FPR), reconstructed here from the surrounding definitions as
    $$\text{FPR} = \frac{FP}{N - 1}$$
    where $FP$ is the number of false positives (non-root-cause devices incorrectly identified as the root cause) and $N$ is the initial number of candidate devices (so $N - 1$ counts the non-root-cause candidates).
    - Other metrics such as True Positive Rate, True Negative Rate, and False Negative Rate could also be calculated from the confusion matrix, but the paper primarily focuses on overall Accuracy.
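A sketch of computing top-N accuracy as defined above, over per-incident device rankings:

```python
# Sketch of the top-N accuracy computation defined above.
def top_n_accuracy(rankings: list, truths: list, n: int = 1) -> float:
    hits = sum(truth in ranked[:n] for ranked, truth in zip(rankings, truths))
    return hits / len(truths)

rankings = [["B2", "B3"], ["A1", "A2"], ["A2", "A1"]]
truths = ["B2", "A2", "A2"]
print(top_n_accuracy(rankings, truths, n=1))  # ~0.67: top-1 accuracy
print(top_n_accuracy(rankings, truths, n=2))  # 1.0: top-2 accuracy
```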
- Time-to-Root-Causing (TTR): The duration from the initiation of an investigation until the identification of the root cause and error device.
  - Conceptual Definition: TTR quantifies the efficiency of the diagnostic phase of incident management. Reducing TTR is crucial for meeting SLAs and minimizing incident impact. It complements Time-to-Mitigation (TTM), which includes the time taken for actual corrective actions.
- Latency: The processing time for various components of BiAn and its end-to-end execution.
  - Conceptual Definition: Measures how quickly BiAn can produce its output, which is critical for providing real-time assistance to operators.
- Operator Satisfaction Scores: Feedback collected from operators on the helpfulness of BiAn's explanations (on a scale of 0-2).
  - Conceptual Definition: A qualitative measure of the user experience and trustworthiness of the LLM's output, essential for an assistance tool.
- Training and Inference Costs: Monetary costs associated with fine-tuning models and running LLM inference.
  - Conceptual Definition: Measures the economic viability of deploying BiAn at scale.
Baselines
- Hot Device: One of Alibaba Cloud's existing automated failure localization tools.
  - Principle: Identifies the device with the most associated alerts. It assumes the error device, being the source, will exhibit various anomalies, while neighboring devices show only partial or less severe symptoms. This is analogous to 007 [7], which pinpoints problems in TCP flows.
  - Why representative: Represents a common, simple, and currently used automated approach in production.
- RCACopilot variant: A variant of the RCACopilot [16] system.
  - Principle: Outputs coarser-grained root cause categories with explanations after triggering active probing, but does not pinpoint specific error devices.
  - Why representative: Represents a more advanced LLM-based approach but with a different scope (category vs. specific device). This is used in the A/B tests for TTR comparison.
Models Used
- For performance evaluation:
  - Qwen2.5: The default LLM for BiAn.
    - Qwen2.5-7B-Instruct (fine-tuned): Used for the monitor alert summary and single-device anomaly analysis steps in Pipeline 1.
    - Qwen2.5-72B-Instruct: Used for all other tasks (e.g., joint scoring, integrated root causing).
- For cross-model evaluation: Llama-3.1 (405B), GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro.
Infrastructure
- Server: Intel Xeon Platinum 8269CY CPU @ 2.50 GHz, 32 GB RAM, 1 TB SSD. This setup is used for both online deployment and evaluation.
Results & Analysis
6.1 Accuracy
BiAn's accuracy is measured by how often it correctly identifies the true error device as the top-1 ranked device. The paper also reports top-2 and top-3 accuracies.
- Overall Accuracy:
  - BiAn achieved 95.5% top-1 accuracy (averaged over 10 runs) across 357 cases.
  - The Hot Device baseline achieved 86.3% top-1 accuracy.
  - When Hot Device produced tied top-1 results (indicating indecision), its accuracy dropped to 70.5%, while BiAn maintained 97.1%. This highlights BiAn's ability to differentiate between multiple suspicious devices.
  - BiAn's top-2 and top-3 accuracies were 98.6% and 99.3%, respectively (compared to Hot Device's 88.9% and 93.0%). These high top-N accuracies are valuable for operators, who can check a few top-ranked devices.
- Localization Accuracy under Different Categorizations: The paper categorizes incidents by risk level, difficulty (resolution time), and incident type (Figure 9).

Figure 9: Localization accuracy under different incident categorizations (risk level, difficulty, incident type), comparing Hot Device and BiAn.

  - Risk Level (left):
    - BiAn maintains similar accuracy for low- and medium-risk incidents (above 95%).
    - For high-risk incidents, BiAn's accuracy drops below 90%. This is attributed to the high number of concurrent incidents, which increases noise and complicates information extraction. However, BiAn still significantly outperforms Hot Device, whose accuracy drops more sharply for high-risk incidents.
  - Difficulty (middle):
    - Easy cases show the highest accuracy for BiAn (95.7%). Medium cases are also high (92.9%). Hard cases show lower accuracy, reflecting their inherent complexity, which also challenges human operators. The Hot Device baseline fails to report the actual error devices in all hard cases. Despite this, BiAn's speed makes even trial-and-error more efficient than manual investigation for hard cases.
  - Incident Type (right):
    - BiAn performs best at resolving line card-related failures, likely because these typically affect only the hosting device with less propagation. Port, device, and network disconnection failures show comparable results.
6.2 Latency
BiAn aims to provide real-time assistance.
- End-to-End Latency: The entire reasoning process completes within 30 seconds. This ensures BiAn's output is ready by the time operators typically get online after an incident notification.
- Latency Breakdown (Figure 10):

Figure 10: End-to-end latency breakdown (per-stage and per-step timings; actual versus theoretical total).

  - The Reasoning Task (Stage 2) takes the longest, as it integrates data from all pipelines and devices.
  - The end-to-end latency is only 15% greater than the sum of the individual components, demonstrating the effectiveness of parallel execution.
- Latency Reduction from Fine-tuning (Figure 11):

Figure 11: Latency for fine-tuned components (Monitor Alert Summary and Device Anomaly Analysis).

  - Fine-tuning smaller models for Monitor Alert Summary and Device Anomaly Analysis significantly reduces their latency compared to using larger, general-purpose models. This confirms that specialized, smaller models improve efficiency without compromising performance on less complex, domain-specific tasks.
6.3 Ablation Study
This study evaluates the contribution of different BiAn components.
- Progressive Design (Figure 12):

Figure 12: Accuracy at different design phases.

  - Running only Pipeline 1 (SOP-based) achieves 87.2% accuracy, already outperforming Hot Device.
  - Adding the Top-$\mathcal{P}$ filter and a second run of Pipeline 1 on the filtered candidates increases accuracy by 5.9%.
  - Further incorporating topology or timeline data individually brings additional accuracy gains.
  - The full BiAn with all components enabled reaches 95.5% accuracy.
  - Notably, running all three pipelines without the Top-$\mathcal{P}$ filter results in lower accuracy, because including all devices in all pipelines significantly inflates the input token count, diluting the LLM's attention. The Top-$\mathcal{P}$ filter is crucial for focusing the LLM on relevant information in Stage 2.
- Early Stop Mechanism (Figure 13):

Figure 13: Performance trade-off in the early stop mechanism (Top-1/2/3 accuracy versus time, annotated with the proportion of cases stopping at each point).

  - The early stop mechanism, controlled by an entropy threshold, allows BiAn to balance accuracy and latency.
  - A higher threshold (more cases stop early) leads to lower latency but also lower accuracy.
  - At zero threshold (all cases go to Stage 2), BiAn achieves the highest accuracy but also the highest latency.
  - The curves show a convex relationship: initial early stop adjustments yield substantial accuracy gains with minimal time investment.
  - The optimal trade-off point (at an entropy threshold of 0.75) reduces overall processing time by 70% with only a 0.5% accuracy loss compared to always running Stage 2. This allows BiAn to handle simpler cases quickly while reserving Stage 2 for more challenging ones.
6.4 Result Stability
The Rank of Ranks technique was introduced to mitigate LLM randomness.
- Impact of Rank of Ranks Rounds (Figure 14, left):

Figure 14: Impact of Rank of Ranks rounds. Figure 15: Benefits of fine-tuning on Steps 1 and 2.

  - While average accuracy remains comparable (around 95.5%) across single-run, 3-round, and 5-round executions, the Rank of Ranks approach (3 or 5 rounds) significantly reduces the variance in accuracy.
  - Increasing rounds from 3 to 5 offers diminishing returns, suggesting that 3 rounds is a good empirical choice for balancing stability and computational cost. The paper notes this is inspired by self-consistency techniques.
6.5 Microbenchmarks
- Monitor Alert Summary (Figure 15, right):
  - Fine-tuned models achieved an average accuracy of 98.7% for alert summarization.
  - This significantly outperforms the default Qwen model, highlighting the benefit of domain expertise for effective summarization even in "simpler" tasks.
- Single-device Anomaly Analysis (Figure 15, right):
  - Fine-tuned models achieved 98.6% accuracy for anomaly analysis, beating the default model by over 6%.
  - This confirms that fine-tuning smaller models for domain-specific knowledge in less complex tasks improves accuracy.
- Prompt Updating (Figure 16):

Figure 16: Accuracy with number of iterations (0 to 3) in prompt updating, comparing training and test accuracy.

  - Using five-fold cross-validation, the prompt updating algorithm showed improvements.
  - On the full dataset, training accuracy increased from 88.7% to 90.0%, and test accuracy from 81.7% to 85.9%, after three iterations. The marginal gain was attributed to a high baseline.
  - When excluding cases with already 100% accuracy, the improvements were more pronounced: training accuracy rose from 54.0% to 62.0%, and test accuracy from 36.7% to 50.0%, after three iterations.
  - Beyond three iterations, excessive prompt length started to hinder LLM comprehension, causing performance degradation and indicating a limitation of prompt-based training compared to parameter fine-tuning.
6.6 Training and Inference Costs
- Cost Efficiency (Figure 17):

Figure 17: Training and inference costs (CDF of cost in USD, comparing BiAn-Stage 1 with the full BiAn).

  - The average cost of fine-tuning the specialized LLMs was $121.63.
  - The inference cost per incident was very low: $0.17 for running only Pipeline 1, and $0.18 for running the full BiAn framework.
  - This demonstrates BiAn's cost-efficiency for deployment at scale, where even small per-incident gains can translate into significant savings in a large cloud environment. The additional $0.01 per incident for the full BiAn buys substantial accuracy gains.
6.7 Running with Other LLMs
To validate generalizability, BiAn was tested with other state-of-the-art LLMs.
Table 1: Performance of different models.
| Model Name | Size | Top-1 (%) | Top-2 (%) | Top-3 (%) |
|---|---|---|---|---|
| Qwen2.5 | 72B | 95.5 | 98.6 | 99.3 |
| Llama-3.1 | 405B | 95.7 | 98.7 | 99.3 |
| GPT-4o | - | 95.2 | 98.1 | 99.3 |
| Claude-3.5-Sonnet | - | 93.9 | 97.4 | 98.5 |
| Gemini-1.5-Pro | - | 93.2 | 97.9 | 98.7 |
- Consistency: BiAn consistently achieves high top-1 to top-3 accuracies across various leading LLMs (Qwen, Llama, GPT, Claude, Gemini).
- Robustness: This cross-model evaluation highlights the robustness of the proposed framework, indicating that its design principles are effective regardless of the specific underlying LLM used.
6.8 Real-World Deployment Experience (A/B Tests & Case Studies)
BiAn has been deployed in Alibaba Cloud for 10 months.
6.8.1 A/B Tests during Deployment
- Time-to-Root-Causing (TTR) Reduction (Figure 8a):

Figure 8: Comparison of time and satisfaction scores. (a) Average localization time for solo versus assisted investigation across risk levels; (b) operator satisfaction with BiAn's explanations from April to December.

  - BiAn reduced TTR by 20.5% on average.
  - For high-risk incidents, TTR was reduced by a notable 55.2%. This is because LLM inference time does not significantly inflate with incident risk, unlike human manual investigation time.
  - An RCACopilot variant, which only provides coarse root cause categories, resulted in larger TTR differences at higher risk levels, as operators still needed to manually validate and pinpoint devices.
- Explainability and Operator Satisfaction (Figure 8b):
  - Feedback showed explanations were predominantly rated 1 (somewhat helpful) or 2 (very helpful), with few 0s.
  - Even when the top-1 device was incorrect, explanations often provided useful context or intermediate reasoning, assisting operators.
  - Fluctuations in scores were attributed to rising operator expectations as BiAn iteratively improved.
6.8.2 Case Studies
- Incident 1 (Example from §2.2):
- Scenario: Five alerts on device B2, four on B3, other neighbors affected. Operators struggled to pinpoint the root cause for 22 minutes.
  - BiAn's Role: BiAn output B2 as the most suspicious device (four anomalies) and B3 second (three anomalies).
  - Result: Successfully detected anomalies and located the error device in 28 seconds, significantly reducing TTR compared to manual effort.
- Incident 2 (Topology Integration):
- Scenario: B1 connected to A1 and A2; A2 also to B2. Stage 1 assigned equal scores to A1 and A2 due to comparable anomalies.
  - BiAn's Role: Stage 2, with the topology pipeline, revealed that A2 was a parent node of B1.
  - Result: BiAn flagged A2 as the error device with a much higher score (root cause: line card failure in A2). TTR was 3 minutes. This highlights the value of topology information.
- Incident 3 (Timeline Integration):
- Scenario: A1, A2, and B1 showed anomalies (disconnection, network changes on A1; traffic drop on A2 and B1). All had priority-1 anomalies, making initial localization difficult.
  - BiAn's Role: Timeline data showed that A1's alerts occurred earlier than A2's.
  - Result: BiAn identified A1 as the error device due to disconnection. TTR was 1.5 minutes. This demonstrates the critical role of temporal data.
6.8.3 Operational Experience
- Selection of Candidate Devices:
- Initially, a fixed number of six candidate devices were selected by the upstream monitoring system based on associated alerts. This provided >98% coverage of the error device.
  - Active refinement involves a dynamic selection mechanism that considers the top suspicious devices from various monitoring tools, narrows them down to about six, and achieves >99% coverage.
- Iterative Design Process:
  - BiAn evolved through continuous testing and feedback.
  - A key challenge was data retention and format: the three-pipeline design was initially delayed because topology and timeline data were not retained in older logs, requiring accumulation of new data.
- Onboarding Operators:
  - BiAn was integrated into the NOC's on-call platform with a familiar user interface and output format (ranked devices, explanations).
  - This ensured a comfortable experience for new operators, requiring minimal training thanks to alignment with existing SOPs and close collaboration during development.
Conclusion & Personal Thoughts
Conclusion Summary
This paper introduces BiAn, an LLM-based framework for failure localization and root cause analysis in production-scale cloud networks. BiAn addresses the challenges of massive data volume, complex interdependencies, time pressure, and dynamic network environments through several key innovations: a hierarchical reasoning structure, multi-pipeline integration (incorporating SOPs, topology, and timeline data), continuous prompt updating, and various system optimizations (fine-tuning, early stop, parallel execution).
The successful 10-month deployment at Alibaba Cloud demonstrates BiAn's practical effectiveness. It significantly reduces the time-to-root-causing (20.5% overall, 55.2% for high-risk incidents) and improves localization accuracy (95.5% top-1 accuracy, 9.2% improvement over baseline). BiAn also proves cost-efficient and robust across different LLM backbones. The framework's explainability and alignment with operator workflows ensure high adoption and trust.
Limitations & Future Work (as identified by the authors)
- Multi-device Failures: BiAn currently assumes a single error device per incident. While rare, multiple devices can contribute equally to a failure. BiAn might rank them closely, allowing operators to identify both, but future work could explicitly model multi-device root causes.
- Reliance on Upstream Monitors: BiAn's effectiveness depends on the quality of data from upstream monitors. If monitors generate invalid or ambiguous data, BiAn may struggle to provide accurate results, much like human operators.
- Link-based Incidents: BiAn currently focuses on device-based incidents, which account for the majority of cases. It does not yet support link-based incidents, which follow different investigation processes. Future work involves extending BiAn to handle both.
- API Latency Fluctuation: External LLM API services can exhibit substantial latency fluctuations, impacting latency-sensitive tasks. Adopting locally deployed, dedicated resources could offer more stable performance.
- Explainability and Consistency Validation: While BiAn provides explanations, validating the consistency between LLM scores and explanations (e.g., ensuring a correct ranking isn't supported by flawed reasoning) is an area for future research.
- Prompt Length and Obedience: LLM performance can decline with excessive prompt length, and LLM agents sometimes exhibit inconsistent obedience to instructions. Balancing control and flexibility in prompt engineering remains a challenge.
- Adaptive Multi-agent Systems: The current structured, hierarchical framework has limited flexibility. Future work could explore more adaptive multi-agent systems in which agents dynamically plan tasks, retry failures, or invoke other agents to handle increasingly complex and unforeseen incidents.
Personal Insights & Critique
This paper presents a compelling and practically relevant application of LLMs in NetOps. The rigor of its design, including the hierarchical reasoning, multi-pipeline integration, and continuous prompt updating mechanisms, effectively addresses the inherent complexities of large-scale network troubleshooting. The emphasis on explainability and operator satisfaction is crucial for real-world adoption, acknowledging that LLMs are assistants, not replacements.
Strengths:
- Real-world Impact: The 10-month deployment in Alibaba Cloud and the quantitative improvements in TTR and accuracy provide strong evidence of practical value. This is a significant strength compared to many academic papers that remain in theoretical or lab settings.
- Comprehensive Approach: BiAn doesn't just throw an LLM at the problem; it integrates LLMs within a carefully designed framework that mirrors human SOPs and incorporates diverse data types (alerts, topology, timeline).
- Adaptability: The continuous prompt updating mechanism is a novel and essential feature for dynamic production environments, addressing a key limitation of static ML models in NetOps.
- Cost-Effectiveness: The low training and inference costs demonstrate that LLM-based solutions can be economically viable at scale.
Potential Areas for Further Exploration/Critique:
- Data Scarcity for Prompt Updating: While the prompt updating mechanism is ingenious, its reliance on a relatively small number of "difficult" cases (where accuracy isn't 100%) for knowledge extraction might limit the diversity of learned knowledge over time. The performance degradation as the LLM begins to ignore most of the augmented instructions due to prompt length is a fundamental LLM challenge that needs more robust solutions beyond summarization. Could more sophisticated knowledge distillation or sparse attention mechanisms be explored?
- Generalizability Beyond Alibaba Cloud: While the cross-model evaluation is good, the framework's reliance on Alibaba Cloud-specific SOPs and monitor alert formats means direct transfer to other organizations might require significant re-engineering of the initial LLM agents and knowledge base. This is a common challenge for domain-specific AIOps solutions. Sharing anonymized datasets or synthetic representations would greatly aid external research.
- Black-box Nature of LLM Reasoning: Despite CoT and explanations, LLMs are still largely black boxes. The limitation that correct rankings can come with wrong explanations, or that useful explanations can be hidden by incorrect rankings, is a deep problem. Future work could explore formal verification of reasoning steps or neuro-symbolic AI approaches to ensure logical soundness, not just plausible-sounding explanations.
- The "Six Candidate Devices" Limit: While 99% coverage is excellent, a 1% miss rate can still be significant in critical infrastructure. The reliance on an upstream system to select "the most suspicious devices" means BiAn inherits the limitations or biases of this upstream selection. Further research into making this initial selection more robust and LLM-informed could enhance the overall system.
- Comparison to Human Experts: While A/B tests show TTR reduction, a deeper qualitative analysis of how BiAn's reasoning compares to human expert reasoning (e.g., does it find non-obvious root causes that humans often miss?) could provide more profound insights.

Overall, BiAn is an impressive demonstration of how LLMs can be engineered into a practical and impactful AIOps solution for complex network management. It sets a strong foundation for future research in LLM-driven network automation and intelligent troubleshooting.