AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
TL;DR Summary
AIOPSLAB is introduced as a framework to evaluate AI agents for automating IT operations in complex cloud environments. It integrates fault injection, workload generation, and telemetry export, enabling the design and assessment of end-to-end AI solutions, and showcasing the potential and limitations of state-of-the-art LLM agents on complex operational tasks.
Abstract
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of this paper is "AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds". It focuses on developing a comprehensive framework for designing, developing, and evaluating AI agents, particularly Large Language Model (LLM)-based agents, for automating IT operations in cloud environments, leading to what the authors term AgentOps.
1.2. Authors
The authors of this paper are:
- Yinfang Chen
- Manish Shetty
- Gagan Somashekar
- Minghua Ma
- Yogesh Simmhan
- Jonathan Mace
- Chetan Bansal
- Rujia Wang
- Saravan Rajmohan

The affiliation superscripts attached to the author names are not resolved in the available metadata. Based on the nature of AIOps research and the affiliations typically associated with it (often a mix of academia and industry, particularly cloud providers and software companies), the authors most likely come from leading tech companies (e.g., corporate research labs) and universities.
1.3. Journal/Conference
This paper is published as a preprint on arXiv. arXiv (pronounced "archive") is an open-access repository for electronic preprints of scientific papers, primarily in the fields of mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It is a highly influential platform for rapid dissemination of research findings before formal peer review and publication in academic journals or conference proceedings. Its reputation is significant for quickly sharing cutting-edge research.
1.4. Publication Year
The paper was published on January 12, 2025 (arXiv timestamp 2025-01-12), making it a very recent preprint.
1.5. Abstract
The paper addresses the growing complexity of IT operations (ITOps) in cloud environments, which AI for IT Operations (AIOps) aims to automate. While traditional DevOps tools and AIOps algorithms handle isolated tasks, recent advancements in Large Language Models (LLMs) and AI agents are enabling end-to-end and multi-task automation, envisioning self-healing cloud systems—a paradigm termed AgentOps.
To realize this AgentOps vision, the authors propose AIOPSLAB, a comprehensive framework designed to guide the design, development, and evaluation of these AI agents. AIOPSLAB can deploy microservice cloud environments, inject faults, generate workloads, and export telemetry data. Crucially, it orchestrates these components and provides interfaces for agent interaction and evaluation.
The paper discusses the key requirements for such a holistic framework and demonstrates AIOPSLAB's utility by evaluating state-of-the-art LLM agents within the benchmark it creates. This evaluation provides insights into the capabilities and limitations of LLM agents in handling complex operational tasks in cloud environments.
1.6. Original Source Link
The paper is available as a preprint:
- Original Source Link: https://arxiv.org/abs/2501.06706v1
- PDF Link: https://arxiv.org/pdf/2501.06706.pdf

It is currently a preprint, meaning it has not yet undergone formal peer review and publication in a journal or conference.
2. Executive Summary
2.1. Background & Motivation
The rapid adoption of hyper-scale, cloud-based systems, characterized by distributed architectures like microservices and serverless computing, has introduced unprecedented operational complexity. Managing incidents in these environments is a significant challenge, with potential outages leading to massive financial losses (e.g., $100 million per hour for an Amazon outage).
To address these challenges, the field of AIOps (Artificial Intelligence for IT Operations) emerged, aiming to automate complex operational tasks and eventually achieve autonomous self-healing clouds. While AIOps has existed for over a decade, traditional DevOps tools and AIOps algorithms typically focus on isolated tasks, such as fault localization or root cause analysis (RCA). They lack the capability for end-to-end and multi-task automation across the entire incident lifecycle.
Recent advancements in Large Language Models (LLMs) and AI agents have begun to bridge this gap, allowing AI agents to interact dynamically with environments and manage operational tasks autonomously. This evolution points towards a new paradigm called AgentOps, where agents can make real-time decisions and execute end-to-end actions to ensure system reliability.
A critical barrier to realizing AgentOps is the lack of high-quality, comprehensive benchmarks that can simulate diverse, realistic cloud scenarios and allow for the interactive evaluation of AI agents. Existing benchmarks often rely on static datasets or focus on the 'Dev' side of DevOps, failing to capture the dynamic, unpredictable, and evolving nature of real-world cloud operations. Furthermore, many current AgentOps efforts use proprietary services and datasets, hindering broader research and development. This paper aims to fill this gap by proposing a holistic framework for AgentOps evaluation.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of AIOps and AI agents for cloud operations:
- Requirements and Challenges for a Holistic Framework: The authors identify and discuss the key requirements and inherent challenges in building a comprehensive framework capable of supporting the design, development, and evaluation of autonomous AIOps agents. This lays a foundation for future research in the area.
- Introduction of the AIOPSLAB Framework: The paper develops and introduces AIOPSLAB, an innovative, holistic framework designed to automatically manage the entire end-to-end evaluation process for AIOps solutions. Its capabilities include:
  - deploying microservice cloud environments;
  - injecting diverse faults;
  - generating realistic workloads;
  - exporting comprehensive telemetry data (logs, metrics, traces);
  - crucially, orchestrating these components and providing a unified Agent-Cloud Interface (ACI) for agents to interact with and be evaluated in the cloud environment.
- Construction of a Benchmark Suite: Leveraging the AIOPSLAB framework, the authors construct a benchmark suite comprising 48 distinct problems. These problems cover various AIOps tasks (detection, localization, root cause analysis, and mitigation) within an interactive environment, creating a realistic testing ground for LLM-based agents.
- Evaluation of State-of-the-Art LLM-based Agents: The paper demonstrates AIOPSLAB's utility by evaluating four state-of-the-art LLM-based agents against its benchmark suite. This evaluation provides crucial insights into the current capabilities and limitations of these agents when tackling complex operational tasks in dynamic cloud environments.
- Detailed Analysis of Agent Performance: Through the evaluations, the paper offers a detailed analysis of agent performance, highlighting specific failure modes, challenges, and opportunities for improvement. Key findings include:
  - LLM agents show promise but struggle with complex tasks like RCA and mitigation.
  - Agents often waste steps on unnecessary actions or exhibit invalid API usage.
  - Context window limitations due to verbose telemetry data can hinder performance.
  - Self-repair mechanisms can quickly saturate, suggesting a need for better task decomposition and intermediate feedback.
  - Some agents (e.g., GPT-3.5-w-SHELL) repeatedly make the same errors, while others (REACT) can self-correct.
  - False positive detection remains an issue for some agents.
- Public Availability: The authors commit to making AIOPSLAB publicly available, which will significantly benefit the research community by providing a standardized and interactive platform for AgentOps research and development.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with several fundamental concepts in cloud computing, IT operations, and artificial intelligence.
-
Cloud Computing and Microservices:
- Cloud Computing: A model for delivering computing services—including servers, storage, databases, networking, software, analytics, and intelligence—over the Internet ("the cloud"). It offers faster innovation, flexible resources, and economies of scale. Examples include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
- Microservices: An architectural style that structures an application as a collection of loosely coupled, independently deployable, small services. Each service runs in its own process and communicates with others using lightweight mechanisms (e.g., HTTP APIs). This approach enhances scalability, agility, and resilience but increases operational complexity due to distribution.
- Serverless Computing: A cloud execution model where the cloud provider dynamically manages the allocation and provisioning of servers. Developers write and deploy code (functions) without managing underlying infrastructure. It's an evolution often intertwined with microservices, further abstracting infrastructure.
- Kubernetes: An open-source container-orchestration system for automating deployment, scaling, and management of containerized applications. It groups containers into logical units for easy management and discovery. Many microservice applications are deployed on Kubernetes.
-
IT Operations (ITOps) and DevOps:
- IT Operations (ITOps): The processes and services managed by an IT department to maintain the smooth functioning of an organization's technology infrastructure, including deployment, monitoring, incident management, and maintenance.
- DevOps (Development and Operations): A set of practices that combines software development (Dev) and IT operations (Ops). It aims to shorten the systems development life cycle and provide continuous delivery with high software quality. Key principles include automation, continuous integration/continuous delivery (CI/CD), and collaboration.
-
AIOps and AgentOps:
- AIOps (AI for IT Operations): The application of Artificial Intelligence (AI) and Machine Learning (ML) to automate IT operations. Its goal is to analyze large volumes of IT data (logs, metrics, traces) to detect anomalies, diagnose problems, and predict issues proactively, reducing human intervention.
- AgentOps (Agent for Operations): A new paradigm proposed in this paper, representing an evolution of AIOps where autonomous AI agents (especially LLM-based ones) manage end-to-end operational tasks across the entire IT stack, leading to self-healing cloud systems. Unlike traditional AIOps, which might focus on specific tasks, AgentOps envisions agents that perform multi-task, holistic management.
- Self-healing cloud systems: Cloud environments that can automatically detect, diagnose, and resolve issues (e.g., faults, performance degradation) without human intervention, ensuring continuous availability and optimal performance.
-
Artificial Intelligence Concepts:
- Large Language Models (LLMs): A class of AI models trained on massive amounts of text data, enabling them to understand, generate, and reason with human language. They can perform a wide range of natural language processing tasks, including summarization, translation, question answering, and code generation. LLMs form the core of the AI agents evaluated in this paper.
- AI Agents: LLMs augmented with external tools and the ability to interact with an environment. They can perceive their environment, reason about it, take actions using tools, and observe the outcomes, learning to achieve goals autonomously. For AIOps, this means interacting with cloud environments via APIs and command-line interfaces, and receiving telemetry feedback.
- Telemetry Data: Data collected from IT systems to monitor their performance, health, and behavior. It typically includes:
  - Logs: Time-stamped records of events that occur within a system or application, useful for debugging and auditing.
  - Metrics: Numerical measurements collected over time (e.g., CPU utilization, memory usage, network latency, request rates), often presented as time series data.
  - Traces: End-to-end records of requests as they propagate through a distributed system, showing the sequence of operations and dependencies between services. Essential for understanding request flow in microservices.
- Fault Injection: A technique used to test the resilience of systems by deliberately introducing errors or failures. This helps identify vulnerabilities and verify error handling mechanisms.
- Root Cause Analysis (RCA): A systematic process for identifying the underlying causes of problems or incidents in a system, rather than just addressing the symptoms.
3.2. Previous Works
The paper frames its contribution in the context of existing efforts in AIOps and AI agent development, highlighting their limitations.
- Traditional AIOps Tools and Algorithms: Prior AIOps approaches have primarily focused on solving isolated operational tasks.
  - Anomaly Detection: Identifying unusual patterns in data (e.g., MKsMC by Çetin and Tasgin, 2020).
  - Fault Localization: Pinpointing the specific component or service responsible for a fault (e.g., RMLAD by Wang et al., 2020; PDiagnose by Hou et al., 2021).
  - These methods often excel at their specific tasks but lack the ability to manage the entire incident lifecycle or interact dynamically with the cloud environment.
- Existing Benchmarks for AI Agents:
  - For the 'Dev' side of DevOps (software development), several benchmarks exist that leverage AI agents:
    - WebArena (Zhou et al., 2023): A realistic web environment for building autonomous agents, testing their ability to interact with web applications.
    - R2E (Jain et al., 2024b): Turns any GitHub repository into a programming agent environment for evaluating code-related tasks.
    - HumanEval (Chen et al., 2021): A benchmark for evaluating the code generation capabilities of LLMs.
    - LiveCodeBench (Jain et al., 2024a): A holistic and contamination-free code evaluation for LLMs.
    - SWE-bench (Jimenez et al., 2024): Evaluates LLMs on real-world GitHub issues.
  - These benchmarks primarily focus on software development tasks and do not adequately simulate the complexities of IT operations in dynamic cloud environments.
- AIOps Benchmarks:
  - Existing AIOps benchmarks often rely on static datasets:
    - System metrics (Han et al., 2022; Jacob et al., 2020): Typically time series data, which allows for offline analysis but not interactive agent evaluation.
    - Fixed question-answer format (Liu et al., 2023): Lacks the dynamic interaction required for AgentOps.
  - Many recent AgentOps efforts (e.g., RCACopilot by Chen et al., 2024; RCAgent by Wang et al., 2023; MonitorAssistant by Yu et al., 2024a; Xpert by Jiang et al., 2024) use proprietary services and datasets, hindering public access and comparative evaluation.
  - Crucially, these benchmarks often focus only on isolated aspects of the incident lifecycle, providing no cohesive framework to evaluate AIOps agents comprehensively across multiple tasks or decision-making for chaining algorithms.
3.3. Technological Evolution
The evolution of IT operations has moved through several stages:
- Manual Operations: Initial stage, highly human-intensive, slow, error-prone, and not scalable for complex systems.
- Scripted Automation/Traditional DevOps: Introduction of scripts and automation tools to handle repetitive tasks and streamline DevOps pipelines. This improved efficiency but still required human oversight and custom scripting for each scenario.
- Traditional AIOps Algorithms: Application of ML algorithms for specific tasks like anomaly detection, fault localization, and log analysis. These algorithms provided insights and partial automation but typically operated on passive data or addressed isolated problems, without integrated interaction with the operational environment.
- LLM-based AIOps (Emerging): The rise of LLMs introduced the potential for more intelligent, context-aware analysis and human-like interaction. Initial applications included incident summarization, generating recommendations, and basic Q&A over IT data.
- AI Agents for AIOps / AgentOps (Current Frontier): The integration of LLMs with external tools and decision-making loops, allowing them to act as autonomous agents. These agents can perceive the cloud environment, reason about problems, execute actions (e.g., using APIs or shell commands), and learn from feedback, managing the entire incident lifecycle. This is where AIOPSLAB positions itself, pushing the boundaries towards self-healing cloud systems.
3.4. Differentiation Analysis
AIOPSLAB distinguishes itself from previous works by addressing several critical limitations:

- Holistic and End-to-End Evaluation: Unlike traditional AIOps benchmarks that focus on isolated tasks (e.g., only detection or localization), AIOPSLAB provides a unified framework for evaluating agents across the entire incident management lifecycle, encompassing detection, localization, Root Cause Analysis (RCA), and mitigation. This allows for a more comprehensive assessment of AgentOps capabilities.
- Dynamic and Interactive Environment: A key innovation is AIOPSLAB's ability to simulate dynamic cloud environments. Most prior AIOps benchmarks relied on static datasets, which cannot capture the real-time, unpredictable nature of operational incidents or allow agents to interact with and modify the environment. AIOPSLAB facilitates agent-cloud interaction via its Agent-Cloud Interface (ACI), enabling dynamic decision-making and feedback loops.
- Focus on LLM-based AI Agents: The framework is specifically designed to evaluate next-generation AIOps agents powered by LLMs, which are capable of complex reasoning and tool use. This is a critical departure from evaluating traditional AIOps algorithms, which often lack interactive capabilities.
- Realistic Problem Scenarios with Functional Faults: AIOPSLAB goes beyond simple symptomatic faults (e.g., crash failures) by incorporating functional faults (e.g., misconfigurations, software bugs). These fine-grained root causes pose a greater challenge to AI agents, requiring deeper diagnostic and mitigation abilities and thus leading to more realistic evaluation scenarios.
- Open and Extensible Framework: By committing to public availability and designing AIOPSLAB with modular components (fault library, workload generators, telemetry observer), the authors provide an extensible platform. This contrasts with many LLM-based cloud management efforts that rely on proprietary services and closed datasets, making replication and comparative research difficult.
- Unified Interface (ACI): The Agent-Cloud Interface (ACI) abstracts the complexity of the cloud environment, providing a standardized set of APIs for agents to interact with. This simplifies agent design and allows for consistent evaluation across different agent architectures.

In essence, AIOPSLAB shifts the evaluation paradigm from passive analysis on static data to active, interactive problem-solving within a simulated, dynamic cloud environment, specifically tailored for the burgeoning field of LLM-driven AI agents in IT operations.
4. Methodology
4.1. Principles
The core idea behind AIOPSLAB is to provide a holistic and interactive environment for the design, development, and evaluation of autonomous AIOps agents, particularly those powered by LLMs. The theoretical basis and intuition are rooted in the need to move beyond isolated AIOps tasks and static datasets towards a dynamic AgentOps paradigm where AI agents can autonomously manage the entire incident lifecycle in complex, real-world cloud environments. The framework aims to bridge the gap between LLM capabilities and practical IT operations by offering a structured way for agents to perceive, reason, act, and learn within a controlled, yet realistic, cloud simulation.
Key principles include:
- Holism: Evaluating agents across the complete incident management lifecycle (detection, localization, RCA, mitigation).
- Interactivity: Enabling agents to dynamically interact with the cloud environment, take actions, and receive real-time feedback.
- Realism: Simulating complex microservice architectures, injecting diverse symptomatic and functional faults, and generating realistic workloads.
- Standardization: Providing a unified Agent-Cloud Interface (ACI) to simplify agent development and ensure consistent evaluation.
- Extensibility: Allowing users to easily define new problems, integrate new services, and add different types of faults.
- Observability: Collecting comprehensive telemetry data (logs, metrics, traces) to facilitate agent reasoning and performance analysis.
4.2. Core Methodology In-depth (Layer by Layer)
AIOPSLAB is designed as a modular framework with an Orchestrator at its core, coordinating interactions between various components. The overall architecture is depicted in Figure 2 of the original paper.
The following figure (Figure 2 from the original paper) provides an overview of AIOPSLAB's architecture:
Figure 2. Overview of AIOPsLAB. The Orchestrator coordinates interactions between various system components and serves as the Agent-Cloud-Interface (ACI). Agents engage with the Orchestrator to solve tasks, receiving a problem description, instructions, and relevant APIs. The Orchestrator generates diverse problems using the Workload and Fault Generators, injecting these into applications it can deploy. The deployed service has observability, providing telemetry such as metrics, traces, and logs. Agents act via the Orchestrator, which executes them and updates the service's state. The Orchestrator evaluates the final solution using predefined metrics for the task.
Here's a breakdown of its components and workflow:
4.2.1. Problem Definition
To support a wide range of evaluation scenarios, AIOPSLAB formalizes an AIOps problem as a tuple:

$ \langle T, C, S \rangle $

Where:
- $T$: Represents a task. This defines the specific AIOps operation to be performed.
- $C$: Represents a context. This provides the environment and information related to the problem.
- $S$: Represents the expected solution (oracle). This is used to evaluate the agent's performance.

The task is categorized into four types, reflecting the stages of the incident management lifecycle, with increasing complexity:
- Detection: Identifying the presence of unusual behavior or faults.
- Localization: Pinpointing the exact source of a fault (e.g., a specific microservice or pod).
- Root Cause Analysis (RCA): Determining the underlying cause of the fault (e.g., misconfiguration, software bug).
- Mitigation: Applying effective solutions to recover the environment from the fault.

Each task type has associated success criteria and evaluation metrics (e.g., Time-to-Detect (TTD) for detection).

The context is further formalized as:

$ C = \langle E, I \rangle $

Where:
- $E$: The operational environment in which the problem occurs. This includes the cloud service, the fault model, and the workload model used to generate the problem; this information is not directly shared with the agent.
- $I$: The problem information shared directly with the agent. This comprises service descriptions, task descriptions, documentation about available APIs, and indirect information (logs, metrics, traces) that the agent can query at runtime.

The solution $S$ is the expected outcome, typically problem- and task-specific. For mitigation tasks, AIOPSLAB evaluates the overall system state (e.g., all services running) rather than just the targeted resource, accounting for potential side effects of mitigation.
Example 2.1: Problem Definition The paper provides an example of defining a localization problem:
Example 2.1. Consider the problem of localizing a Kubernetes target port misconfiguration in a social network application. AIOPSLAB makes it easy to define this problem in just a few lines by extending the localization task interface:

```python
class K8sTargetPortMisconfigLocalization(LocalizationTask):
    def __init__(self):
        self.app = SocialNetwork()
        self.ans = "user-service"            # ground-truth faulty service

    def start_workload(self):
        wrk = Wrk(rate=100, duration=10)     # 100 requests/s for 10 s
        wrk.start_workload(url=self.app.frontend_url)

    def inject_fault(self):
        inj = FaultInjector(self.app)        # injector initialization is garbled in the extracted text
        inj.inject([self.ans], "misconfig_k8s")

    def eval(self, soln, trace, duration):
        res = {}
        res["TTL"] = duration                # time to complete the localization task
        res["success"] = is_exact_match(soln, self.ans)  # comparison helper; name garbled in the extracted text
        return res
```
Explanation of the example:
- __init__(self): Initializes the problem. self.app = SocialNetwork() sets up the SocialNetwork microservice application as the target environment, and self.ans = "user-service" records the ground-truth solution, indicating that the fault is expected to be in the "user-service".
- start_workload(self): Defines how to generate traffic/load. Wrk(rate=100, duration=10) initializes the wrk tool (a common HTTP benchmarking tool) to generate a workload at a rate of 100 requests per second for 10 seconds, and wrk.start_workload(url=self.app.frontend_url) directs this workload at the frontend URL of the SocialNetwork application.
- inject_fault(self): Defines how to inject the fault. A fault injector is initialized (the exact injector class is garbled in the extracted text) and inj.inject([self.ans], "misconfig_k8s") injects a Kubernetes misconfiguration into the service identified by self.ans ("user-service").
- eval(self, soln, trace, duration): Defines how to evaluate the agent's proposed solution. It records the time taken to complete the task and checks whether the agent's solution (soln) matches the expected answer (self.ans), using an internal helper function for the comparison, before returning the results.

In this example, the task is fault localization, the expected solution is "user-service", and the context includes the SocialNetwork application, a misconfig_k8s fault, and a standard wrk workload.
-
4.2.2. Orchestrator
The Orchestrator is the central component of AIOPSLAB, enforcing separation of concerns between the AI agent and the cloud service. It provides robust interfaces for integration and extension.
4.2.2.1. Agent-Cloud Interface (ACI)
The ACI is a critical part of the Orchestrator, defining how an AI agent interacts with the cloud environment. It specifies:
-
The set of valid actions available to the agent.
-
How the service's state (observations) is conveyed back to the agent after its actions.
The ACI abstracts the complexity of the cloud, offering a concise, documented list of APIs. Examples of default APIs provided by AIOPSLAB include:
- get_logs(ns: str) -> str: Fetches logs from a specified Kubernetes namespace (ns).
- get_metrics(ns: str) -> str: Fetches metrics from a specified Kubernetes namespace (ns).
- get_traces(ns: str, duration: int = 5) -> str: Fetches trace data for a specified namespace and duration.
- exec_shell(command: str) -> str: Executes shell commands (subject to security policy filters).

Upon problem initialization, the Orchestrator automatically extracts documentation from these APIs and provides it as part of the context to the agent. Agents can specify various actions (e.g., scaling, redeploying, patching) via the Orchestrator's privileged access. The Orchestrator then provides high-quality feedback (outputs, error messages, tracebacks) on the service's state.
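The paper does not show how this documentation extraction is implemented. As a minimal sketch, assuming the APIs are methods on an actions class like TaskActions (shown in Example 2.2 below) and that their docstrings serve as the documentation, it could be done with Python's standard inspect module:

```python
import inspect

def collect_api_docs(actions_cls) -> str:
    """Build an API reference string from the public callables of an actions class,
    using each function's signature and docstring."""
    docs = []
    for name, fn in inspect.getmembers(actions_cls, callable):
        if name.startswith("_"):
            continue  # skip private/internal helpers
        sig = inspect.signature(fn)
        doc = inspect.getdoc(fn) or "No documentation available."
        docs.append(f"{name}{sig}\n{doc}")
    return "\n\n".join(docs)

# Hypothetical usage: the resulting string could be handed to the agent as part of its context.
# apis = collect_api_docs(TaskActions)
```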
Example 2.2: ACI Definition (get_traces)
The paper illustrates how an ACI API is defined:
```python
from datetime import datetime, timedelta

class TaskActions:
    def get_traces(ns: str, duration: int = 5) -> str:
        """Captures service trace data from Jaeger.

        Args:
            ns (str): The K8S namespace.
            duration (int): Duration to collect traces.

        Returns:
            str: Path to the directory where the traces are saved.
        """
        trace_api = TraceAPI(ns)
        end_t = datetime.now()
        start_t = end_t - timedelta(minutes=duration)  # duration interpreted here as minutes
        traces = trace_api.extract_traces(start_t, end_t)
        return trace_api.save_traces(traces)
```
Explanation of the example:
- class TaskActions: Defines a class for available actions.
- def get_traces(ns: str, duration: int = 5) -> str: Declares the get_traces function, taking a Kubernetes namespace (ns, as a string) and a duration (integer, default 5) as input, and returning a string (the path to the saved traces).
- The docstring documents the API's purpose, arguments, and return value ("Captures service trace data from Jaeger"); this is the text the Orchestrator exposes to agents.
- trace_api = TraceAPI(ns): Initializes a TraceAPI object for the specified namespace.
- end_t = datetime.now(): Gets the current timestamp.
- start_t = end_t - timedelta(minutes=duration): Calculates the start timestamp for trace collection based on the duration.
- traces = trace_api.extract_traces(start_t, end_t): Uses the TraceAPI to extract traces within the defined time window.
- return trace_api.save_traces(traces): Saves the extracted traces and returns the path to the saved files.
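As a quick usage sketch, an agent or test harness could call this API directly; the namespace value below is a hypothetical placeholder, not one defined by the paper:

```python
# Hypothetical usage of the ACI API defined above: collect 10 minutes of traces
# for a namespace named "social-network" and print where they were saved.
trace_dir = TaskActions.get_traces(ns="social-network", duration=10)
print(f"Traces saved under: {trace_dir}")
```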
4.2.2.2. Session Interface
The Orchestrator manages the lifecycle of the agent and the service through a session-based system. A Session is created for each instance of an agent solving a problem. Agents must implement a get_action method with the signature async def get_action(state: str) -> str, which takes the service's state as input and returns the agent's next action.
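Conceptually, the session pairs the agent's get_action with the environment's execution of that action. The sketch below is a simplified illustration of such a loop, not AIOPSLAB's actual implementation; names like initial_state, execute_action, is_solved, and evaluate are assumptions for illustration:

```python
import asyncio

async def run_session(agent, env, max_steps: int = 10):
    """Minimal sketch of a session: poll the agent for actions until the
    problem is solved or the step budget is exhausted."""
    state = env.initial_state()                 # problem description, instructions, APIs
    for _ in range(max_steps):
        action = await agent.get_action(state)  # agent decides its next action
        state = env.execute_action(action)      # environment executes it and returns a new observation
        if env.is_solved():
            break
    return env.evaluate()                       # compare the final state/solution against the oracle
```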
Example 2.3: Agent Onboarding The paper illustrates how an agent can be onboarded:
```python
import asyncio
from aiopslab import Orchestrator

class Agent:
    def __init__(self, prob, instructs, apis):
        self.prompt = self.get_prompt(prob, instructs, apis)
        self.llm = GPT4()

    async def get_action(self, state: str) -> str:
        return self.llm.generate(self.prompt + state)

# initialize the orchestrator
orch = Orchestrator()
pid = "misconfig_app_hotel_res-mitigation-1"
prob_desc, instructs, apis = orch.init_problem(pid)

# register and evaluate the agent
agent = Agent(prob_desc, instructs, apis)
orch.register_agent(agent, name="myAgent")
asyncio.run(orch.start_problem(max_steps=10))
```
Explanation of the example:
- from aiopslab import Orchestrator: Imports the Orchestrator class.
- class Agent: Defines a generic agent class.
  - __init__(self, prob, instructs, apis): Initializes the agent. self.prompt = self.get_prompt(prob, instructs, apis) constructs the initial prompt for the LLM from the problem description, instructions, and available APIs, and self.llm = GPT4() instantiates an LLM (e.g., GPT-4).
  - async def get_action(self, state: str) -> str: The crucial method where the agent decides its next action. return self.llm.generate(self.prompt + state) generates the next action by feeding the LLM the current prompt (including problem context) and the current state of the environment.
- orch = Orchestrator(): Initializes the AIOPSLAB Orchestrator.
- pid = "misconfig_app_hotel_res-mitigation-1": Defines a problem ID (a mitigation problem on the HotelReservation application).
- prob_desc, instructs, apis = orch.init_problem(pid): Initializes a specific problem instance, retrieving its description, instructions, and available APIs.
- agent = Agent(prob_desc, instructs, apis): Creates an instance of the custom agent with the problem context.
- orch.register_agent(agent, name="myAgent"): Registers the agent with the Orchestrator.
- asyncio.run(orch.start_problem(max_steps=10)): Starts the evaluation of the problem, allowing the agent to take up to 10 steps. The Orchestrator polls the agent's get_action method for its next action.
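Because the only contract is the get_action signature, more sophisticated agents can be onboarded the same way. The sketch below shows how a ReAct-style agent might be wrapped behind the same interface, maintaining a scratchpad of thoughts, actions, and observations; the prompt format and the parse_action helper are assumptions for illustration, not the paper's implementation:

```python
class ReActStyleAgent:
    """Hedged sketch: interleaves reasoning ("Thought:") with actions,
    appending each observation to a running scratchpad."""

    def __init__(self, prob, instructs, apis):
        self.system_prompt = f"{prob}\n{instructs}\nAvailable APIs:\n{apis}"
        self.scratchpad = ""              # accumulated Thought/Action/Observation history
        self.llm = GPT4()                 # same backend as the basic agent above

    async def get_action(self, state: str) -> str:
        # Record the latest observation from the environment.
        self.scratchpad += f"\nObservation: {state}"
        # Ask the LLM for the next Thought + Action given the full history.
        completion = self.llm.generate(self.system_prompt + self.scratchpad + "\nThought:")
        self.scratchpad += f"\nThought:{completion}"
        # Return only the action part (e.g., an API call) to the Orchestrator;
        # parse_action is a hypothetical helper that extracts the "Action: ..." text.
        return parse_action(completion)
```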
4.2.2.3. Other Interfaces
- Problem Initializers: The Orchestrator deploys cloud services for each problem using infrastructure-as-code tools like Helm and the Kubernetes APIs. It then interfaces with a Workload Generator and a Fault Generator.
  - Workload Generator: Currently uses wrk2 to simulate realistic traffic with various policies and industry workload replays.
  - Fault Generator: Uses a custom fault library integrated with ChaosMesh to inject diverse, fine-grained, and parametric faults across system layers (application, virtualization) that model underlying root causes.
- Problem Evaluators: The Orchestrator compares the agent's solutions against predefined success criteria and metrics for each task (e.g., TTD for detection; number of steps/tokens for LLM agents). It also supports optional qualitative evaluation using LLMs-as-Judges (e.g., Zheng et al., 2024) to assess agent reasoning. All agent trajectories and system states are logged for detailed analysis.
4.2.3. Cloud Services
AIOPSLAB utilizes live microservice applications as its cloud environments (② in Figure 2). It is integrated with DeathStarBench (Gan et al., 2019), specifically using:
- SocialNetwork: A complex application with 28 microservices (including Memcached, MongoDB, Redis) implementing social networking features.
- HotelReservation: An application implemented in Go with gRPC, supporting hotel recommendation and reservation services.
4.2.4. Task-oriented Fault Library
The fault library is central to creating realistic and challenging problems for AIOps agents.
4.2.4.1. Task Taxonomy
The paper presents a task-level taxonomy (Table 1) categorizing AIOps tasks by increasing complexity:
The following are the results from Table 1 of the original paper:
| Level | Task (# sub tasks) | Evaluation Focus |
| 1 | Detection (1) | Can the approach accurately detect anomalies or deviations? |
| 2 | Localization (1) | Can the approach pinpoint a fault's exact source (e.g., microservice)? |
| 3 | Root Cause Analysis (RCA) (2) | Can the approach determine the underlying cause of the fault? |
| 4 | Mitigation (1) | Can the approach give effective solutions to recover the environment? |
- Level 1: Detection: Simplest, focused on identifying unusual behavior (e.g., a malfunctioning Kubernetes pod).
- Level 2: Localization: Identifying the exact source of a fault (e.g., a specific microservice).
- Level 3: Root Cause Analysis (RCA): More complex, requiring agents to determine the underlying cause. This level has two sub-tasks: identifying the affected system layer and the fault type.
- Level 4: Mitigation: Most complex, requiring agents to apply corrective actions to restore the system.
4.2.4.2. Symptomatic Faults
Symptomatic faults (e.g., performance degradation, crash failures) manifest as observable symptoms like increased latency or service outages. They are used to construct Level 1 (detection) and Level 2 (localization) tasks. These faults indicate a problem exists but don't inherently reveal deep root causes. AIOPSLAB integrates ChaosMesh (ChaosMesh Authors, 2022) for injecting these.
The following figure (Figure 3 from the original paper) categorizes faults:
Figure 3. Fault categories to instantiate problems in AIOPSLAB.
4.2.4.3. Functional Faults
Most traditional fault injection tools focus on system symptoms. Functional faults, however, model underlying, fine-grained root causes like misconfigurations or software bugs. These faults are crucial for Level 3 (RCA) and Level 4 (mitigation) tasks, as they challenge agents to not only detect and localize but also diagnose the specific cause and apply correct mitigation strategies.
Example: Revoke Authentication Fault (Figure 4)
The paper illustrates a functional fault: revoking admin authentication for a MongoDB database used by a geographic microservice (Mongodb-geo). This causes errors in the Geo service that relies on it.
The following figure (Figure 4 from the original paper) shows an example of a revoke authentication fault:
Figure 4. Revoke authentication fault example. Injection happens at Mongodb-geo service, while Geo service will be abnormal and generate error logs.
Example 2.4: Application-level Fault Injector
The structure for injecting an application-level revoke authentication fault is shown:
```python
from aiopslab.generators.fault.base import FaultInjector
from aiopslab.service.apps.hoteles import HotelReservation

class ApplicationFaultInjector(FaultInjector):
    def inject_revoke_auth(self, microservices: list[str]):
        """Revoke MongoDB admin privileges."""
        ...
```
Explanation of the example:
- from aiopslab.generators.fault.base import FaultInjector: Imports the base class for fault injectors.
- from aiopslab.service.apps.hoteles import HotelReservation: Imports the HotelReservation application service definition.
- class ApplicationFaultInjector(FaultInjector): Defines a custom ApplicationFaultInjector inheriting from the base class.
- def inject_revoke_auth(self, microservices: list[str]): A method to inject the revoke-authentication fault, targeting a list of microservices.
- The docstring ("Revoke MongoDB admin privileges.") documents the fault's effect.
Users can define problems by selecting existing faults, specifying target services, or even creating custom faults. AIOPSLAB provides injection functions and corresponding mitigation mechanisms for recovery.
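Since each injection function has a corresponding mitigation mechanism, a recovery hook for the revoke-authentication fault might look roughly like the sketch below; the method name and the restore step are assumptions for illustration, not the framework's documented API:

```python
class ApplicationFaultInjector(FaultInjector):
    def inject_revoke_auth(self, microservices: list[str]):
        """Revoke MongoDB admin privileges for the given microservices."""
        ...

    def recover_revoke_auth(self, microservices: list[str]):
        """Hypothetical counterpart: re-grant the admin role so dependent
        services (e.g., Geo) can reconnect to MongoDB."""
        for svc in microservices:
            # Placeholder: restore credentials/roles for the MongoDB instance
            # backing `svc`, e.g., by re-running the user-creation script.
            self.restore_admin_role(svc)   # hypothetical helper
```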
4.2.5. Observability
AIOPSLAB includes an extensible observability layer to collect comprehensive telemetry data (③ in Figure 2):

- Traces: From Jaeger (Jaeger Authors, 2024), detailing end-to-end request paths in distributed systems.
- Logs: Application logs retrieved by Kubectl, or formatted and recorded by Filebeat (Elasticsearch, 2024b) and Logstash (Elasticsearch, 2024a).
- System Metrics: Monitored by Prometheus (Prometheus Authors, 2024).

This data is collected during agent interaction and can also be exported offline for evaluating traditional AIOps algorithms. The framework is designed to capture other information such as the codebase, configuration, and cluster details, and to expose low-level system information (e.g., syscall logs) via its interface.
5. Experimental Setup
5.1. Datasets
The experimental evaluation utilizes a benchmark suite constructed using AIOPSLAB, consisting of 48 problems. These problems are instantiated by injecting various faults into two microservice applications from DeathStarBench (Gan et al., 2019):

- HotelReservation: An application for hotel booking.
- SocialNetwork: A complex social media application.

The choice of these microservice applications provides a realistic, distributed cloud environment, critical for evaluating AIOps agents.
The problems are generated using the faults listed in Table 2. These faults cover both symptomatic and functional types, and are designed to challenge agents across all four task levels (Detection, Localization, RCA, Mitigation).
The following are the results from Table 2 of the original paper:
| No. | Fault | Application | Task Level | Category | Ext. | #Problems | Description |
| 1 | AuthenticationMissing | HotelReservation | 1,2,3,4 | Functional, Virtualization | ① | 4 | Missing authentication credentials cause access denial to MongoDB. |
| 2 | TargetPortMisconfig | SocialNetwork | 1,2,3,4 | Functional, Virtualization | ● | 12 | The service cannot connect to the specified port due to misconfiguration. |
| 3 | RevokeAuth | HotelReservation | 1,2,3,4 | Functional, Application | ① | 8 | Revoked authentication causes database connection failure. |
| 4 | UserUnregistered | HotelReservation | 1,2,3,4 | Functional, Application | ① | 8 | The database service has access failures after the user was unregistered. |
| 5 | BuggyAppImage | HotelReservation | 1,2,3,4 | Functional, Application | ○ | 4 | A connection code bug in the application image causes access issues. |
| 6 | ScalePod | SocialNetwork | 1,2,3,4 | Functional, Virtualization | ● | 4 | An incorrect scaling operation reduces the number of pods for a service to zero. |
| 7 | AssignNonExistentNode | SocialNetwork | 1,2,3,4 | Functional, Virtualization | ● | 4 | A pod is stuck in a pending/failure status due to assignment to a non-existent node. |
| 8 | NetworkLoss | HotelReservation | 1,2 | Symptomatic | ● | 2 | Network loss causes communication failures for a specific service. |
| 9 | PodFailure | HotelReservation | 1,2 | Symptomatic | ● | 2 | Service interruption due to a pod failure. |
| 10 | Noop | HotelReservation | 1 | - | ● | 2 | No faults injected into the system. |
Note on Extensibility (Ext. column): ① indicates the fault can be easily used to construct other problems; ● denotes there is some manual effort needed to create new problems; while ○ means the fault is specific to some problems and cannot be applied to create other problems.
Example of a data sample: A problem in AIOPSLAB isn't a static data sample, but an interactive scenario. For example, for "TargetPortMisconfig" (Fault 2) on SocialNetwork's "user-service":

- The system would simulate a Kubernetes misconfiguration for the user-service.
- A workload would be generated against the SocialNetwork frontend.
- Telemetry data (logs, metrics, traces) reflecting the misconfiguration and its impact (e.g., failed requests, error logs from user-service) would be observable by the agent.
- The agent's goal might be to localize the fault to "user-service" (Localization task) or to propose a fix (Mitigation task).

These scenarios were chosen because they represent realistic cloud incidents in complex microservice environments, allowing for a comprehensive evaluation of AIOps agents' diagnostic and mitigation abilities in dynamic settings.
5.2. Evaluation Metrics
AIOPSLAB employs several metrics to evaluate the performance of AIOps agents:

- Correctness:
  - Conceptual Definition: Measures the accuracy of the agent's response, assessing whether it successfully detects, localizes, analyzes, or resolves problems as expected. For localization tasks, correctness can be evaluated based on the top-ranked predictions.
  - Mathematical Formula (Accuracy): $ \mathrm{Accuracy} = \frac{\mathrm{Number~of~Correct~Predictions}}{\mathrm{Total~Number~of~Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: the count of instances where the agent's output matches the ground-truth solution.
    - Total Number of Predictions: the total number of problems or tasks evaluated.
  - For localization tasks, accuracy is often reported as (see the computation sketch after this list):
    - Acc.@1 (Accuracy at 1): the percentage of times the agent's top prediction is correct.
    - Acc.@3 (Accuracy at 3): the percentage of times the correct answer is among the agent's top 3 predictions.
- Time/Steps:
  - Conceptual Definition: Evaluates the efficiency of the AIOps agent for each task type.
  - Metrics:
    - Time-to-Detect (TTD): the time elapsed from the occurrence of a fault to its detection by the agent.
    - Time-to-Mitigate (TTM): the time taken from the detection of a fault to its complete mitigation by the agent.
    - Number of Steps: the count of interactions (actions) an agent takes with AIOPSLAB to solve a problem. This is distinct from the number of requests sent to the backend LLM.
  - No specific mathematical formulas are provided in the paper for these, as they are direct measurements.
- Cost:
  - Conceptual Definition: Measures the computational expense associated with agent operation, specifically for LLM-powered agents.
  - Metric:
    - Tokens: the total number of tokens (input tokens fed to the LLM plus output tokens generated by the LLM) produced by the agents/environment, a proxy for the computational cost of LLM usage.
  - No specific mathematical formula is provided, as this is a direct count.
5.3. Baselines
The paper evaluates two categories of agents/algorithms:

- LLM-based Agents: These are the primary focus, leveraging LLMs for reasoning and interaction.
  - GPT-4-w-SHELL: An LLM (specifically GPT-4-turbo; Achiam et al., 2023) with access to a secure shell for executing commands. This serves as a strong baseline, representing a powerful, general-purpose LLM with basic tool-use capabilities.
  - GPT-3.5-w-SHELL: An LLM (specifically GPT-3.5-turbo), also with secure shell access, serving as a more cost-effective and faster, but potentially less capable, baseline compared to GPT-4-w-SHELL.
  - REACT (Reasoning and Acting) (Yao et al., 2023): An LLM-based agent framework that combines chain-of-thought reasoning (Wei et al., 2022b) with acting in an interleaved manner. It reasons about a problem, plans an action, executes it, and then reasons again based on the observation.
  - FLASH (Workflow Automation Agent) (Zhang et al., 2024b): An AIOps-specific LLM agent that employs a workflow automation system, monitors execution status, decomposes complex instructions, and incorporates hindsight generation to learn from past interactions. The paper notes that a simplified version was developed for this evaluation, as the full version was not publicly available.
- Non-LLM AIOps Algorithms: These represent traditional AIOps methods specialized for certain tasks, using multimodal telemetry data as input. They are included to show the comparative advantage (or disadvantage) of LLM-based agents.
  - For Detection:
    - MKSMC (Multivariate K-sigma score using Monte Carlo) (Çetin and Tasgin, 2020): An anomaly detection method.
  - For Localization:
    - RMLAD (Wang et al., 2020): Likely an anomaly detection or localization algorithm.
    - PDiagnose (Hou et al., 2021): A method for diagnosing performance issues in microservices using heterogeneous data sources.

These baselines were chosen to cover a spectrum from general-purpose, powerful LLMs (with basic tool access) to more specialized LLM agents (REACT, FLASH) and traditional, task-specific AIOps algorithms, allowing for a comprehensive demonstration of AIOPSLAB's evaluation capabilities.
6. Results & Analysis
6.1. Core Results Analysis
The evaluation of AIOps agents on the AIOPSLAB benchmark reveals key insights into their capabilities and limitations across different AIOps tasks. The overall performance is summarized in Table 3, while task-specific results are detailed in Table 4.
The following are the results from Table 3 of the original paper:
| Agent | LoC | Time (s) | # Steps | Tokens | Acc. |
| GPT-4-w-SHELL | 41 | 28.61 | 6.44 | 6,394.5 | 49.15% |
| GPT-3.5-w-SHELL | 41 | 12.44 | 14.70 | 2,557.95 | 15.25% |
| REACT | 49 | 43.79 | 11.50 | 16,941.46 | 55.93% |
| FLASH | 60 | 99.64 | 8.48 | 6,484.25 | 59.32% |
Table 3. Overall performance of different agents. We show the lines of code (LoC) to register the agent in AIOPSLAB, average running time in seconds, average number of steps taken, average tokens used, and accuracy across all problems.
Overall Performance (Table 3):

- Accuracy: FLASH achieves the highest overall accuracy (59.32%), indicating its strength in problem-solving across various tasks. REACT follows closely (55.93%), then GPT-4-w-SHELL (49.15%). GPT-3.5-w-SHELL performs the poorest (15.25%).
- Time (s): GPT-3.5-w-SHELL is the fastest on average (12.44 s), likely due to its lower complexity and tendency to fail quickly. FLASH is the slowest (99.64 s), suggesting more extensive reasoning or interaction.
- # Steps: GPT-3.5-w-SHELL takes the most steps (14.70), often implying inefficient or repetitive actions. GPT-4-w-SHELL takes the fewest (6.44). FLASH and REACT are moderate.
- Tokens: REACT consumes the most tokens (16,941.46), reflecting its verbose chain-of-thought reasoning. GPT-3.5-w-SHELL consumes the least (2,557.95), but also has the lowest accuracy.

These results suggest a trade-off between speed/cost and accuracy, with more sophisticated agents like FLASH and REACT achieving better results at higher computational expense or time.
Task-Specific Performance (Table 4): The following are the results from Table 4 of the original paper:
Detection:

| Agent | Accuracy | Time (s) | # Steps | Input | Output |
| GPT-4-w-SHELL | 69.23% | 7.08 | 3.85 | 5,492 | 132 |
| GPT-3.5-w-SHELL | 23.07% | 11.05 | 13.60 | 1,940.44 | 385.56 |
| REACT | 76.92% | 39.00 | 11.46 | 15,608.08 | 933.15 |
| FLASH | 100% | 78.27 | 6.77 | 12,869.08 | 125.69 |
| MKSMC | 15.38% | 1.00 | N/A | N/A | N/A |

Localization:

| Agent | Acc.@3 | Acc.@1 | Time (s) | # Steps | Input | Output |
| GPT-4-w-SHELL | 61.54% | 61.54% | 7.04 | 4.23 | 4,588.07 | 133.23 |
| GPT-3.5-w-SHELL | 30.77% | 30.77% | 6.26 | 11.92 | 1,784.23 | 217.08 |
| REACT | 69.23% | 53.85% | 38.65 | 11.08 | 4,760.77 | 880.92 |
| FLASH | 61.54% | 46.15% | 56.60 | 5.77 | 1,875.08 | 123.31 |
| PDiagnose | 15.38% | 15.38% | 1.02 | N/A | N/A | N/A |
| RMLAD | 7.69% | 7.69% | 1.98 | N/A | N/A | N/A |

Root Cause Analysis (RCA):

| Agent | Accuracy | Time (s) | # Steps | Input | Output |
| GPT-4-w-SHELL | 40.90% | 8.68 | 4.81 | 4,297.91 | 176.18 |
| GPT-3.5-w-SHELL | 9.09% | 10.06 | 14.00 | 1,495.55 | 406.27 |
| REACT | 45.45% | 32.16 | 8.00 | 16,276.09 | 757.27 |
| FLASH | 36.36% | 59.00 | 6.09 | 1,193.90 | 152.45 |

Mitigation:

| Agent | Accuracy | Time (s) | # Steps | Input | Output |
| GPT-4-w-SHELL | 27.27% | 99.47 | 13.72 | 10,142.55 | 1,060.00 |
| GPT-3.5-w-SHELL | 0% | 23.78 | 20.00 | 3,178.33 | 967.71 |
| REACT | 36.36% | 67.18 | 15.54 | 29,211.90 | 1,464.90 |
| FLASH | 54.55% | 216.41 | 16.09 | 8,469.00 | 760.36 |
Table 4. Agent performance by task. This table summarizes the performance of different agents across various tasks, including detection, localization, RCA, and mitigation. Acc. stands for accuracy. Input/Output represents the number of tokens given to and produced by the agent, respectively.
a) Detection Task:

- FLASH achieves 100% accuracy, significantly outperforming all other LLM agents and traditional methods. REACT (76.92%) and GPT-4-w-SHELL (69.23%) also perform well.
- The traditional MKSMC method has very low accuracy (15.38%). This confirms that LLM agents are strong at simple detection.

b) Localization Task:

- REACT shows the best Acc.@3 (69.23%), indicating it often includes the correct answer in its top 3 predictions. GPT-4-w-SHELL performs best in Acc.@1 (61.54%), meaning its top prediction is more often correct.
- Traditional methods PDiagnose and RMLAD (15.38% and 7.69%, respectively) are notably poor, highlighting the advantage of LLM agents in this interactive task.

c) RCA (Root Cause Analysis) Task:

- This task proves more challenging. REACT leads with 45.45% accuracy, followed by GPT-4-w-SHELL (40.90%).
- FLASH surprisingly underperforms here (36.36%), while GPT-3.5-w-SHELL is very weak (9.09%).
- RCA requires deeper understanding and reasoning, where current LLM agents still have significant room for improvement.

d) Mitigation Task:

- This is the most challenging task. FLASH achieves the highest accuracy (54.55%), but with the longest average time (216.41 s).
- REACT is next (36.36%). GPT-4-w-SHELL has low accuracy (27.27%), and GPT-3.5-w-SHELL completely fails (0%) to mitigate any faults.
- The high time and token consumption for mitigation indicate the complexity of interacting with the environment to fix issues.
Overall Observations:
- LLM agents vs. Traditional AIOps: For detection and localization, LLM agents (especially FLASH, REACT, and GPT-4-w-SHELL) significantly outperform traditional non-LLM AIOps methods, demonstrating their advantage in interactive problem-solving.
- Problem Difficulty: The RCA and mitigation tasks are substantially harder for all agents, highlighting the gap between current LLM capabilities and the full vision of AgentOps. No agent consistently achieves high accuracy across all task categories.
- Cost-Performance Trade-offs: While GPT-3.5-w-SHELL is fast and cheap, its accuracy is unacceptably low. More capable agents like FLASH and REACT are slower and more expensive but deliver better results.
6.2. Ablation Studies / Parameter Analysis
The paper includes an analysis of the influence of the step limit on agent performance, which can be seen as a form of parameter analysis.
The following figure (Figure 5 from the original paper) shows agent performance vs. number of steps taken:
Figure 5. Agent performance vs. number of steps taken.
- Impact of Step Limit: The maximum number of allowed steps significantly affects agent performance.
  - REACT and FLASH show improved accuracy as the number of steps increases, with FLASH reaching its peak accuracy of 59.32% at 20 steps. This indicates that these agents can leverage more interactions with the environment to refine their understanding and actions.
  - GPT-4-w-SHELL also shows a general upward trend, but with less pronounced gains after around 10-15 steps.
  - For GPT-3.5-TURBO, increasing the step limit beyond 5 does not lead to better performance; instead, it primarily increases token consumption without improving accuracy. This suggests GPT-3.5-TURBO may lack the deeper reasoning or effective self-correction mechanisms needed to benefit from more interaction steps on AIOps problems.
- Self-repair Saturation: The plateauing of accuracy after a certain number of steps for some agents suggests that self-repair with environment feedback can saturate quickly in AIOps problems. This contrasts with development tasks (like code generation), where continuous feedback (linters, type checkers, tests) allows for more sustained improvement. This implies a need for:
  - better task decomposition and planning for AIOps problems;
  - improved feedback mechanisms for intermediate steps;
  - solutions that go beyond simple environment feedback and self-repair.
6.3. Agent Behavior: The Good, the Bad and the Gaps
The paper further analyzes specific behaviors, including API usage patterns and common failure modes.
The following figure (Figure 6 from the original paper) shows the total percentage of actions taken by different agents:
Figure 6. Total percentage of actions taken by different agents.
The following are the results from Table 5 of the original paper:
| Agent | Kubectl Get | Kubectl Describe | Kubectl Exec | Cat | Other |
| GPT-4-w-SHELL | 21.84% | 2.06% | 0.14% | 1.92% | 0.77% |
| GPT-3.5-w-SHELL | 27.22% | 1.52% | 0.19% | 3.62% | 0.95% |
| REACT | 19.70% | 1.49% | 0.00% | 1.39% | 0.14% |
| FLASH | 27.35% | 1.18% | 0.00% | 0.00% | 0.00% |
Table 5. Occurrences of system commands.
Telemetry API Usage (Figure 6): get_logs is the most frequently used API across all agents, followed by get_metrics. get_traces is used less frequently, and FLASH notably does not use get_traces at all. This suggests agents prioritize log and metric data, possibly due to their perceived directness or easier interpretability for LLMs.
System Command Usage (Table 5): kubectl get is the most common shell command across agents, indicating a tendency to query Kubernetes resources for information. cat is also used, suggesting agents sometimes view raw log/metric files. kubectl describe is used less, and kubectl exec (for executing commands within a pod) is very rare.
6.3.1. Wasting steps on unnecessary actions
- Agents often waste steps by repeatedly calling the same API, generating non-existent APIs, or engaging in excessive multi-agent communication.
- GPT-3.5-w-SHELL is particularly prone to generating incorrect API commands in loops, leading to repeated execution errors. This indicates a lack of robust error handling or self-correction.
- Over-reliance on telemetry APIs without careful analysis can overwhelm the LLM's input context window and lead to token exhaustion, adding noise rather than useful information. This points to a need for more refined telemetry data processing and filtering mechanisms.
6.3.2. Invalid API usage
- Agents struggle with improper formatting of API calls. GPT-3.5-w-SHELL frequently generates syntactically incorrect commands or malformed parameters, often apologizing and then repeating the same mistake.
- REACT occasionally generates incorrect API commands but demonstrates better self-correction, reasoning through errors and adjusting its commands in subsequent steps. The paper provides an example where REACT uses an incorrect parameter for get_logs, receives an error, and then correctly uses exec_shell to list services and find the correct name. This highlights the importance of robust reasoning-and-acting capabilities (a sketch of this recovery pattern follows this list).
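The recovery pattern described for REACT can be read as an error-driven retry. The sketch below is a hypothetical rendering of that control flow; get_logs and exec_shell come from the paper's ACI, while the signatures and parsing logic are assumptions.

```python
# Hypothetical rendering of REACT's observed self-correction: a failed get_logs
# call triggers discovery of valid service names via exec_shell, then a retry.
def fetch_logs_with_recovery(aci, namespace: str, guessed_service: str) -> str:
    try:
        return aci.get_logs(namespace, guessed_service)
    except Exception as first_error:        # e.g. "service not found"
        # List the real services with the shell, mirroring the paper's example.
        listing = aci.exec_shell(f"kubectl get services -n {namespace}")
        names = [line.split()[0] for line in listing.splitlines()[1:] if line.strip()]
        for service in names:
            try:
                return aci.get_logs(namespace, service)
            except Exception:
                continue                    # try the next candidate service
        raise first_error                   # give up: re-raise the original error
```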
6.3.3. False positive detection issues
- In "no operation" (
Noop, Fault 10) problems where no faults were injected, onlyGPT-4-w-SHELLcorrectly identified the system as normal. - Other agents reported
false positives, misinterpreting normal system activities (e.g., standard workload generation) as faults. This is a critical issue forAIOps, asfalse positivescan lead to unnecessary alerts and wasted human effort.
6.4. Data Presentation (Tables)
The following are the results from Table 2 of the original paper:
| No. | Fault | Application | Task Level | Category | Ext. | #Problems | Description |
| 1 | AuthenticationMissing | HotelReservation | 1,2,3,4 | Functional (Virtualization) | ① | 4 | Missing authentication credentials cause access denial to MongoDB. |
| 2 | TargetPortMisconfig | SocialNetwork | 1,2,3,4 | Functional (Virtualization) | ● | 12 | The service cannot connect to the specified port due to misconfiguration. |
| 3 | RevokeAuth | HotelReservation | 1,2,3,4 | Functional (Application) | ① | 8 | Revoked authentication causes database connection failure. |
| 4 | UserUnregistered | HotelReservation | 1,2,3,4 | Functional (Application) | ① | 8 | The database service has access failures after the user was unregistered. |
| 5 | BuggyAppImage | HotelReservation | 1,2,3,4 | Functional (Application) | ○ | 4 | A connection code bug in the application image causes access issues. |
| 6 | ScalePod | SocialNetwork | 1,2,3,4 | Functional (Virtualization) | ● | 4 | An incorrect scaling operation sets the number of pods for a service to zero. |
| 7 | AssignNonExistentNode | SocialNetwork | 1,2,3,4 | Functional (Virtualization) | ● | 4 | A pod remains in a pending/failure status due to assignment to a non-existent node. |
| 8 | NetworkLoss | HotelReservation | 1,2 | Symptomatic | ● | 2 | Network loss causes communication failures for a specific service. |
| 9 | PodFailure | HotelReservation | 1,2 | Symptomatic | ● | 2 | Service interruption due to a pod failure. |
| 10 | Noop | HotelReservation | 1 | - | ● | 2 | No faults injected into the system. |
Note on Extensibility (Ext. column): ① indicates the fault can be easily used to construct other problems; ● denotes there is some manual effort needed to create new problems; while ○ means the fault is specific to some problems and cannot be applied to create other problems.
The following are the results from Table 3 of the original paper:
| Agent | LoC | Time (s) | # Steps | Tokens | Acc. |
| GPT-4-w-SHELL | 41 | 28.61 | 6.44 | 6,394.5 | 49.15% |
| GPT-3.5-w-SHELL | 41 | 12.44 | 14.70 | 2,557.95 | 15.25% |
| REACT | 49 | 43.79 | 11.50 | 16,941.46 | 55.93% |
| FLASH | 60 | 99.64 | 8.48 | 6,484.25 | 59.32% |
Table 3. Overall performance of different agents. We show the lines of code (LoC) to register the agent in AIOPSLAB, average running time in seconds, average number of steps taken, average tokens used, and accuracy across all problems.
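The LoC column counts the glue code needed to plug an agent into the framework. As a rough, assumed illustration of what such a registration shim might look like (class, method, and hook names are hypothetical; the released AIOPSLAB code defines the real interface):

```python
# Rough, assumed sketch of an agent registration shim for AIOPSLAB-style
# evaluation. All names here are illustrative, not the framework's actual API.
class ShellAgent:
    """Wraps an LLM so an orchestrator can query it for the next action."""

    def __init__(self, llm_client, system_prompt: str):
        self.llm = llm_client
        self.history = [{"role": "system", "content": system_prompt}]

    def get_action(self, observation: str) -> str:
        # Append the latest environment feedback and ask the LLM what to do next.
        self.history.append({"role": "user", "content": observation})
        reply = self.llm.complete(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply  # e.g. an exec_shell(...) or get_logs(...) call as text


# orchestrator.register_agent(ShellAgent(llm_client, system_prompt))  # hypothetical hook
```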
The following are the results from Table 4(a) (Detection Task) of the original paper:
| Agent | Accuracy | Time (s) | # Steps | Input Tokens | Output Tokens |
| GPT-4-w-SHELL | 69.23% | 7.08 | 3.85 | 5,492 | 132 |
| GPT-3.5-w-SHELL | 23.07% | 11.05 | 13.60 | 1,940.44 | 385.56 |
| REACT | 76.92% | 39.00 | 11.46 | 15,608.08 | 933.15 |
| FLASH | 100% | 78.27 | 6.77 | 12,869.08 | 125.69 |
| MKSMC | 15.38% | 1.00 | N/A | N/A | N/A |
The following are the results from Table 4(b) (Localization Task) of the original paper:
| Agent | Acc.@3 | Acc.@1 | Time (s) | # Steps | Input Tokens | Output Tokens |
| GPT-4-w-SHELL | 61.54% | 61.54% | 7.04 | 4.23 | 4,588.07 | 133.23 |
| GPT-3.5-w-SHELL | 30.77% | 30.77% | 6.26 | 11.92 | 1,784.23 | 217.08 |
| REACT | 69.23% | 53.85% | 38.65 | 11.08 | 4,760.77 | 880.92 |
| FLASH | 61.54% | 46.15% | 56.60 | 5.77 | 1,875.08 | 123.31 |
| DDDAGOSE | 15.38% | 15.38% | 1.02 | N/A | N/A | N/A |
| RMLAD | 7.69% | 7.69% | 1.98 | N/A | N/A | N/A |
The following are the results from Table 4(c) (RCA Task) of the original paper:
| Agent | Accuracy | Time (s) | # Steps | Input Tokens | Output Tokens |
| GPT-4-w-SHELL | 40.90% | 8.68 | 4.81 | 4,297.91 | 176.18 |
| GPT-3.5-w-SHELL | 9.09% | 10.06 | 14.00 | 1,495.55 | 406.27 |
| REACT | 45.45% | 32.16 | 8.00 | 16,276.09 | 757.27 |
| FLASH | 36.36% | 59.00 | 6.09 | 1,193.90 | 152.45 |
The following are the results from Table 4(d) (Mitigation Task) of the original paper:
| Agent | Accuracy | Time (s) | # Steps | Input Tokens | Output Tokens |
| GPT-4-w-SHELL | 27.27% | 99.47 | 13.72 | 10,142.55 | 1,060.00 |
| GPT-3.5-w-SHELL | 0% | 23.78 | 20.00 | 3,178.33 | 967.71 |
| REACT | 36.36% | 67.18 | 15.54 | 29,211.90 | 1,464.90 |
| FLASH | 54.55% | 216.41 | 16.09 | 8,469.00 | 760.36 |
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces AIOPSLAB, a novel and holistic framework for the design, development, and comprehensive evaluation of autonomous AI agents targeting IT operations in cloud environments, a paradigm termed AgentOps. The framework integrates key components: a fault injector, workload generator, cloud-agent orchestrator with an Agent-Cloud Interface (ACI), and telemetry observer. This setup allows for the simulation of realistic cloud incidents and enables AI agents to interact dynamically with the environment.
Through AIOPSLAB, the authors constructed a benchmark suite of 48 diverse problems spanning detection, localization, root cause analysis (RCA), and mitigation tasks. The evaluation of four state-of-the-art LLM-based agents (GPT-4-w-SHELL, GPT-3.5-w-SHELL, REACT, and FLASH) on this benchmark demonstrated that while LLM agents show significant promise, particularly in detection and localization tasks where they outperform traditional AIOps algorithms, they still face considerable challenges in more complex RCA and mitigation scenarios. The paper provides detailed insights into agent behaviors, including issues like wasted steps, invalid API usage, context window limitations, false positives, and the saturation of self-repair mechanisms. By committing to making AIOPSLAB publicly available, the authors aim to foster further research and development in AgentOps.
7.2. Limitations & Future Work
The paper explicitly and implicitly highlights several limitations and suggests avenues for future research:
- Current Agent Limitations in Complex Tasks: LLM agents struggle significantly with RCA and mitigation tasks. This points to a need for LLMs with more robust reasoning, planning, and long-term memory capabilities tailored for sequential decision-making in IT operations.
- Inefficient Agent Behaviors: Observations such as agents wasting steps, repeatedly making the same API usage errors (especially GPT-3.5-w-SHELL), and self-repair saturation indicate that current agentic frameworks need improvement.
  - Need for Better Task Decomposition and Planning: The quick saturation of self-repair suggests that agents require better internal planning mechanisms to break down complex AIOps problems into manageable sub-tasks.
  - Improved Intermediate Feedback: Beyond simple environment feedback, agents could benefit from more structured and informative feedback during intermediate steps, similar to how linters and test cases aid software development.
- Context Window Management: The issue of telemetry data overwhelming the LLM's context window, leading to token exhaustion and distraction, is a fundamental LLM challenge. Future work needs to focus on more refined telemetry data processing, filtering, and summarization techniques to provide agents with relevant information without cognitive overload.
- Qualitative Evaluation: For tasks like detection, agents might provide a correct answer but with incorrect reasoning. The paper suggests utilizing LLMs-as-Judges to perform more fine-grained qualitative evaluation of agent reasoning chains against problem descriptions (a sketch of such a judge follows this list).
- Extensibility of AIOPSLAB: While AIOPSLAB is designed to be extensible, the paper notes that some complex functional faults (e.g., AuthenticationMissing, RevokeAuth) require manual effort to set up (e.g., preparing scripts, updating Kubernetes config maps). Simplifying the definition and injection of such complex faults would enhance the framework's usability.
- Broader Fault Types and Problem Scenarios: The framework is adaptable to other fault types (e.g., anomaly detection workloads) and problem scenarios (e.g., requiring agents to label telemetry data). This is an ongoing area for expanding the benchmark.
- Specific Agent Implementations: The FLASH agent used in the evaluation was a simplified version due to its unavailability, implying that its full potential might not have been captured; further evaluation with a complete version would be beneficial.
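As a complement to the Qualitative Evaluation point above, a hedged sketch of what an LLM-as-Judge check could look like is shown below; the prompt, scoring rubric, and judge_llm interface are all assumptions for illustration.

```python
# Assumed LLM-as-Judge sketch: score an agent's reasoning chain against the
# ground-truth problem description. Prompt, rubric, and client are illustrative.
JUDGE_PROMPT = """You are grading an AIOps agent's diagnosis.
Problem description (ground truth): {problem}
Agent's reasoning and final answer: {transcript}
Score the reasoning from 1 (unsupported guess) to 5 (fully grounded in evidence),
then justify the score in one sentence. Respond as: SCORE: <n> | REASON: <text>"""


def judge_reasoning(judge_llm, problem: str, transcript: str) -> int:
    reply = judge_llm.complete(
        JUDGE_PROMPT.format(problem=problem, transcript=transcript)
    )
    # Parse "SCORE: <n> | ..."; default to the lowest score if parsing fails.
    try:
        return int(reply.split("SCORE:")[1].split("|")[0].strip())
    except (IndexError, ValueError):
        return 1
```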
7.3. Personal Insights & Critique
AIOPSLAB is a highly valuable contribution to the AIOps and AI agent research landscape. Its holistic, interactive, and realistic approach fills a significant gap in existing benchmarks, which often fall short in simulating the dynamic and multi-faceted nature of real-world IT operations.
Strengths:
- Pioneering AgentOps Evaluation: The paper clearly articulates the AgentOps vision and provides a concrete framework to evaluate LLM-based agents in this context. This is crucial for advancing the field beyond isolated AIOps tasks.
- Realism and Interaction: The use of live microservice applications, diverse symptomatic and functional faults, and the Agent-Cloud Interface (ACI) creates a genuinely interactive and realistic testing ground, far superior to static datasets. This allows for the study of dynamic agent behaviors, self-correction, and tool use.
- Comprehensive Problem Taxonomy: The four-level task taxonomy is well-defined and progressively challenging, offering a structured way to assess agent capabilities from simple detection to complex mitigation.
- Actionable Insights: The detailed analysis of agent failure modes, API usage patterns, and the impact of step limits provides valuable guidance for AI agent developers, pointing to specific areas for improvement (e.g., planning, context management, error handling). The observation about self-repair saturation is particularly significant for future agentic AI development.
- Commitment to Open Source: Making AIOPSLAB publicly available is a significant boon to the research community, enabling reproducibility, comparative studies, and collaborative development.
Potential Issues & Areas for Improvement:
- Scalability of the Benchmark: While 48 problems are a good start, real-world cloud environments are vastly more complex, with thousands of services and countless potential incident scenarios. Expanding the problem pool automatically and dynamically could be a future challenge.
- Complexity of Fault Injection: As noted, injecting some functional faults requires manual setup. Further automation, or a more intuitive declarative language for defining complex multi-service, multi-stage faults, would be beneficial.
- Security of exec_shell: The paper mentions security policy filters for exec_shell. Given that AI agents could potentially execute arbitrary commands, the robustness and restrictiveness of these filters are critical for real-world application; they need to be thoroughly detailed or customizable within the framework to prevent unintended consequences or malicious actions by an agent.
- "LLM-as-Judge" Bias: While LLMs-as-Judges offer a promising avenue for qualitative evaluation of reasoning, LLMs themselves can exhibit biases or inconsistencies. Care must be taken in designing the judging criteria and validating the judge LLM's fairness and accuracy.
- Beyond Reactive Agents: The current agents, even REACT and FLASH, are largely reactive (perceive, then act). Future agents could incorporate more proactive elements, such as predictive maintenance, anomaly prevention, or self-optimization, requiring an even more sophisticated evaluation framework.
- Evaluation of Non-LLM Agents: While the paper includes some traditional AIOps methods, the primary focus and benchmark design are clearly geared towards LLM-based agents. A more dedicated set of metrics or evaluation scenarios that specifically highlights the strengths and weaknesses of non-LLM, specialized AIOps algorithms could offer a richer comparative analysis.

Overall, AIOPSLAB is an impressive and timely research effort that pushes the boundaries of AIOps by providing a much-needed robust platform for evaluating AI agents in complex, dynamic cloud environments. Its insights are invaluable for guiding the next generation of autonomous cloud management systems.