AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
TL;DR Summary
AIOPSLAB is introduced as a framework to evaluate AI agents for automating IT operations in complex cloud environments. It integrates fault injection, workload generation, and telemetry export, enabling the design and assessment of end-to-end AI solutions, and showcasing the potential and limitations of state-of-the-art LLM agents on complex operational tasks.
Abstract
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of this paper is "AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds". It focuses on developing a comprehensive framework for designing, developing, and evaluating AI agents, particularly Large Language Model (LLM)-based agents, for automating IT operations in cloud environments, leading to what the authors term AgentOps.
1.2. Authors
The authors of this paper are:
- Yinfang Chen
- Manish Shetty
- Gagan Somashekar
- Minghua Ma
- Yogesh Simmhan
- Jonathan Mace
- Chetan Bansal
- Rujia Wang
- Saravan Rajmohan

The affiliation superscripts attached to the author names are not resolved in the available metadata. Based on the nature of AIOps research and the affiliations typically associated with it (often a mix of academia and industry, particularly cloud providers and software companies), the authors most likely come from leading tech companies (e.g., corporate research labs) and universities.
1.3. Journal/Conference
This paper is published as a preprint on arXiv. arXiv (pronounced "archive") is an open-access repository for electronic preprints of scientific papers, primarily in the fields of mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It is a highly influential platform for rapid dissemination of research findings before formal peer review and publication in academic journals or conference proceedings. Its reputation is significant for quickly sharing cutting-edge research.
1.4. Publication Year
The paper was published on January 12, 2025 (arXiv timestamp 2025-01-12), making it a very recent preprint.
1.5. Abstract
The paper addresses the growing complexity of IT operations (ITOps) in cloud environments, which AI for IT Operations (AIOps) aims to automate. While traditional DevOps tools and AIOps algorithms handle isolated tasks, recent advancements in Large Language Models (LLMs) and AI agents are enabling end-to-end and multi-task automation, envisioning self-healing cloud systems—a paradigm termed AgentOps.
To realize this AgentOps vision, the authors propose AIOPSLAB, a comprehensive framework designed to guide the design, development, and evaluation of these AI agents. AIOPSLAB can deploy microservice cloud environments, inject faults, generate workloads, and export telemetry data. Crucially, it orchestrates these components and provides interfaces for agent interaction and evaluation.
The paper discusses the key requirements for such a holistic framework and demonstrates AIOPSLAB's utility by evaluating state-of-the-art LLM agents within the benchmark it creates. This evaluation provides insights into the capabilities and limitations of LLM agents in handling complex operational tasks in cloud environments.
1.6. Original Source Link
The paper is available as a preprint:
- Original Source Link: https://arxiv.org/abs/2501.06706v1
- PDF Link: https://arxiv.org/pdf/2501.06706.pdf

It is currently a preprint, meaning it has not yet undergone formal peer review and publication in a journal or conference.
2. Executive Summary
2.1. Background & Motivation
The rapid adoption of hyper-scale, cloud-based systems, characterized by distributed architectures like microservices and serverless computing, has introduced unprecedented operational complexity. Managing incidents in these environments is a significant challenge, with potential outages leading to massive financial losses (e.g., $100 million per hour for an Amazon outage).
To address these challenges, the field of AIOps (Artificial Intelligence for IT Operations) emerged, aiming to automate complex operational tasks and eventually achieve autonomous self-healing clouds. While AIOps has existed for over a decade, traditional DevOps tools and AIOps algorithms typically focus on isolated tasks, such as fault localization or root cause analysis (RCA). They lack the capability for end-to-end and multi-task automation across the entire incident lifecycle.
Recent advancements in Large Language Models (LLMs) and AI agents have begun to bridge this gap, allowing AI agents to interact dynamically with environments and manage operational tasks autonomously. This evolution points towards a new paradigm called AgentOps, where agents can make real-time decisions and execute end-to-end actions to ensure system reliability.
A critical barrier to realizing AgentOps is the lack of high-quality, comprehensive benchmarks that can simulate diverse, realistic cloud scenarios and allow for the interactive evaluation of AI agents. Existing benchmarks often rely on static datasets or focus on the 'Dev' side of DevOps, failing to capture the dynamic, unpredictable, and evolving nature of real-world cloud operations. Furthermore, many current AgentOps efforts use proprietary services and datasets, hindering broader research and development. This paper aims to fill this gap by proposing a holistic framework for AgentOps evaluation.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of AIOps and AI agents for cloud operations:
- Requirements and Challenges for a Holistic Framework: The authors identify and discuss the key requirements and inherent challenges in building a comprehensive framework capable of supporting the design, development, and evaluation of autonomous AIOps agents. This lays a foundation for future research in the area.
- Introduction of the AIOPSLAB Framework: The paper develops and introduces AIOPSLAB, an innovative, holistic framework designed to automatically manage the entire end-to-end evaluation process for AIOps solutions. Its capabilities include:
  - deploying microservice cloud environments;
  - injecting diverse faults;
  - generating realistic workloads;
  - exporting comprehensive telemetry data (logs, metrics, traces);
  - crucially, orchestrating these components and providing a unified Agent-Cloud Interface (ACI) for agents to interact with and be evaluated in the cloud environment.
- Construction of a Benchmark Suite: Leveraging the AIOPSLAB framework, the authors construct a benchmark suite comprising 48 distinct problems. These problems cover various AIOps tasks (detection, localization, root cause analysis, and mitigation) within an interactive environment, creating a realistic testing ground for LLM-based agents.
- Evaluation of State-of-the-Art LLM-based Agents: The paper demonstrates AIOPSLAB's utility by evaluating four state-of-the-art LLM-based agents against its benchmark suite. This evaluation provides crucial insights into the current capabilities and limitations of these agents when tackling complex operational tasks in dynamic cloud environments.
- Detailed Analysis of Agent Performance: Through the evaluations, the paper offers a detailed analysis of agent performance, highlighting specific failure modes, challenges, and opportunities for improvement. Key findings include:
  - LLM agents show promise but struggle with complex tasks like RCA and mitigation.
  - Agents often waste steps on unnecessary actions or exhibit invalid API usage.
  - Context window limitations due to verbose telemetry data can hinder performance.
  - Self-repair mechanisms can quickly saturate, suggesting a need for better task decomposition and intermediate feedback.
  - Some agents (e.g., GPT-3.5-w-SHELL) repeatedly make the same errors, while others (REACT) can self-correct.
  - False positive detection remains an issue for some agents.
- Public Availability: The authors commit to making AIOPSLAB publicly available, which will significantly benefit the research community by providing a standardized and interactive platform for AgentOps research and development.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with several fundamental concepts in cloud computing, IT operations, and artificial intelligence.
-
Cloud Computing and Microservices:
- Cloud Computing: A model for delivering computing services—including servers, storage, databases, networking, software, analytics, and intelligence—over the Internet ("the cloud"). It offers faster innovation, flexible resources, and economies of scale. Examples include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
- Microservices: An architectural style that structures an application as a collection of loosely coupled, independently deployable, small services. Each service runs in its own process and communicates with others using lightweight mechanisms (e.g., HTTP APIs). This approach enhances scalability, agility, and resilience but increases operational complexity due to distribution.
- Serverless Computing: A cloud execution model where the cloud provider dynamically manages the allocation and provisioning of servers. Developers write and deploy code (functions) without managing underlying infrastructure. It's an evolution often intertwined with microservices, further abstracting infrastructure.
- Kubernetes: An open-source container-orchestration system for automating deployment, scaling, and management of containerized applications. It groups containers into logical units for easy management and discovery. Many microservice applications are deployed on Kubernetes.
-
IT Operations (ITOps) and DevOps:
- IT Operations (ITOps): The processes and services managed by an IT department to maintain the smooth functioning of an organization's technology infrastructure, including deployment, monitoring, incident management, and maintenance.
- DevOps (Development and Operations): A set of practices that combines software development (Dev) and IT operations (Ops). It aims to shorten the systems development life cycle and provide continuous delivery with high software quality. Key principles include automation, continuous integration/continuous delivery (CI/CD), and collaboration.
-
AIOps and AgentOps:
- AIOps (AI for IT Operations): The application of Artificial Intelligence (AI) and Machine Learning (ML) to automate IT operations. Its goal is to analyze large volumes of IT data (logs, metrics, traces) to detect anomalies, diagnose problems, and predict issues proactively, reducing human intervention.
- AgentOps (Agent for Operations): A new paradigm proposed in this paper, representing an evolution of AIOps where autonomous AI agents (especially LLM-based ones) manage end-to-end operational tasks across the entire IT stack, leading to self-healing cloud systems. Unlike traditional AIOps, which might focus on specific tasks, AgentOps envisions agents that perform multi-task, holistic management.
- Self-healing cloud systems: Cloud environments that can automatically detect, diagnose, and resolve issues (e.g., faults, performance degradation) without human intervention, ensuring continuous availability and optimal performance.
-
Artificial Intelligence Concepts:
- Large Language Models (LLMs): A class of AI models trained on massive amounts of text data, enabling them to understand, generate, and reason with human language. They can perform a wide range of natural language processing tasks, including summarization, translation, question answering, and code generation. LLMs form the core of the AI agents evaluated in this paper.
- AI Agents: LLMs augmented with external tools and the ability to interact with an environment. They can perceive their environment, reason about it, take actions using tools, and observe the outcomes, learning to achieve goals autonomously. For AIOps, this means interacting with cloud environments via APIs and command-line interfaces, and receiving telemetry feedback.
- Telemetry Data: Data collected from IT systems to monitor their performance, health, and behavior. It typically includes:
  - Logs: Time-stamped records of events that occur within a system or application, useful for debugging and auditing.
  - Metrics: Numerical measurements collected over time (e.g., CPU utilization, memory usage, network latency, request rates), often presented as time series data.
  - Traces: End-to-end records of requests as they propagate through a distributed system, showing the sequence of operations and dependencies between services. Essential for understanding request flow in microservices.
- Fault Injection: A technique used to test the resilience of systems by deliberately introducing errors or failures. This helps identify vulnerabilities and verify error handling mechanisms.
- Root Cause Analysis (RCA): A systematic process for identifying the underlying causes of problems or incidents in a system, rather than just addressing the symptoms.
3.2. Previous Works
The paper frames its contribution in the context of existing efforts in AIOps and AI agent development, highlighting their limitations.
- Traditional AIOps Tools and Algorithms: Prior AIOps approaches have primarily focused on solving isolated operational tasks.
  - Anomaly Detection: Identifying unusual patterns in data (e.g., MKsMC by Çetin and Tasgin, 2020).
  - Fault Localization: Pinpointing the specific component or service responsible for a fault (e.g., RMLAD by Wang et al., 2020; PDiagnose by Hou et al., 2021).
  - These methods often excel at their specific tasks but lack the ability to manage the entire incident lifecycle or interact dynamically with the cloud environment.
- Existing Benchmarks for AI Agents:
  - For the 'Dev' side of DevOps (software development), several benchmarks exist that leverage AI agents:
    - WebArena (Zhou et al., 2023): A realistic web environment for building autonomous agents, testing their ability to interact with web applications.
    - R2E (Jain et al., 2024b): Turns any GitHub repository into a programming agent environment for evaluating code-related tasks.
    - HumanEval (Chen et al., 2021): A benchmark for evaluating the code generation capabilities of LLMs.
    - LiveCodeBench (Jain et al., 2024a): A holistic and contamination-free code evaluation for LLMs.
    - SWE-bench (Jimenez et al., 2024): Evaluates LLMs on real-world GitHub issues.
  - These benchmarks primarily focus on software development tasks and do not adequately simulate the complexities of IT operations in dynamic cloud environments.
- AIOps Benchmarks:
  - Existing AIOps benchmarks often rely on static datasets:
    - System metrics (Han et al., 2022; Jacob et al., 2020): Typically time series data, which allows for offline analysis but not interactive agent evaluation.
    - Fixed question-answer format (Liu et al., 2023): Lacks the dynamic interaction required for AgentOps.
  - Many recent AgentOps efforts (e.g., RCACopilot by Chen et al., 2024; RCAgent by Wang et al., 2023; MonitorAssistant by Yu et al., 2024a; Xpert by Jiang et al., 2024) use proprietary services and datasets, hindering public access and comparative evaluation.
  - Crucially, these benchmarks often focus only on isolated aspects of the incident lifecycle, providing no cohesive framework to evaluate AIOps agents comprehensively across multiple tasks or decision-making for chaining algorithms.
3.3. Technological Evolution
The evolution of IT operations has moved through several stages:
- Manual Operations: Initial stage, highly human-intensive, slow, error-prone, and not scalable for complex systems.
- Scripted Automation/Traditional DevOps: Introduction of scripts and automation tools to handle repetitive tasks and streamline DevOps pipelines. This improved efficiency but still required human oversight and custom scripting for each scenario.
- Traditional AIOps Algorithms: Application of ML algorithms for specific tasks like anomaly detection, fault localization, and log analysis. These algorithms provided insights and partial automation but typically operated on passive data or addressed isolated problems, without integrated interaction with the operational environment.
- LLM-based AIOps (Emerging): The rise of LLMs introduced the potential for more intelligent, context-aware analysis and human-like interaction. Initial applications included incident summarization, generating recommendations, and basic Q&A over IT data.
- AI Agents for AIOps / AgentOps (Current Frontier): The integration of LLMs with external tools and decision-making loops, allowing them to act as autonomous agents. These agents can perceive the cloud environment, reason about problems, execute actions (e.g., using APIs or shell commands), and learn from feedback, managing the entire incident lifecycle. This is where AIOPSLAB positions itself, pushing the boundaries towards self-healing cloud systems.
3.4. Differentiation Analysis
AIOPSLAB distinguishes itself from previous works by addressing several critical limitations:

- Holistic and End-to-End Evaluation: Unlike traditional AIOps benchmarks that focus on isolated tasks (e.g., only detection or localization), AIOPSLAB provides a unified framework for evaluating agents across the entire incident management lifecycle, encompassing detection, localization, Root Cause Analysis (RCA), and mitigation. This allows for a more comprehensive assessment of AgentOps capabilities.
- Dynamic and Interactive Environment: A key innovation is AIOPSLAB's ability to simulate dynamic cloud environments. Most prior AIOps benchmarks relied on static datasets, which cannot capture the real-time, unpredictable nature of operational incidents or allow agents to interact with and modify the environment. AIOPSLAB facilitates agent-cloud interaction via its Agent-Cloud Interface (ACI), enabling dynamic decision-making and feedback loops.
- Focus on LLM-based AI Agents: The framework is specifically designed to evaluate next-generation AIOps agents powered by LLMs, which are capable of complex reasoning and tool use. This is a critical departure from evaluating traditional AIOps algorithms, which often lack interactive capabilities.
- Realistic Problem Scenarios with Functional Faults: AIOPSLAB goes beyond simple symptomatic faults (e.g., crash failures) by incorporating functional faults (e.g., misconfigurations, software bugs). These fine-grained root causes pose a greater challenge to AI agents, requiring deeper diagnostic and mitigation abilities and thus leading to more realistic evaluation scenarios.
- Open and Extensible Framework: By committing to public availability and designing AIOPSLAB with modular components (fault library, workload generators, telemetry observer), the authors provide an extensible platform. This contrasts with many LLM-based cloud management efforts that rely on proprietary services and closed datasets, making replication and comparative research difficult.
- Unified Interface (ACI): The Agent-Cloud Interface (ACI) abstracts the complexity of the cloud environment, providing a standardized set of APIs for agents to interact with. This simplifies agent design and allows for consistent evaluation across different agent architectures.

In essence, AIOPSLAB shifts the evaluation paradigm from passive analysis on static data to active, interactive problem-solving within a simulated, dynamic cloud environment, specifically tailored for the burgeoning field of LLM-driven AI agents in IT operations.
4. Methodology
4.1. Principles
The core idea behind AIOPSLAB is to provide a holistic and interactive environment for the design, development, and evaluation of autonomous AIOps agents, particularly those powered by LLMs. The theoretical basis and intuition are rooted in the need to move beyond isolated AIOps tasks and static datasets towards a dynamic AgentOps paradigm where AI agents can autonomously manage the entire incident lifecycle in complex, real-world cloud environments. The framework aims to bridge the gap between LLM capabilities and practical IT operations by offering a structured way for agents to perceive, reason, act, and learn within a controlled, yet realistic, cloud simulation.
Key principles include:
- Holism: Evaluating agents across the complete incident management lifecycle (detection, localization, RCA, mitigation).
- Interactivity: Enabling agents to dynamically interact with the cloud environment, take actions, and receive real-time feedback.
- Realism: Simulating complex microservice architectures, injecting diverse symptomatic and functional faults, and generating realistic workloads.
- Standardization: Providing a unified Agent-Cloud Interface (ACI) to simplify agent development and ensure consistent evaluation.
- Extensibility: Allowing users to easily define new problems, integrate new services, and add different types of faults.
- Observability: Collecting comprehensive telemetry data (logs, metrics, traces) to facilitate agent reasoning and performance analysis.
4.2. Core Methodology In-depth (Layer by Layer)
AIOPSLAB is designed as a modular framework with an Orchestrator at its core, coordinating interactions between various components. The overall architecture is depicted in Figure 2 of the original paper.
The following figure (Figure 2 from the original paper) provides an overview of AIOPSLAB's architecture:
Figure 2. Overview of AIOPsLAB. The Orchestrator coordinates interactions between various system components and serves as the Agent-Cloud-Interface (ACI). Agents engage with the Orchestrator to solve tasks, receiving a problem description, instructions, and relevant APIs. The Orchestrator generates diverse problems using the Workload and Fault Generators, injecting these into applications it can deploy. The deployed service has observability, providing telemetry such as metrics, traces, and logs. Agents act via the Orchestrator, which executes them and updates the service's state. The Orchestrator evaluates the final solution using predefined metrics for the task.
Here's a breakdown of its components and workflow:
4.2.1. Problem Definition
To support a wide range of evaluation scenarios, AIOPSLAB formalizes an AIOps problem as a tuple:

$ \langle T, C, S \rangle $

Where:
- $T$: Represents a task. This defines the specific AIOps operation to be performed.
- $C$: Represents a context. This provides the environment and information related to the problem.
- $S$: Represents the expected solution (oracle). This is used to evaluate the agent's performance.

The task is categorized into four types, reflecting the stages of the incident management lifecycle, with increasing complexity:
- Detection: Identifying the presence of unusual behavior or faults.
- Localization: Pinpointing the exact source of a fault (e.g., a specific microservice or pod).
- Root Cause Analysis (RCA): Determining the underlying cause of the fault (e.g., misconfiguration, software bug).
- Mitigation: Applying effective solutions to recover the environment from the fault.

Each task type has associated success criteria and evaluation metrics (e.g., Time-to-Detect (TTD) for detection).

The context is further formalized as:

$ C = \langle E, I \rangle $

Where:
- $E$: The operational environment in which the problem occurs. This includes the cloud service, the fault model, and the workload model used to generate the problem; this information is not directly shared with the agent.
- $I$: The problem information shared directly with the agent. This comprises service descriptions, task descriptions, documentation about available APIs, and indirect information (logs, metrics, traces) that the agent can query at runtime.

The solution $S$ is the expected outcome, typically problem- and task-specific. For mitigation tasks, AIOPSLAB evaluates the overall system state (e.g., all services running) rather than just the targeted resource, accounting for potential side effects of mitigation.
Example 2.1: Problem Definition The paper provides an example of defining a localization problem:
Example 2.1. Consider the problem of localizing a Kubernetes target port misconfiguration in a social network application. AIOPSLAB makes it easy to define this problem in just a few lines by extending the localization task interface:

```python
class K8sTargetPortMisconfigLocalization(LocalizationTask):
    def __init__(self):
        self.app = SocialNetwork()
        self.ans = "user-service"            # ground-truth faulty service

    def start_workload(self):
        wrk = Wrk(rate=100, duration=10)     # 100 requests/s for 10 s
        wrk.start_workload(url=self.app.frontend_url)

    def inject_fault(self):
        inj = FaultInjector(self.app)        # injector initialization is garbled in the extracted text
        inj.inject([self.ans], "misconfig_k8s")

    def eval(self, soln, trace, duration):
        res = {}
        res["TTL"] = duration                # time to complete the localization task
        res["success"] = is_exact_match(soln, self.ans)  # comparison helper; name garbled in the extracted text
        return res
```
Explanation of the example:
- __init__(self): Initializes the problem. self.app = SocialNetwork() sets up the SocialNetwork microservice application as the target environment, and self.ans = "user-service" records the ground-truth solution, indicating that the fault is expected to be in the "user-service".
- start_workload(self): Defines how to generate traffic/load. Wrk(rate=100, duration=10) initializes the wrk tool (a common HTTP benchmarking tool) to generate a workload at a rate of 100 requests per second for 10 seconds, and wrk.start_workload(url=self.app.frontend_url) directs this workload at the frontend URL of the SocialNetwork application.
- inject_fault(self): Defines how to inject the fault. A fault injector is initialized (the exact injector class is garbled in the extracted text) and inj.inject([self.ans], "misconfig_k8s") injects a Kubernetes misconfiguration into the service identified by self.ans ("user-service").
- eval(self, soln, trace, duration): Defines how to evaluate the agent's proposed solution. It records the time taken to complete the task and checks whether the agent's solution (soln) matches the expected answer (self.ans), using an internal helper function for the comparison, before returning the results.

In this example, the task is fault localization, the expected solution is "user-service", and the context includes the SocialNetwork application, a misconfig_k8s fault, and a standard wrk workload.
-
4.2.2. Orchestrator
The Orchestrator is the central component of AIOPSLAB, enforcing separation of concerns between the AI agent and the cloud service. It provides robust interfaces for integration and extension.
4.2.2.1. Agent-Cloud Interface (ACI)
The ACI is a critical part of the Orchestrator, defining how an AI agent interacts with the cloud environment. It specifies:
-
The set of valid actions available to the agent.
-
How the service's state (observations) is conveyed back to the agent after its actions.
The ACI abstracts the complexity of the cloud, offering a concise, documented list of APIs. Examples of default APIs provided by AIOPSLAB include:
- get_logs(ns: str) -> str: Fetches logs from a specified Kubernetes namespace (ns).
- get_metrics(ns: str) -> str: Fetches metrics from a specified Kubernetes namespace (ns).
- get_traces(ns: str, duration: int = 5) -> str: Fetches trace data for a specified namespace and duration.
- exec_shell(command: str) -> str: Executes shell commands (subject to security policy filters).

Upon problem initialization, the Orchestrator automatically extracts documentation from these APIs and provides it as part of the context to the agent. Agents can specify various actions (e.g., scaling, redeploying, patching) via the Orchestrator's privileged access. The Orchestrator then provides high-quality feedback (outputs, error messages, tracebacks) on the service's state.
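The paper does not show how this documentation extraction is implemented. As a minimal sketch, assuming the APIs are methods on an actions class like TaskActions (shown in Example 2.2 below) and that their docstrings serve as the documentation, it could be done with Python's standard inspect module:

```python
import inspect

def collect_api_docs(actions_cls) -> str:
    """Build an API reference string from the public callables of an actions class,
    using each function's signature and docstring."""
    docs = []
    for name, fn in inspect.getmembers(actions_cls, callable):
        if name.startswith("_"):
            continue  # skip private/internal helpers
        sig = inspect.signature(fn)
        doc = inspect.getdoc(fn) or "No documentation available."
        docs.append(f"{name}{sig}\n{doc}")
    return "\n\n".join(docs)

# Hypothetical usage: the resulting string could be handed to the agent as part of its context.
# apis = collect_api_docs(TaskActions)
```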
Example 2.2: ACI Definition (get_traces)
The paper illustrates how an ACI API is defined:
```python
from datetime import datetime, timedelta

class TaskActions:
    def get_traces(ns: str, duration: int = 5) -> str:
        """Captures service trace data from Jaeger.

        Args:
            ns (str): The K8S namespace.
            duration (int): Duration to collect traces.

        Returns:
            str: Path to the directory where the traces are saved.
        """
        trace_api = TraceAPI(ns)
        end_t = datetime.now()
        start_t = end_t - timedelta(minutes=duration)  # duration interpreted here as minutes
        traces = trace_api.extract_traces(start_t, end_t)
        return trace_api.save_traces(traces)
```
Explanation of the example:
- class TaskActions: Defines a class for available actions.
- def get_traces(ns: str, duration: int = 5) -> str: Declares the get_traces function, taking a Kubernetes namespace (ns, as a string) and a duration (integer, default 5) as input, and returning a string (the path to the saved traces).
- The docstring documents the API's purpose, arguments, and return value ("Captures service trace data from Jaeger"); this is the text the Orchestrator exposes to agents.
- trace_api = TraceAPI(ns): Initializes a TraceAPI object for the specified namespace.
- end_t = datetime.now(): Gets the current timestamp.
- start_t = end_t - timedelta(minutes=duration): Calculates the start timestamp for trace collection based on the duration.
- traces = trace_api.extract_traces(start_t, end_t): Uses the TraceAPI to extract traces within the defined time window.
- return trace_api.save_traces(traces): Saves the extracted traces and returns the path to the saved files.
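As a quick usage sketch, an agent or test harness could call this API directly; the namespace value below is a hypothetical placeholder, not one defined by the paper:

```python
# Hypothetical usage of the ACI API defined above: collect 10 minutes of traces
# for a namespace named "social-network" and print where they were saved.
trace_dir = TaskActions.get_traces(ns="social-network", duration=10)
print(f"Traces saved under: {trace_dir}")
```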
4.2.2.2. Session Interface
The Orchestrator manages the lifecycle of the agent and the service through a session-based system. A Session is created for each instance of an agent solving a problem. Agents must implement a get_action method with the signature async def get_action(state: str) -> str, which takes the service's state as input and returns the agent's next action.
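Conceptually, the session pairs the agent's get_action with the environment's execution of that action. The sketch below is a simplified illustration of such a loop, not AIOPSLAB's actual implementation; names like initial_state, execute_action, is_solved, and evaluate are assumptions for illustration:

```python
import asyncio

async def run_session(agent, env, max_steps: int = 10):
    """Minimal sketch of a session: poll the agent for actions until the
    problem is solved or the step budget is exhausted."""
    state = env.initial_state()                 # problem description, instructions, APIs
    for _ in range(max_steps):
        action = await agent.get_action(state)  # agent decides its next action
        state = env.execute_action(action)      # environment executes it and returns a new observation
        if env.is_solved():
            break
    return env.evaluate()                       # compare the final state/solution against the oracle
```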
Example 2.3: Agent Onboarding The paper illustrates how an agent can be onboarded:
```python
import asyncio
from aiopslab import Orchestrator

class Agent:
    def __init__(self, prob, instructs, apis):
        self.prompt = self.get_prompt(prob, instructs, apis)
        self.llm = GPT4()

    async def get_action(self, state: str) -> str:
        return self.llm.generate(self.prompt + state)

# initialize the orchestrator
orch = Orchestrator()
pid = "misconfig_app_hotel_res-mitigation-1"
prob_desc, instructs, apis = orch.init_problem(pid)

# register and evaluate the agent
agent = Agent(prob_desc, instructs, apis)
orch.register_agent(agent, name="myAgent")
asyncio.run(orch.start_problem(max_steps=10))
```
Explanation of the example:
- from aiopslab import Orchestrator: Imports the Orchestrator class.
- class Agent: Defines a generic agent class.
  - __init__(self, prob, instructs, apis): Initializes the agent. self.prompt = self.get_prompt(prob, instructs, apis) constructs the initial prompt for the LLM from the problem description, instructions, and available APIs, and self.llm = GPT4() instantiates an LLM (e.g., GPT-4).
  - async def get_action(self, state: str) -> str: The crucial method where the agent decides its next action. return self.llm.generate(self.prompt + state) generates the next action by feeding the LLM the current prompt (including problem context) and the current state of the environment.
- orch = Orchestrator(): Initializes the AIOPSLAB Orchestrator.
- pid = "misconfig_app_hotel_res-mitigation-1": Defines a problem ID (a mitigation problem on the HotelReservation application).
- prob_desc, instructs, apis = orch.init_problem(pid): Initializes a specific problem instance, retrieving its description, instructions, and available APIs.
- agent = Agent(prob_desc, instructs, apis): Creates an instance of the custom agent with the problem context.
- orch.register_agent(agent, name="myAgent"): Registers the agent with the Orchestrator.
- asyncio.run(orch.start_problem(max_steps=10)): Starts the evaluation of the problem, allowing the agent to take up to 10 steps. The Orchestrator polls the agent's get_action method for its next action.
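Because the only contract is the get_action signature, more sophisticated agents can be onboarded the same way. The sketch below shows how a ReAct-style agent might be wrapped behind the same interface, maintaining a scratchpad of thoughts, actions, and observations; the prompt format and the parse_action helper are assumptions for illustration, not the paper's implementation:

```python
class ReActStyleAgent:
    """Hedged sketch: interleaves reasoning ("Thought:") with actions,
    appending each observation to a running scratchpad."""

    def __init__(self, prob, instructs, apis):
        self.system_prompt = f"{prob}\n{instructs}\nAvailable APIs:\n{apis}"
        self.scratchpad = ""              # accumulated Thought/Action/Observation history
        self.llm = GPT4()                 # same backend as the basic agent above

    async def get_action(self, state: str) -> str:
        # Record the latest observation from the environment.
        self.scratchpad += f"\nObservation: {state}"
        # Ask the LLM for the next Thought + Action given the full history.
        completion = self.llm.generate(self.system_prompt + self.scratchpad + "\nThought:")
        self.scratchpad += f"\nThought:{completion}"
        # Return only the action part (e.g., an API call) to the Orchestrator;
        # parse_action is a hypothetical helper that extracts the "Action: ..." text.
        return parse_action(completion)
```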
4.2.2.3. Other Interfaces
- Problem Initializers: The Orchestrator deploys cloud services for each problem using infrastructure-as-code tools like Helm and the Kubernetes APIs. It then interfaces with a Workload Generator and a Fault Generator.
  - Workload Generator: Currently uses wrk2 to simulate realistic traffic with various policies and industry workload replays.
  - Fault Generator: Uses a custom fault library integrated with ChaosMesh to inject diverse, fine-grained, and parametric faults across system layers (application, virtualization) that model underlying root causes.
- Problem Evaluators: The Orchestrator compares the agent's solutions against predefined success criteria and metrics for each task (e.g., TTD for detection; number of steps/tokens for LLM agents). It also supports optional qualitative evaluation using LLMs-as-Judges (e.g., Zheng et al., 2024) to assess agent reasoning. All agent trajectories and system states are logged for detailed analysis.
4.2.3. Cloud Services
AIOPSLAB utilizes live microservice applications as its cloud environments (② in Figure 2). It is integrated with DeathStarBench (Gan et al., 2019), specifically using:
- SocialNetwork: A complex application with 28 microservices (including Memcached, MongoDB, Redis) implementing social networking features.
- HotelReservation: An application implemented in Go with gRPC, supporting hotel recommendation and reservation services.
4.2.4. Task-oriented Fault Library
The fault library is central to creating realistic and challenging problems for AIOps agents.
4.2.4.1. Task Taxonomy
The paper presents a task-level taxonomy (Table 1) categorizing AIOps tasks by increasing complexity:
The following are the results from Table 1 of the original paper:
| Level | Task (# sub tasks) | Evaluation Focus |
| 1 | Detection (1) | Can the approach accurately detect anomalies or deviations? |
| 2 | Localization (1) | Can the approach pinpoint a fault's exact source (e.g., microservice)? |
| 3 | Root Cause Analysis (RCA) (2) | Can the approach determine the underlying cause of the fault? |
| 4 | Mitigation (1) | Can the approach give effective solutions to recover the environment? |
- Level 1: Detection: Simplest, focused on identifying unusual behavior (e.g., a malfunctioning Kubernetes pod).
- Level 2: Localization: Identifying the exact source of a fault (e.g., a specific microservice).
- Level 3: Root Cause Analysis (RCA): More complex, requiring agents to determine the underlying cause. This level has two sub-tasks: identifying the affected system layer and the fault type.
- Level 4: Mitigation: Most complex, requiring agents to apply corrective actions to restore the system.
4.2.4.2. Symptomatic Faults
Symptomatic faults (e.g., performance degradation, crash failures) manifest as observable symptoms like increased latency or service outages. They are used to construct Level 1 (detection) and Level 2 (localization) tasks. These faults indicate a problem exists but don't inherently reveal deep root causes. AIOPSLAB integrates ChaosMesh (ChaosMesh Authors, 2022) for injecting these.
The following figure (Figure 3 from the original paper) categorizes faults:
Figure 3. Fault categories to instantiate problems in AIOPSLAB.
4.2.4.3. Functional Faults
Most traditional fault injection tools focus on system symptoms. Functional faults, however, model underlying, fine-grained root causes like misconfigurations or software bugs. These faults are crucial for Level 3 (RCA) and Level 4 (mitigation) tasks, as they challenge agents to not only detect and localize but also diagnose the specific cause and apply correct mitigation strategies.
Example: Revoke Authentication Fault (Figure 4)
The paper illustrates a functional fault: revoking admin authentication for a MongoDB database used by a geographic microservice (Mongodb-geo). This causes errors in the Geo service that relies on it.
The following figure (Figure 4 from the original paper) shows an example of a revoke authentication fault:
Figure 4. Revoke authentication fault example. Injection happens at Mongodb-geo service, while Geo service will be abnormal and generate error logs.
Example 2.4: Application-level Fault Injector
The structure for injecting an application-level revoke authentication fault is shown:
```python
from aiopslab.generators.fault.base import FaultInjector
from aiopslab.service.apps.hoteles import HotelReservation

class ApplicationFaultInjector(FaultInjector):
    def inject_revoke_auth(self, microservices: list[str]):
        """Revoke MongoDB admin privileges."""
        ...
```
Explanation of the example:
- from aiopslab.generators.fault.base import FaultInjector: Imports the base class for fault injectors.
- from aiopslab.service.apps.hoteles import HotelReservation: Imports the HotelReservation application service definition.
- class ApplicationFaultInjector(FaultInjector): Defines a custom ApplicationFaultInjector inheriting from the base class.
- def inject_revoke_auth(self, microservices: list[str]): A method to inject the revoke-authentication fault, targeting a list of microservices.
- The docstring ("Revoke MongoDB admin privileges.") documents the fault's effect.
Users can define problems by selecting existing faults, specifying target services, or even creating custom faults. AIOPSLAB provides injection functions and corresponding mitigation mechanisms for recovery.
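Since each injection function has a corresponding mitigation mechanism, a recovery hook for the revoke-authentication fault might look roughly like the sketch below; the method name and the restore step are assumptions for illustration, not the framework's documented API:

```python
class ApplicationFaultInjector(FaultInjector):
    def inject_revoke_auth(self, microservices: list[str]):
        """Revoke MongoDB admin privileges for the given microservices."""
        ...

    def recover_revoke_auth(self, microservices: list[str]):
        """Hypothetical counterpart: re-grant the admin role so dependent
        services (e.g., Geo) can reconnect to MongoDB."""
        for svc in microservices:
            # Placeholder: restore credentials/roles for the MongoDB instance
            # backing `svc`, e.g., by re-running the user-creation script.
            self.restore_admin_role(svc)   # hypothetical helper
```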
4.2.5. Observability
AIOPSLAB includes an extensible observability layer to collect comprehensive telemetry data (③ in Figure 2):

- Traces: From Jaeger (Jaeger Authors, 2024), detailing end-to-end request paths in distributed systems.
- Logs: Application logs retrieved by Kubectl, or formatted and recorded by Filebeat (Elasticsearch, 2024b) and Logstash (Elasticsearch, 2024a).
- System Metrics: Monitored by Prometheus (Prometheus Authors, 2024).

This data is collected during agent interaction and can also be exported offline for evaluating traditional AIOps algorithms. The framework is designed to capture other information such as the codebase, configuration, and cluster details, and to expose low-level system information (e.g., syscall logs) via its interface.
5. Experimental Setup
5.1. Datasets
The experimental evaluation utilizes a benchmark suite constructed using AIOPSLAB, consisting of 48 problems. These problems are instantiated by injecting various faults into two microservice applications from DeathStarBench (Gan et al., 2019):

- HotelReservation: An application for hotel booking.
- SocialNetwork: A complex social media application.

The choice of these microservice applications provides a realistic, distributed cloud environment, critical for evaluating AIOps agents.
The problems are generated using the faults listed in Table 2. These faults cover both symptomatic and functional types, and are designed to challenge agents across all four task levels (Detection, Localization, RCA, Mitigation).
The following are the results from Table 2 of the original paper:
| No. | Fault | Application | Task Level | Category | Ext. | #Problems | Description |
| 1 | AuthenticationMissing | HotelReservation | 1,2,3,4 | Functional, Virtualization | ① | 4 | Missing authentication credentials cause access denial to MongoDB. |
| 2 | TargetPortMisconfig | SocialNetwork | 1,2,3,4 | Functional, Virtualization | ● | 12 | The service cannot connect to the specified port due to misconfiguration. |
| 3 | RevokeAuth | HotelReservation | 1,2,3,4 | Functional, Application | ① | 8 | Revoked authentication causes database connection failure. |
| 4 | UserUnregistered | HotelReservation | 1,2,3,4 | Functional, Application | ① | 8 | The database service has access failures after the user was unregistered. |
| 5 | BuggyAppImage | HotelReservation | 1,2,3,4 | Functional, Application | ○ | 4 | A connection code bug in the application image causes access issues. |
| 6 | ScalePod | SocialNetwork | 1,2,3,4 | Functional, Virtualization | ● | 4 | An incorrect scaling operation reduces the number of pods for a service to zero. |
| 7 | AssignNonExistentNode | SocialNetwork | 1,2,3,4 | Functional, Virtualization | ● | 4 | A pod is stuck in a pending/failure status due to assignment to a non-existent node. |
| 8 | NetworkLoss | HotelReservation | 1,2 | Symptomatic | ● | 2 | Network loss causes communication failures for a specific service. |
| 9 | PodFailure | HotelReservation | 1,2 | Symptomatic | ● | 2 | Service interruption due to a pod failure. |
| 10 | Noop | HotelReservation | 1 | - | ● | 2 | No faults injected into the system. |
Note on Extensibility (Ext. column): ① indicates the fault can be easily used to construct other problems; ● denotes there is some manual effort needed to create new problems; while ○ means the fault is specific to some problems and cannot be applied to create other problems.
Example of a data sample: A problem in AIOPSLAB isn't a static data sample, but an interactive scenario. For example, for "TargetPortMisconfig" (Fault 2) on SocialNetwork's "user-service":

- The system would simulate a Kubernetes misconfiguration for the user-service.
- A workload would be generated against the SocialNetwork frontend.
- Telemetry data (logs, metrics, traces) reflecting the misconfiguration and its impact (e.g., failed requests, error logs from user-service) would be observable by the agent.
- The agent's goal might be to localize the fault to "user-service" (Localization task) or to propose a fix (Mitigation task).

These scenarios were chosen because they represent realistic cloud incidents in complex microservice environments, allowing for a comprehensive evaluation of AIOps agents' diagnostic and mitigation abilities in dynamic settings.
5.2. Evaluation Metrics
AIOPSLAB employs several metrics to evaluate the performance of AIOps agents:

- Correctness:
  - Conceptual Definition: Measures the accuracy of the agent's response, assessing whether it successfully detects, localizes, analyzes, or resolves problems as expected. For localization tasks, correctness can be evaluated based on the top-ranked predictions.
  - Mathematical Formula (Accuracy): $ \mathrm{Accuracy} = \frac{\mathrm{Number~of~Correct~Predictions}}{\mathrm{Total~Number~of~Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: the count of instances where the agent's output matches the ground-truth solution.
    - Total Number of Predictions: the total number of problems or tasks evaluated.
  - For localization tasks, accuracy is often reported as (see the computation sketch after this list):
    - Acc.@1 (Accuracy at 1): the percentage of times the agent's top prediction is correct.
    - Acc.@3 (Accuracy at 3): the percentage of times the correct answer is among the agent's top 3 predictions.
- Time/Steps:
  - Conceptual Definition: Evaluates the efficiency of the AIOps agent for each task type.
  - Metrics:
    - Time-to-Detect (TTD): the time elapsed from the occurrence of a fault to its detection by the agent.
    - Time-to-Mitigate (TTM): the time taken from the detection of a fault to its complete mitigation by the agent.
    - Number of Steps: the count of interactions (actions) an agent takes with AIOPSLAB to solve a problem. This is distinct from the number of requests sent to the backend LLM.
  - No specific mathematical formulas are provided in the paper for these, as they are direct measurements.
- Cost:
  - Conceptual Definition: Measures the computational expense associated with agent operation, specifically for LLM-powered agents.
  - Metric:
    - Tokens: the total number of tokens (input tokens fed to the LLM plus output tokens generated by the LLM) produced by the agents/environment, a proxy for the computational cost of LLM usage.
  - No specific mathematical formula is provided, as this is a direct count.
5.3. Baselines
The paper evaluates two categories of agents/algorithms:

- LLM-based Agents: These are the primary focus, leveraging LLMs for reasoning and interaction.
  - GPT-4-w-SHELL: An LLM (specifically GPT-4-turbo; Achiam et al., 2023) with access to a secure shell for executing commands. This serves as a strong baseline, representing a powerful, general-purpose LLM with basic tool-use capabilities.
  - GPT-3.5-w-SHELL: An LLM (specifically GPT-3.5-turbo), also with secure shell access, serving as a more cost-effective and faster, but potentially less capable, baseline compared to GPT-4-w-SHELL.
  - REACT (Reasoning and Acting) (Yao et al., 2023): An LLM-based agent framework that combines chain-of-thought reasoning (Wei et al., 2022b) with acting in an interleaved manner. It reasons about a problem, plans an action, executes it, and then reasons again based on the observation.
  - FLASH (Workflow Automation Agent) (Zhang et al., 2024b): An AIOps-specific LLM agent that employs a workflow automation system, monitors execution status, decomposes complex instructions, and incorporates hindsight generation to learn from past interactions. The paper notes that a simplified version was developed for this evaluation, as the full version was not publicly available.
- Non-LLM AIOps Algorithms: These represent traditional AIOps methods specialized for certain tasks, using multimodal telemetry data as input. They are included to show the comparative advantage (or disadvantage) of LLM-based agents.
  - For Detection:
    - MKSMC (Multivariate K-sigma score using Monte Carlo) (Çetin and Tasgin, 2020): An anomaly detection method.
  - For Localization:
    - RMLAD (Wang et al., 2020): Likely an anomaly detection or localization algorithm.
    - PDiagnose (Hou et al., 2021): A method for diagnosing performance issues in microservices using heterogeneous data sources.

These baselines were chosen to cover a spectrum from general-purpose, powerful LLMs (with basic tool access) to more specialized LLM agents (REACT, FLASH) and traditional, task-specific AIOps algorithms, allowing for a comprehensive demonstration of AIOPSLAB's evaluation capabilities.
6. Results & Analysis
6.1. Core Results Analysis
The evaluation of AIOps agents on the AIOPSLAB benchmark reveals key insights into their capabilities and limitations across different AIOps tasks. The overall performance is summarized in Table 3, while task-specific results are detailed in Table 4.
The following are the results from Table 3 of the original paper:
| Agent | LoC | Time (s) | # Steps | Tokens | Acc. |
| GPT-4-w-SHELL | 41 | 28.61 | 6.44 | 6,394.5 | 49.15% |
| GPT-3.5-w-SHELL | 41 | 12.44 | 14.70 | 2,557.95 | 15.25% |
| REACT | 49 | 43.79 | 11.50 | 16,941.46 | 55.93% |
| FLASH | 60 | 99.64 | 8.48 | 6,484.25 | 59.32% |
Table 3. Overall performance of different agents. We show the lines of code (LoC) to register the agent in AIOPSLAB, average running time in seconds, average number of steps taken, average tokens used, and accuracy across all problems.
Overall Performance (Table 3):

- Accuracy: FLASH achieves the highest overall accuracy (59.32%), indicating its strength in problem-solving across various tasks. REACT follows closely (55.93%), then GPT-4-w-SHELL (49.15%). GPT-3.5-w-SHELL performs the poorest (15.25%).
- Time (s): GPT-3.5-w-SHELL is the fastest on average (12.44 s), likely due to its lower complexity and tendency to fail quickly. FLASH is the slowest (99.64 s), suggesting more extensive reasoning or interaction.
- # Steps: GPT-3.5-w-SHELL takes the most steps (14.70), often implying inefficient or repetitive actions. GPT-4-w-SHELL takes the fewest (6.44). FLASH and REACT are moderate.
- Tokens: REACT consumes the most tokens (16,941.46), reflecting its verbose chain-of-thought reasoning. GPT-3.5-w-SHELL consumes the least (2,557.95), but also has the lowest accuracy.

These results suggest a trade-off between speed/cost and accuracy, with more sophisticated agents like FLASH and REACT achieving better results at higher computational expense or time.
Task-Specific Performance (Table 4): The following are the results from Table 4 of the original paper:
Detection:

| Agent | Accuracy | Time (s) | # Steps | Input | Output |
| GPT-4-w-SHELL | 69.23% | 7.08 | 3.85 | 5,492 | 132 |
| GPT-3.5-w-SHELL | 23.07% | 11.05 | 13.60 | 1,940.44 | 385.56 |
| REACT | 76.92% | 39.00 | 11.46 | 15,608.08 | 933.15 |
| FLASH | 100% | 78.27 | 6.77 | 12,869.08 | 125.69 |
| MKSMC | 15.38% | 1.00 | N/A | N/A | N/A |

Localization:

| Agent | Acc.@3 | Acc.@1 | Time (s) | # Steps | Input | Output |
| GPT-4-w-SHELL | 61.54% | 61.54% | 7.04 | 4.23 | 4,588.07 | 133.23 |
| GPT-3.5-w-SHELL | 30.77% | 30.77% | 6.26 | 11.92 | 1,784.23 | 217.08 |
| REACT | 69.23% | 53.85% | 38.65 | 11.08 | 4,760.77 | 880.92 |
| FLASH | 61.54% | 46.15% | 56.60 | 5.77 | 1,875.08 | 123.31 |
| PDiagnose | 15.38% | 15.38% | 1.02 | N/A | N/A | N/A |
| RMLAD | 7.69% | 7.69% | 1.98 | N/A | N/A | N/A |

Root Cause Analysis (RCA):

| Agent | Accuracy | Time (s) | # Steps | Input | Output |
| GPT-4-w-SHELL | 40.90% | 8.68 | 4.81 | 4,297.91 | 176.18 |
| GPT-3.5-w-SHELL | 9.09% | 10.06 | 14.00 | 1,495.55 | 406.27 |
| REACT | 45.45% | 32.16 | 8.00 | 16,276.09 | 757.27 |
| FLASH | 36.36% | 59.00 | 6.09 | 1,193.90 | 152.45 |

Mitigation:

| Agent | Accuracy | Time (s) | # Steps | Input | Output |
| GPT-4-w-SHELL | 27.27% | 99.47 | 13.72 | 10,142.55 | 1,060.00 |
| GPT-3.5-w-SHELL | 0% | 23.78 | 20.00 | 3,178.33 | 967.71 |
| REACT | 36.36% | 67.18 | 15.54 | 29,211.90 | 1,464.90 |
| FLASH | 54.55% | 216.41 | 16.09 | 8,469.00 | 760.36 |
Table 4. Agent performance by task. This table summarizes the performance of different agents across various tasks, including detection, localization, RCA, and mitigation. Acc. stands for accuracy. Input/Output represents the number of tokens given to and produced by the agent, respectively.
a) Detection Task:

- FLASH achieves 100% accuracy, significantly outperforming all other LLM agents and traditional methods. REACT (76.92%) and GPT-4-w-SHELL (69.23%) also perform well.
- The traditional MKSMC method has very low accuracy (15.38%). This confirms that LLM agents are strong at simple detection.

b) Localization Task:

- REACT shows the best Acc.@3 (69.23%), indicating it often includes the correct answer in its top 3 predictions. GPT-4-w-SHELL performs best in Acc.@1 (61.54%), meaning its top prediction is more often correct.
- Traditional methods PDiagnose and RMLAD (15.38% and 7.69%, respectively) are notably poor, highlighting the advantage of LLM agents in this interactive task.

c) RCA (Root Cause Analysis) Task:

- This task proves more challenging. REACT leads with 45.45% accuracy, followed by GPT-4-w-SHELL (40.90%).
- FLASH surprisingly underperforms here (36.36%), while GPT-3.5-w-SHELL is very weak (9.09%).
- RCA requires deeper understanding and reasoning, where current LLM agents still have significant room for improvement.

d) Mitigation Task:

- This is the most challenging task. FLASH achieves the highest accuracy (54.55%), but with the longest average time (216.41 s).
- REACT is next (36.36%). GPT-4-w-SHELL has low accuracy (27.27%), and GPT-3.5-w-SHELL completely fails (0%) to mitigate any faults.
- The high time and token consumption for mitigation indicate the complexity of interacting with the environment to fix issues.
Overall Observations:
- LLM agents vs. Traditional AIOps: For detection and localization, LLM agents (especially FLASH, REACT, and GPT-4-w-SHELL) significantly outperform traditional non-LLM AIOps methods, demonstrating their advantage in interactive problem-solving.
- Problem Difficulty: The RCA and mitigation tasks are substantially harder for all agents, highlighting the gap between current LLM capabilities and the full vision of AgentOps. No agent consistently achieves high accuracy across all task categories.
- Cost-Performance Trade-offs: While GPT-3.5-w-SHELL is fast and cheap, its accuracy is unacceptably low. More capable agents like FLASH and REACT are slower and more expensive but deliver better results.
6.2. Ablation Studies / Parameter Analysis
The paper includes an analysis of the influence of the step limit on agent performance, which can be seen as a form of parameter analysis.
The following figure (Figure 5 from the original paper) shows agent performance vs. number of steps taken:
Figure 5. Agent performance vs. number of steps taken.
- Impact of Step Limit: The maximum number of allowed steps significantly affects agent performance.
  - REACT and FLASH show improved accuracy as the number of steps increases, with FLASH reaching its peak accuracy of 59.32% at 20 steps. This indicates that these agents can leverage more interactions with the environment to refine their understanding and actions.
  - GPT-4-w-SHELL also shows a general upward trend, but with less pronounced gains after around 10-15 steps.
  - For GPT-3.5-TURBO, increasing the step limit beyond 5 does not lead to better performance; instead, it primarily increases token consumption without improving accuracy. This suggests GPT-3.5-TURBO may lack the deeper reasoning or effective self-correction mechanisms needed to benefit from more interaction steps on AIOps problems.
- Self-repair Saturation: The plateauing of accuracy after a certain number of steps for some agents suggests that self-repair with environment feedback can saturate quickly in AIOps problems. This contrasts with development tasks (like code generation), where continuous feedback (linters, type checkers, tests) allows for more sustained improvement. This implies a need for:
  - better task decomposition and planning for AIOps problems;
  - improved feedback mechanisms for intermediate steps;
  - solutions that go beyond simple environment feedback and self-repair.
6.3. Agent Behavior: The Good, the Bad and the Gaps
The paper further analyzes specific behaviors, including API usage patterns and common failure modes.
The following figure (Figure 6 from the original paper) shows the total percentage of actions taken by different agents:
Figure 6. Total percentage of actions taken by different agents.
The following are the results from Table 5 of the original paper:
| Agent | Kubectl Get | Kubectl Describe | Kubectl Exec | Cat | Other |
| GPT-4-w-SHELL | 21.84% | 2.06% | 0.14% | 1.92% | 0.77% |
| GPT-3.5-w-SHELL | 27.22% | 1.52% | 0.19% | 3.62% | 0.95% |
| REACT | 19.70% | 1.49% | 0.00% | 1.39% | 0.14% |
| FLASH | 27.35% | 1.18% | 0.00% | 0.00% | 0.00% |
Table 5. Occurrences of system commands.
Telemetry API Usage (Figure 6): get_logs is the most frequently used API across all agents, followed by get_metrics. get_traces is used less frequently, and FLASH notably does not use get_traces at all. This suggests agents prioritize log and metric data, possibly due to their perceived directness or easier interpretability for LLMs.
System Command Usage (Table 5): kubectl get is the most common shell command across agents, indicating a tendency to query Kubernetes resources for information. cat is also used, suggesting agents sometimes view raw log/metric files. kubectl describe is used less, and kubectl exec (for executing commands within a pod) is very rare.
6.3.1. Wasting steps on unnecessary actions
- Agents often waste steps by repeatedly calling the same API, generating non-existent APIs, or engaging in excessive multi-agent communication.
- GPT-3.5-w-SHELL is particularly prone to generating incorrect API commands in loops, leading to repeated execution errors. This indicates a lack of robust error handling or self-correction.
- Over-reliance on telemetry APIs without careful analysis can overwhelm the LLM's input context window and lead to token exhaustion, adding noise rather than useful information. This points to a need for more refined telemetry data processing and filtering mechanisms.
6.3.2. Invalid API usage
- Agents struggle with improper formatting of API calls. GPT-3.5-w-SHELL frequently generates syntactically incorrect commands or malformed parameters, often apologizing and then repeating the same mistake.
- REACT occasionally generates incorrect API commands but demonstrates better self-correction, reasoning through errors and adjusting its commands in subsequent steps. The paper provides an example where REACT uses an incorrect parameter for get_logs, receives an error, and then correctly uses exec_shell to list services and find the correct name. This highlights the importance of robust reasoning-and-acting capabilities (a sketch of this recovery pattern follows this list).
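The recovery pattern described for REACT can be read as an error-driven retry. The sketch below is a hypothetical rendering of that control flow; get_logs and exec_shell come from the paper's ACI, while the signatures and parsing logic are assumptions.

```python
# Hypothetical rendering of REACT's observed self-correction: a failed get_logs
# call triggers discovery of valid service names via exec_shell, then a retry.
def fetch_logs_with_recovery(aci, namespace: str, guessed_service: str) -> str:
    try:
        return aci.get_logs(namespace, guessed_service)
    except Exception as first_error:        # e.g. "service not found"
        # List the real services with the shell, mirroring the paper's example.
        listing = aci.exec_shell(f"kubectl get services -n {namespace}")
        names = [line.split()[0] for line in listing.splitlines()[1:] if line.strip()]
        for service in names:
            try:
                return aci.get_logs(namespace, service)
            except Exception:
                continue                    # try the next candidate service
        raise first_error                   # give up: re-raise the original error
```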
6.3.3. False positive detection issues
- In "no operation" (
Noop, Fault 10) problems where no faults were injected, onlyGPT-4-w-SHELLcorrectly identified the system as normal. - Other agents reported
false positives, misinterpreting normal system activities (e.g., standard workload generation) as faults. This is a critical issue forAIOps, asfalse positivescan lead to unnecessary alerts and wasted human effort.
6.4. Data Presentation (Tables)
The following are the results from Table 2 of the original paper:
| No. | Fault | Application | Task Level | Category | Ext. | #Problems | Description |
| 1 | AuthenticationMissing | HotelReservation | 1,2,3,4 | Functional (Virtualization) | ① | 4 | Missing authentication credentials cause access denial to MongoDB. |
| 2 | TargetPortMisconfig | SocialNetwork | 1,2,3,4 | Functional (Virtualization) | ● | 12 | The service cannot connect to the specified port due to misconfiguration. |
| 3 | RevokeAuth | HotelReservation | 1,2,3,4 | Functional (Application) | ① | 8 | Revoked authentication causes database connection failure. |
| 4 | UserUnregistered | HotelReservation | 1,2,3,4 | Functional (Application) | ① | 8 | The database service has access failures after the user was unregistered. |
| 5 | BuggyAppImage | HotelReservation | 1,2,3,4 | Functional (Application) | ○ | 4 | A connection code bug in the application image causes access issues. |
| 6 | ScalePod | SocialNetwork | 1,2,3,4 | Functional (Virtualization) | ● | 4 | An incorrect scaling operation sets the number of pods for a service to zero. |
| 7 | AssignNonExistentNode | SocialNetwork | 1,2,3,4 | Functional (Virtualization) | ● | 4 | A pod remains in a pending/failure status due to assignment to a non-existent node. |
| 8 | NetworkLoss | HotelReservation | 1,2 | Symptomatic | ● | 2 | Network loss causes communication failures for a specific service. |
| 9 | PodFailure | HotelReservation | 1,2 | Symptomatic | ● | 2 | Service interruption due to a pod failure. |
| 10 | Noop | HotelReservation | 1 | - | ● | 2 | No faults injected into the system. |
Note on Extensibility (Ext. column): ① indicates the fault can be easily used to construct other problems; ● denotes there is some manual effort needed to create new problems; while ○ means the fault is specific to some problems and cannot be applied to create other problems.
The following are the results from Table 3 of the original paper:
| Agent | LoC | Time (s) | # Steps | Tokens | Acc. |
| GPT-4-w-SHELL | 41 | 28.61 | 6.44 | 6,394.5 | 49.15% |
| GPT-3.5-w-SHELL | 41 | 12.44 | 14.70 | 2,557.95 | 15.25% |
| REACT | 49 | 43.79 | 11.50 | 16,941.46 | 55.93% |
| FLASH | 60 | 99.64 | 8.48 | 6,484.25 | 59.32% |
Table 3. Overall performance of different agents. We show the lines of code (LoC) to register the agent in AIOPSLAB, average running time in seconds, average number of steps taken, average tokens used, and accuracy across all problems.
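The LoC column counts the glue code needed to plug an agent into the framework. As a rough, assumed illustration of what such a registration shim might look like (class, method, and hook names are hypothetical; the released AIOPSLAB code defines the real interface):

```python
# Rough, assumed sketch of an agent registration shim for AIOPSLAB-style
# evaluation. All names here are illustrative, not the framework's actual API.
class ShellAgent:
    """Wraps an LLM so an orchestrator can query it for the next action."""

    def __init__(self, llm_client, system_prompt: str):
        self.llm = llm_client
        self.history = [{"role": "system", "content": system_prompt}]

    def get_action(self, observation: str) -> str:
        # Append the latest environment feedback and ask the LLM what to do next.
        self.history.append({"role": "user", "content": observation})
        reply = self.llm.complete(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply  # e.g. an exec_shell(...) or get_logs(...) call as text


# orchestrator.register_agent(ShellAgent(llm_client, system_prompt))  # hypothetical hook
```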
The following are the results from Table 4(a) (Detection Task) of the original paper:
| Agent | Accuracy | Time (s) | # Steps | Input Tokens | Output Tokens |
| GPT-4-w-SHELL | 69.23% | 7.08 | 3.85 | 5,492 | 132 |
| GPT-3.5-w-SHELL | 23.07% | 11.05 | 13.60 | 1,940.44 | 385.56 |
| REACT | 76.92% | 39.00 | 11.46 | 15,608.08 | 933.15 |
| FLASH | 100% | 78.27 | 6.77 | 12,869.08 | 125.69 |
| MKSMC | 15.38% | 1.00 | N/A | N/A | N/A |
The following are the results from Table 4(b) (Localization Task) of the original paper:
| Agent | Acc.@3 | Acc.@1 | Time (s) | # Steps | Input Tokens | Output Tokens |
| GPT-4-w-SHELL | 61.54% | 61.54% | 7.04 | 4.23 | 4,588.07 | 133.23 |
| GPT-3.5-w-SHELL | 30.77% | 30.77% | 6.26 | 11.92 | 1,784.23 | 217.08 |
| REACT | 69.23% | 53.85% | 38.65 | 11.08 | 4,760.77 | 880.92 |
| FLASH | 61.54% | 46.15% | 56.60 | 5.77 | 1,875.08 | 123.31 |
| DDDAGOSE | 15.38% | 15.38% | 1.02 | N/A | N/A | N/A |
| RMLAD | 7.69% | 7.69% | 1.98 | N/A | N/A | N/A |
The following are the results from Table 4(c) (RCA Task) of the original paper:
| Agent | Accuracy | Time (s) | # Steps | Input Tokens | Output Tokens |
| GPT-4-w-SHELL | 40.90% | 8.68 | 4.81 | 4,297.91 | 176.18 |
| GPT-3.5-w-SHELL | 9.09% | 10.06 | 14.00 | 1,495.55 | 406.27 |
| REACT | 45.45% | 32.16 | 8.00 | 16,276.09 | 757.27 |
| FLASH | 36.36% | 59.00 | 6.09 | 1,193.90 | 152.45 |
The following are the results from Table 4(d) (Mitigation Task) of the original paper:
| Agent | Accuracy | Time (s) | # Steps | Input Tokens | Output Tokens |
| GPT-4-w-SHELL | 27.27% | 99.47 | 13.72 | 10,142.55 | 1,060.00 |
| GPT-3.5-w-SHELL | 0% | 23.78 | 20.00 | 3,178.33 | 967.71 |
| REACT | 36.36% | 67.18 | 15.54 | 29,211.90 | 1,464.90 |
| FLASH | 54.55% | 216.41 | 16.09 | 8,469.00 | 760.36 |
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces AIOPSLAB, a novel and holistic framework for the design, development, and comprehensive evaluation of autonomous AI agents targeting IT operations in cloud environments, a paradigm termed AgentOps. The framework integrates key components: a fault injector, workload generator, cloud-agent orchestrator with an Agent-Cloud Interface (ACI), and telemetry observer. This setup allows for the simulation of realistic cloud incidents and enables AI agents to interact dynamically with the environment.
Through AIOPSLAB, the authors constructed a benchmark suite of 48 diverse problems spanning detection, localization, root cause analysis (RCA), and mitigation tasks. The evaluation of four state-of-the-art LLM-based agents (GPT-4-w-SHELL, GPT-3.5-w-SHELL, REACT, and FLASH) on this benchmark demonstrated that while LLM agents show significant promise, particularly in detection and localization tasks where they outperform traditional AIOps algorithms, they still face considerable challenges in more complex RCA and mitigation scenarios. The paper provides detailed insights into agent behaviors, including issues like wasted steps, invalid API usage, context window limitations, false positives, and the saturation of self-repair mechanisms. By committing to making AIOPSLAB publicly available, the authors aim to foster further research and development in AgentOps.
7.2. Limitations & Future Work
The paper explicitly and implicitly highlights several limitations and suggests avenues for future research:
- Current Agent Limitations in Complex Tasks: LLM agents struggle significantly with RCA and mitigation tasks. This points to a need for LLMs with more robust reasoning, planning, and long-term memory capabilities tailored for sequential decision-making in IT operations.
- Inefficient Agent Behaviors: Observations such as agents wasting steps, repeatedly making the same API usage errors (especially GPT-3.5-w-SHELL), and self-repair saturation indicate that current agentic frameworks need improvement.
  - Need for Better Task Decomposition and Planning: The quick saturation of self-repair suggests that agents require better internal planning mechanisms to break down complex AIOps problems into manageable sub-tasks.
  - Improved Intermediate Feedback: Beyond simple environment feedback, agents could benefit from more structured and informative feedback during intermediate steps, similar to how linters and test cases aid software development.
- Context Window Management: The issue of telemetry data overwhelming the LLM's context window, leading to token exhaustion and distraction, is a fundamental LLM challenge. Future work needs to focus on more refined telemetry data processing, filtering, and summarization techniques to provide agents with relevant information without cognitive overload.
- Qualitative Evaluation: For tasks like detection, agents might provide a correct answer but with incorrect reasoning. The paper suggests utilizing LLMs-as-Judges to perform more fine-grained qualitative evaluation of agent reasoning chains against problem descriptions (a sketch of such a judge follows this list).
- Extensibility of AIOPSLAB: While AIOPSLAB is designed to be extensible, the paper notes that some complex functional faults (e.g., AuthenticationMissing, RevokeAuth) require manual effort to set up (e.g., preparing scripts, updating Kubernetes config maps). Simplifying the definition and injection of such complex faults would enhance the framework's usability.
- Broader Fault Types and Problem Scenarios: The framework is adaptable to other fault types (e.g., anomaly detection workloads) and problem scenarios (e.g., requiring agents to label telemetry data). This is an ongoing area for expanding the benchmark.
- Specific Agent Implementations: The FLASH agent used in the evaluation was a simplified version due to its unavailability, implying that its full potential might not have been captured; further evaluation with a complete version would be beneficial.
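As a complement to the Qualitative Evaluation point above, a hedged sketch of what an LLM-as-Judge check could look like is shown below; the prompt, scoring rubric, and judge_llm interface are all assumptions for illustration.

```python
# Assumed LLM-as-Judge sketch: score an agent's reasoning chain against the
# ground-truth problem description. Prompt, rubric, and client are illustrative.
JUDGE_PROMPT = """You are grading an AIOps agent's diagnosis.
Problem description (ground truth): {problem}
Agent's reasoning and final answer: {transcript}
Score the reasoning from 1 (unsupported guess) to 5 (fully grounded in evidence),
then justify the score in one sentence. Respond as: SCORE: <n> | REASON: <text>"""


def judge_reasoning(judge_llm, problem: str, transcript: str) -> int:
    reply = judge_llm.complete(
        JUDGE_PROMPT.format(problem=problem, transcript=transcript)
    )
    # Parse "SCORE: <n> | ..."; default to the lowest score if parsing fails.
    try:
        return int(reply.split("SCORE:")[1].split("|")[0].strip())
    except (IndexError, ValueError):
        return 1
```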
7.3. Personal Insights & Critique
AIOPSLAB is a highly valuable contribution to the AIOps and AI agent research landscape. Its holistic, interactive, and realistic approach fills a significant gap in existing benchmarks, which often fall short in simulating the dynamic and multi-faceted nature of real-world IT operations.
Strengths:
- Pioneering AgentOps Evaluation: The paper clearly articulates the AgentOps vision and provides a concrete framework to evaluate LLM-based agents in this context. This is crucial for advancing the field beyond isolated AIOps tasks.
- Realism and Interaction: The use of live microservice applications, diverse symptomatic and functional faults, and the Agent-Cloud Interface (ACI) creates a genuinely interactive and realistic testing ground, far superior to static datasets. This allows for the study of dynamic agent behaviors, self-correction, and tool use.
- Comprehensive Problem Taxonomy: The four-level task taxonomy is well-defined and progressively challenging, offering a structured way to assess agent capabilities from simple detection to complex mitigation.
- Actionable Insights: The detailed analysis of agent failure modes, API usage patterns, and the impact of step limits provides valuable guidance for AI agent developers, pointing to specific areas for improvement (e.g., planning, context management, error handling). The observation about self-repair saturation is particularly significant for future agentic AI development.
- Commitment to Open Source: Making AIOPSLAB publicly available is a significant boon to the research community, enabling reproducibility, comparative studies, and collaborative development.
Potential Issues & Areas for Improvement:
- Scalability of the Benchmark: While 48 problems are a good start, real-world cloud environments are vastly more complex, with thousands of services and countless potential incident scenarios. Expanding the problem pool automatically and dynamically could be a future challenge.
- Complexity of Fault Injection: As noted, injecting some functional faults requires manual setup. Further automation, or a more intuitive declarative language for defining complex multi-service, multi-stage faults, would be beneficial.
- Security of exec_shell: The paper mentions security policy filters for exec_shell. Given that AI agents could potentially execute arbitrary commands, the robustness and restrictiveness of these filters are critical for real-world application; they need to be thoroughly detailed or customizable within the framework to prevent unintended consequences or malicious actions by an agent.
- "LLM-as-Judge" Bias: While LLMs-as-Judges offer a promising avenue for qualitative evaluation of reasoning, LLMs themselves can exhibit biases or inconsistencies. Care must be taken in designing the judging criteria and validating the judge LLM's fairness and accuracy.
- Beyond Reactive Agents: The current agents, even REACT and FLASH, are largely reactive (perceive, then act). Future agents could incorporate more proactive elements, such as predictive maintenance, anomaly prevention, or self-optimization, requiring an even more sophisticated evaluation framework.
- Evaluation of Non-LLM Agents: While the paper includes some traditional AIOps methods, the primary focus and benchmark design are clearly geared towards LLM-based agents. A more dedicated set of metrics or evaluation scenarios that specifically highlights the strengths and weaknesses of non-LLM, specialized AIOps algorithms could offer a richer comparative analysis.

Overall, AIOPSLAB is an impressive and timely research effort that pushes the boundaries of AIOps by providing a much-needed robust platform for evaluating AI agents in complex, dynamic cloud environments. Its insights are invaluable for guiding the next generation of autonomous cloud management systems.