
Large Language Models as Realistic Microservice Trace Generators

Published: 12/16/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study fine-tunes large language models with recursive generation and instruction tuning to create accurate, diverse synthetic microservice traces, effectively replacing real data and supporting downstream tasks like feature prediction and data completion.

Abstract

Workload traces are essential to understand complex computer systems' behavior and manage processing and memory resources. Since real-world traces are hard to obtain, synthetic trace generation is a promising alternative. This paper proposes a first-of-a-kind approach that relies on training a large language model (LLM) to generate synthetic workload traces, specifically microservice call graphs. To capture complex and arbitrary hierarchical structures and implicit constraints in such traces, we show how to fine-tune LLMs to generate recursively, making call graph generation a sequence of easier steps. To further enforce learning constraints in traces and generate uncommon situations, we argue for applying additional instruction tuning steps to align our model with the desired trace features. Our evaluation results show that we can generate diverse realistic traces under various conditions and outperform existing methods in accuracy and validity. We demonstrate that our synthetically generated traces can effectively replace real data to optimize important microservice management tasks. Additionally, our model adapts to downstream trace-related tasks, such as predicting key trace features and infilling missing data.

1. Bibliographic Information

1.1. Title

Large Language Models as Realistic Microservice Trace Generators

1.2. Authors

The authors of this paper are Donghyun Kim, Sriram Ravula, Taemin Ha, Alexandros G. Dimakis, Daehyeok Kim, and Aditya Akella. They are affiliated with the University of Texas at Austin.

1.3. Journal/Conference

The paper is available as a preprint on arXiv (arXiv:2502.17439v2). As a preprint, it had not undergone formal peer review or appeared in a journal or conference proceedings when it was posted. arXiv is nonetheless a widely used platform for circulating early computer science research, often ahead of formal publication.

1.4. Publication Year

The paper was published on 2024-12-16 (UTC).

1.5. Abstract

This paper introduces a novel method for generating synthetic workload traces, specifically microservice call graphs, by fine-tuning a large language model (LLM). Real-world traces are difficult to obtain, making synthetic generation a valuable alternative for understanding complex system behaviors and managing resources. To effectively capture the intricate hierarchical structures and implicit constraints inherent in microservice call graphs, the authors propose a recursive generation strategy, breaking down the complex task into a sequence of simpler steps. Furthermore, they enhance constraint adherence and enable the generation of uncommon scenarios through instruction tuning, aligning the model with desired trace features. Evaluation results demonstrate that this approach generates diverse and realistic traces under various conditions, outperforming existing methods in accuracy and validity. The synthetically generated traces are shown to be effective replacements for real data in optimizing microservice management tasks. Additionally, the model is adaptable to downstream trace-related tasks, such as predicting key features and infilling missing data.

The original source link is https://arxiv.org/abs/2502.17439v2. The PDF link is https://arxiv.org/pdf/2502.17439v2.pdf. The publication status is a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the scarcity and difficulty of obtaining real-world workload traces, particularly microservice call graphs, which are essential for analyzing complex computer systems, optimizing resource allocation (CPU, memory, networking), and managing applications. Real traces are often proprietary, hard to collect at scale across diverse environments, and limited in public availability. This limitation hinders thorough testing, analysis, and the development of robust microservice management techniques.

Previous attempts at synthetic trace generation using generative machine learning models like LSTMs, GANs, and diffusion models have made progress. However, these methods typically suffer from significant limitations: they often only generate specific fields (e.g., number of requests, resource types) or are restricted to fixed-structure traces (e.g., network packets), making them unsuitable for the complex, hierarchical, and variable-depth structures of microservice call graphs. These existing methods struggle to capture the implicit, inter-feature constraints crucial for realistic system behavior (e.g., parent-child process timing relationships).

The paper's entry point is the recent success of large language models (LLMs) in adapting to diverse domains beyond natural language, such as protein sequences, code, and tabular data. LLMs' ability to process sequential data, learn complex patterns, and adhere to user specifications through fine-tuning and instruction tuning positions them as a promising, yet unexplored, solution for generating synthetic system traces that accurately capture real-world structures and constraints. The innovative idea is to adapt general-purpose LLMs to generate these complex, graph-structured traces, specifically microservice call graphs, which are a rich, directed acyclic graph (DAG) structure foundational to modern cloud applications.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  • First-of-a-Kind LLM-based Approach: It proposes the first approach that leverages large language models (LLMs) for generating synthetic microservice call graphs. This is a novel application of LLMs to system traces, handling their unique structural and constraint challenges.

  • Recursive Generation for Hierarchical Structures: The paper introduces a key innovation: recursively generating subgraphs or layers within a call graph. This method breaks down the complex task of generating an entire hierarchical graph into a sequence of more manageable sub-tasks, significantly improving the model's ability to produce valid and accurate complex graph structures.

  • Instruction Tuning with Intermediate Instructions for Constraint Adherence: To further enforce implicit constraints (e.g., timing relationships, parent-child dependencies) and to enable the generation of user-specified uncommon situations, the authors employ instruction tuning with "intermediate instructions." These natural language reasoning steps, inserted during the generation process, guide the LLM to perform arithmetic and logical checks, ensuring strict adherence to desired structural properties.

  • Superior Synthetic Trace Quality: The evaluation demonstrates that the proposed method generates diverse and realistic traces, outperforming existing generative and probabilistic models in terms of structural validity, distribution similarity to real traces (e.g., popular calls, heavy-hitters, branching, response times), and reduced memorization of training data.

  • Utility for Microservice Management Tasks: The synthetically generated traces are shown to effectively substitute real data for training machine learning models used in critical microservice management tasks, such as critical component extraction (FIRM) and anomaly detection (TraceVAE), with minimal performance degradation compared to training on real data.

  • Adaptability to Downstream Trace-Related Tasks: The fine-tuned LLM demonstrates strong performance in other trace-related downstream tasks, including predicting key trace features (e.g., uncommon communications) and infilling missing data within traces, even outperforming much larger foundation models (like Llama-3.1 405B) that lack domain-specific training.

Together, these findings address the scarcity and rigidity of synthetic trace generation with a flexible, accurate, and structurally valid method whose output can stand in for real data in system analysis and optimization tasks.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the paper, a reader should understand several key concepts:

  • Workload Traces: In computer systems, a workload trace is a detailed record of events that occur during the execution of applications or systems. These events can include CPU usage, memory allocation, network I/O, API calls, and process start/end times. Traces are vital for understanding system behavior, identifying bottlenecks, and optimizing resource management.

  • Microservices: Microservices represent an architectural style where an application is structured as a collection of loosely coupled, independently deployable services. Each service typically performs a specific business function and communicates with others via lightweight mechanisms, often HTTP APIs or message queues. This contrasts with monolithic architectures where all functionalities are bundled into a single unit. Examples include popular services from Uber and Netflix.

  • Microservice Call Graphs: When a user request traverses through an application built with microservices, it triggers a sequence of communications between different services. A microservice call graph is a representation of this execution path. It is typically a directed acyclic graph (DAG), where:

    • Vertices (Nodes): Represent individual microservices or the client initiating the request.
    • Edges: Represent API calls or communications between microservices. Each edge can have features like source, destination, communication type (e.g., HTTP, RPC), and start/finish times.
    • Hierarchical Structure: A call graph has a hierarchical nature, where one microservice might call several child microservices, forming layers of dependencies.
  • Large Language Models (LLMs): Large Language Models are a class of artificial intelligence models, typically based on the transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They are autoregressive, meaning they predict the next token in a sequence based on the preceding tokens.

    • Fine-tuning: The process of taking a pre-trained LLM (trained on a general dataset) and further training it on a smaller, specific dataset to adapt it to a particular task or domain. This allows the model to specialize while retaining its broad linguistic understanding.
    • Instruction Tuning: A form of fine-tuning where an LLM is trained on a dataset of instructions (prompts) and desired responses. This teaches the model to follow user commands and generate outputs aligned with specific instructions, improving its ability to respond to diverse prompts.
    • In-context Learning: The ability of an LLM to learn from examples provided directly within the prompt, without requiring explicit fine-tuning. The model uses the given examples to infer the desired task and generate appropriate responses.
    • Llama-2 7B / Llama-3.1 405B: Specific LLM models developed by Meta. "7B" and "405B" refer to the number of parameters in the model (7 billion and 405 billion, respectively), which indicates their size and computational complexity. Larger models generally have greater capacity for learning complex patterns.
  • Generative Models: Machine learning models designed to generate new data instances that resemble the training data.

    • LSTMs (Long Short-Term Memory): A type of recurrent neural network (RNN) capable of learning long-term dependencies, often used for sequence prediction tasks. (Sherstinsky, 2020)
    • GANs (Generative Adversarial Networks): Consist of two neural networks, a generator and a discriminator, that compete against each other to generate realistic synthetic data. The generator tries to produce data that fools the discriminator, while the discriminator tries to distinguish real from fake data. (Goodfellow et al., 2014)
    • Diffusion Models: A class of generative models that learn to reverse a gradual diffusion process (adding noise) to generate data. They start with random noise and iteratively denoise it to produce a sample. (Ho et al., 2020)
  • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique for large language models. Instead of fine-tuning all parameters of a large pre-trained model, LoRA injects small, low-rank matrices into the transformer layers. During fine-tuning, only these newly added matrices are trained, significantly reducing the number of trainable parameters and computational cost, while achieving comparable performance to full fine-tuning. (Hu et al., 2022)
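The parameter saving described in the LoRA bullet can be made concrete with a little arithmetic. The sketch below assumes a single square weight matrix and a rank of 8; both values are illustrative choices, not settings reported by the paper, and the function names are hypothetical.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters in the two low-rank factors:
    B has shape (d_out, rank) and A has shape (rank, d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumed, not taken from the paper).
D = 4096
RANK = 8

full = full_finetune_params(D, D)
lora = lora_params(D, D, RANK)
ratio = lora / full

print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (rank {RANK}): {lora:,} parameters")
print(f"fraction trained: {ratio:.4%}")
```

For this one matrix, the low-rank factors hold 65,536 parameters against 16,777,216 in the full matrix, which is where the reduction in trainable parameters comes from.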

3.2. Previous Works

The paper frames its contribution against the backdrop of existing research in synthetic trace generation and LLM adaptation.

  • Existing Synthetic Trace Generation Methods:

    • Network Packet Traces: Previous works like Lin et al. (2020), Jiang et al. (2023), and Yin et al. (2022) have explored GANs and diffusion models to generate synthetic network packet traces. These methods are effective for their specific domains but are typically limited to generating fixed-structure data or specific fields within the traces. They do not address the complex hierarchical and variable-depth structures of microservice call graphs.
    • Virtual Machine Workload Traces: Bergsma et al. (2021) utilized LSTMs for generating virtual machine workload traces. Similar to network traces, these methods often focus on specific numerical fields or time-series data without explicitly modeling complex structural dependencies like those found in call graphs.
    • Limitations: The primary drawback of these generative models for microservice call graphs is their inability to enforce structural constraints (e.g., parent-child timing, DAG formation) or handle the inherent hierarchical nature of call graphs. They are often confined to predicting specific fields or following training data distributions without explicit structural adherence.
  • Synthetic Tabular Data Generation Methods:

    • Since traces can be represented in tabular form, methods for synthetic tabular data generation are relevant. Approaches like TVAE (Xu et al., 2019), a VAE-based model, and GReaT (Borisov et al., 2023), which leverages language models for tabular data, have advanced the field.
    • Limitations: While these methods are adept at generating tabular data, they fundamentally do not account for the hierarchical structure of call graphs within their tabular representations. The relationships between rows (edges) in a call graph are crucial and are not inherently preserved by general tabular data generators. The GReaT approach, which encodes tabular data as text for LLMs, serves as a direct baseline in this paper, highlighting the need for specialized handling of graph structures.
  • LLM Adaptations to Specialized Domains:

    • LLMs have demonstrated versatility beyond natural language. Examples include their successful adaptation to:
      • Protein Sequences: Shen et al. (2024)
      • Code Generation: Roziere et al. (2023)
      • Tabular Data: Borisov et al. (2023)
      • Quantitative Reasoning: Lewkowycz et al. (2022)
      • Semiconductor Manufacturing: Liu et al. (2023)
    • These works illustrate the potential of LLMs to model complex patterns and structures in diverse data types, forming the conceptual basis for applying them to microservice traces.
  • Making Language Models Follow Instructions & Multi-step Reasoning:

    • The paper draws inspiration from advancements in enhancing LLMs' instruction-following capabilities through prompting (Li and Liang, 2021; Shin et al., 2020; Wei et al., 2022) and instruction tuning (Ouyang et al., 2022; Wei et al., 2021; Chung et al., 2022). These methods are crucial for guiding LLMs to generate specific outputs based on user requirements.
    • The idea of multi-step reasoning with LLMs, as seen in Chain-of-Thought prompting (Wei et al., 2022) and Tree-of-Thoughts (Yao et al., 2024), where problems are broken down into smaller, explicit steps, directly informs the paper's use of "intermediate instructions" to enforce structural constraints and guide recursive generation.

3.3. Technological Evolution

The evolution of synthetic data generation has moved from domain-specific statistical models and rule-based systems, through general-purpose generative deep learning models (GANs, VAEs, LSTMs) applied to specific data types (e.g., time series, tabular data, fixed-structure network packets), and is now entering an era where large language models are being adapted to handle increasingly complex and structured data.

  • Early Stages (Pre-Deep Learning): Rule-based systems and statistical models, often limited by their reliance on explicit rules and assumptions about data distributions.
  • Deep Learning for Specific Fields: LSTMs for time-series data (e.g., VM workloads), GANs/Diffusion for fixed-format data (e.g., network packets), and VAEs for tabular data. These models excelled at capturing patterns within their specific data types but often lacked the flexibility to handle arbitrary structures or complex inter-feature constraints across different domains.
  • LLM Era: The advent of LLMs, with their transformer-based architectures and massive pre-training, provided models capable of understanding and generating highly complex sequential data. Initially focused on natural language, their success led to adaptations in code, protein sequences, and generic tabular data (e.g., GReaT). This paper places its work within this LLM era, pushing the boundary to highly structured, hierarchical graph data that requires explicit constraint satisfaction.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's approach offers several core differences and innovations:

  • Handling Hierarchical Graph Structures: Unlike existing generative models that are confined to fixed-structure traces or predict only specific fields (e.g., network packets, VM workloads), this paper explicitly tackles the hierarchical, DAG structure of microservice call graphs. This is a crucial distinction, as call graphs involve complex parent-child relationships and variable depths/widths that simpler models cannot effectively capture.
  • Recursive Generation: The introduction of recursive generation is a novel mechanism specifically designed for hierarchical data. By breaking down the complex task into layer-by-layer sub-generations, the LLM can manage the complexity of graph construction more effectively. This contrasts with tabular data generators like TVAE or GReaT, which treat rows independently or sequentially without deep structural awareness across them.
  • Constraint Enforcement via Instruction Tuning and Intermediate Instructions: This is a major innovation. While other LLM adaptations might use fine-tuning for domain-specific language, this paper uses instruction tuning with intermediate, natural language reasoning steps to actively guide the LLM in enforcing complex implicit constraints (e.g., timing consistency, parent-child relationships, valid DAG formation). This goes beyond simply learning data distributions; it teaches the model to reason about and adhere to structural rules during generation.
  • Domain-Specific Adaptation of LLMs for System Traces: While LLMs have been adapted to various domains (code, proteins), this is the first documented application to computer system traces that involve such intricate graph structures and temporal/logical constraints. It showcases that LLMs can be successfully specialized to generate data that is not only statistically similar but also structurally valid for systems engineering needs.
  • Demonstrated Practical Utility: Beyond just generating data, the paper rigorously demonstrates that its synthetic traces are practically useful, serving as effective training data for real-world microservice management tasks and enabling downstream tasks like prediction and infilling, where generic LLMs or other generative models fail.
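The structural rules named in this comparison (timing consistency, parent-child relationships, valid DAG formation) can be illustrated with a small checker. This is a minimal sketch under stated assumptions, not the paper's validity metric: the `Edge` fields, the rule that a child call must fall inside the time window of a call into its parent, and the service name "Cart" are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    edge_id: int
    source: str
    destination: str
    start_ms: int
    finish_ms: int


def validate_call_graph(edges, root):
    """Return a list of violation messages; an empty list means all checks pass."""
    violations = []

    # Check 1: every call finishes no earlier than it starts.
    for e in edges:
        if e.finish_ms < e.start_ms:
            violations.append(f"edge {e.edge_id}: finishes before it starts")

    # Check 2 (assumed rule): a call made by a service must fall inside the
    # time window of at least one call into that service.
    incoming = {}
    for e in edges:
        incoming.setdefault(e.destination, []).append(e)
    for e in edges:
        if e.source == root:
            continue
        parents = incoming.get(e.source, [])
        if not parents:
            violations.append(f"edge {e.edge_id}: source {e.source} is never called")
        elif not any(p.start_ms <= e.start_ms and e.finish_ms <= p.finish_ms
                     for p in parents):
            violations.append(f"edge {e.edge_id}: outside its parent's time window")

    # Check 3: the graph is acyclic (Kahn's algorithm, counting every edge).
    nodes = {e.source for e in edges} | {e.destination for e in edges}
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for e in edges:
        indegree[e.destination] += 1
        children[e.source].append(e.destination)
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if visited != len(nodes):
        violations.append("graph contains a cycle")

    return violations


valid = [Edge(0, "Client", "Front end", 0, 24), Edge(1, "Front end", "Cart", 2, 10)]
late_child = [valid[0], Edge(1, "Front end", "Cart", 2, 30)]
cyclic = valid + [Edge(2, "Cart", "Front end", 3, 5)]

for name, graph in [("valid", valid), ("late_child", late_child), ("cyclic", cyclic)]:
    print(name, validate_call_graph(graph, root="Client"))
```

A generator that only matches per-field distributions can pass none of these checks by construction, which is the gap the recursive and instruction-tuned approach targets.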

4. Methodology

The core methodology of this paper centers on training a large language model (LLM) to generate synthetic microservice call graphs. This is achieved through a two-stage process: an initial pre-training phase to adapt the LLM to the domain of call graphs and a subsequent instruction tuning phase to enhance its ability to follow user specifications and enforce structural constraints. The fundamental principle is to represent the inherently graph-structured call graphs as text sequences, enabling autoregressive LLMs to process and generate them. The key innovation lies in a recursive generation strategy and the use of natural language intermediate instructions to manage complexity and ensure structural validity.

4.1. Principles

The core idea is to leverage the powerful sequence modeling capabilities of pre-trained LLMs, which are highly effective at understanding and generating complex text, by transforming microservice call graphs into a textual format. The theoretical basis is that if a complex, structured data type like a call graph can be adequately serialized into a text sequence, an autoregressive LLM can learn to model its distribution and generate new, similar sequences. The intuition behind recursive generation is to decompose the challenging task of generating an entire, potentially large and deep, graph into a series of simpler, sequential decisions. Instead of generating all edges at once, the model generates one "layer" of the graph, then uses the context of that layer to generate the next, much like how a graph might be traversed or expanded. The instruction tuning phase, augmented with intermediate instructions, capitalizes on LLMs' capacity for step-by-step reasoning. By explicitly "telling" the model, in natural language, to perform checks or derive values, it can be guided to produce outputs that strictly adhere to complex, implicit constraints that are difficult for standard generative models to learn implicitly.
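The layer-by-layer intuition can be sketched as a simple loop over pending layer conditions. In the sketch below, `toy_layer_model` is a hypothetical, deterministic stand-in for the fine-tuned LLM, and the condition keys (`start_node`, `remaining_depth`) are illustrative names rather than the paper's exact fields.

```python
from collections import deque


def toy_layer_model(cond):
    """Hypothetical stand-in for the LLM: given one layer's conditions, return
    that layer's edges plus the conditions for each child subgraph."""
    parent = cond["start_node"]
    depth = cond["remaining_depth"]
    if depth == 0:
        return [], []
    fanout = 2 if depth > 1 else 1
    edges, child_conditions = [], []
    for i in range(fanout):
        child = f"{parent}.{i}"
        edges.append((parent, child))
        child_conditions.append({"start_node": child, "remaining_depth": depth - 1})
    return edges, child_conditions


def generate_call_graph(initial_condition, layer_model, max_layers=1000):
    """Generate one layer at a time, feeding each layer's output conditions
    back in as the input for later layers."""
    pending = deque([initial_condition])
    edges = []
    layers_generated = 0
    while pending and layers_generated < max_layers:
        cond = pending.popleft()
        layer_edges, child_conditions = layer_model(cond)
        edges.extend(layer_edges)
        pending.extend(child_conditions)
        layers_generated += 1
    return edges


graph = generate_call_graph({"start_node": "USER", "remaining_depth": 2}, toy_layer_model)
for source, destination in graph:
    print(f"{source} -> {destination}")
```

Each call to the layer model sees only a short condition and emits a short layer, so no single generation step has to produce the whole graph. The `max_layers` cap is an added safeguard for the sketch, since a learned model is not guaranteed to terminate.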

4.2. Core Methodology In-depth (Layer by Layer)

The training of LLMs for microservice trace generation proceeds in two main stages: pre-training and instruction tuning.

4.2.1. Pre-training

The pre-training stage adapts a general-purpose LLM, initially trained on natural language, to the specific domain of microservice call graphs. This is done using an autoregressive language modeling objective.

4.2.1.1. Encoding Call Graphs as Text

Since LLMs are designed to process and generate sequences of text, the first crucial step is to convert the microservice call graphs, which are initially represented in a tabular format, into a text-based representation.

  • Tabular Representation: A call graph $\mathbf{X}$ is initially stored as a table in which each row represents an edge (a communication between microservices) and each column represents a feature of that edge. For instance, an edge might have the features Source, Destination, Communication Type, Start Time, and Finish Time.

    • Let $\mathbf{X}$ have $m$ feature columns $\{f_1, \ldots, f_m\}$ and $n$ edge rows $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$.
    • The value of feature $j$ for edge $i$ is denoted $v_{ij}$.
  • Edge Encoding: Each edge $\mathbf{x}_i$ is encoded as a text sequence $\mathbf{t}_i = [t_{i1}, \ldots, t_{im}]$.

    • Each component $t_{ij}$ is formed as $[\phi(f_j), v_{ij}]$, where $\phi(f)$ is a function that converts the feature name $f$ into a natural language template describing the value $v_{ij}$.
    • Example: For the feature "Communication starts at" with value "0 ms", $\phi(\text{"Communication starts at"})$ is "Communication starts at" and $v_{ij}$ is "0 ms", so the resulting text is "Communication starts at 0 ms".
    • The paper provides an example: [Edge ID is 0, Source is Client, Destination is Front end, Type is HTTP, Communication starts at 0 ms, Communication finishes at 24 ms].
    • To eliminate spurious position-based associations that might arise from a fixed column order, the feature order within each edge $\mathbf{x}_i$ is randomly shuffled during training.
  • Full Graph Encoding: The entire call graph is represented as a sequence of these text-encoded edges: $\mathbf{t} = [\mathbf{t}_1, \ldots, \mathbf{t}_n]$.

  • Global Attribute Encoding (Conditioning Information): In addition to edge features, global attributes of the call graph are encoded to provide conditioning information for the model. These attributes summarize complex interactions, such as maximum depth, total edges, and total communication latency.

    • Let call graph $\mathbf{X}$ have $r$ attributes with names $\{a_1, \ldots, a_r\}$ and values $\{w_1, \ldots, w_r\}$.
    • These are encoded as a text sequence $\mathbf{c} = [c_1, \ldots, c_r]$, where $c_j = [a_j, \text{", "}, w_j]$.
    • Example (from Figure 2): [Service ID is S_32647104, Num Edges is 3, Max Depth is 2, Start Node is USER, Latency is 12ms]
    • These attributes $\mathbf{c}$ are prepended to each text-encoded call graph $\mathbf{t}$ and are predicted alongside the edges during pre-training.
    • Similar to edge features, attributes are randomly shuffled. Each attribute is independently dropped with probability $p_{drop}$ (set to 0.9 for most attributes) to enable flexible prompting during inference; the service ID is the exception and is always kept.
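The encoding steps above (template per feature, shuffled feature order, shuffled attributes with random dropping) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names and the `EDGE_TEMPLATES` table are hypothetical, and the "name is value" attribute format follows the Figure 2 example.

```python
import random

# Hypothetical phi(f): maps each feature name to its natural language template.
EDGE_TEMPLATES = {
    "edge_id": "Edge ID is",
    "source": "Source is",
    "destination": "Destination is",
    "type": "Type is",
    "start": "Communication starts at",
    "finish": "Communication finishes at",
}


def encode_edge(edge, rng):
    """Encode one edge as text, shuffling the feature order."""
    parts = [f"{EDGE_TEMPLATES[name]} {value}" for name, value in edge.items()]
    rng.shuffle(parts)
    return "[" + ", ".join(parts) + "]"


def encode_attributes(attrs, rng, p_drop=0.9, always_keep=("Service ID",)):
    """Encode global attributes, dropping each with probability p_drop
    unless its name is listed in always_keep."""
    kept = [
        f"{name} is {value}"
        for name, value in attrs.items()
        if name in always_keep or rng.random() >= p_drop
    ]
    rng.shuffle(kept)
    return "[" + ", ".join(kept) + "]"


def encode_call_graph(attrs, edges, rng, p_drop=0.9):
    """Prepend the encoded attributes to the sequence of encoded edges."""
    encoded_edges = " ".join(encode_edge(edge, rng) for edge in edges)
    return encode_attributes(attrs, rng, p_drop) + " " + encoded_edges


example_edge = {
    "edge_id": 0, "source": "Client", "destination": "Front end",
    "type": "HTTP", "start": "0 ms", "finish": "24 ms",
}
example_attrs = {
    "Service ID": "S_32647104", "Num Edges": 3, "Max Depth": 2,
    "Start Node": "USER", "Latency": "12ms",
}

print(encode_call_graph(example_attrs, [example_edge], random.Random(0)))
```

Because attributes are dropped at random during training, the model sees many partial prompts, which is what later lets a user specify only the attributes they care about.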

4.2.1.2. Recursive Generation

This is a key innovation for handling the hierarchical structure of call graphs. Instead of generating the entire graph at once, the task is broken down into a series of recursive layer generation tasks.

  • Layer Partitioning: An encoded call graph $\mathbf{t} = [\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_n]$ (a sequence of edges) is partitioned into a sequence of layers $[\mathbf{t}^1, \mathbf{t}^2, \ldots, \mathbf{t}^l]$.
    • Each layer $\mathbf{t}^k$ consists of a sequence of edges that share the same parent (source) node. This ensures that no edge appears in multiple layers.
  • Layer Conditions: For the overall call graph conditions $\mathbf{c}$ that describe $\mathbf{t}$, layer conditions $\mathbf{c}^j$ are introduced for $j \in \{1, 2, \ldots, l+1\}$.
    • $\mathbf{c}^j$ encodes the attributes of the remaining portion of the call graph after the sequence of layers $[\mathbf{t}^1, \mathbf{t}^2, \ldots, \mathbf{t}^{j-1}]$ has been generated.
    • The initial overall conditions are set as $\mathbf{c}^1 := \mathbf{c}$.
    • The final condition indicates no remaining graph: $\mathbf{c}^{l+1} := \varnothing$.
  • Decomposition of Conditional Distribution: The conditional call graph distribution is decomposed as a chain of conditional layer distributions: $ p(\mathbf{t} | \mathbf{c}) = \prod_{k=1}^l p(\mathbf{c}^{k+1}, \mathbf{t}^k | \mathbf{c}^k) $
    • Explanation: This formula states that the probability of generating the full trace $\mathbf{t}$ given the initial conditions $\mathbf{c}$ can be expressed as the product of the probabilities of generating each layer $\mathbf{t}^k$ and its subsequent layer conditions $\mathbf{c}^{k+1}$, conditioned on the current layer conditions $\mathbf{c}^k$.
    • Process: The model iteratively predicts call graphs layer-by-layer.
      1. For layer $k$, the model takes $\mathbf{c}^k$ (the conditions for the current layer) as input.

      2. It then produces the sequence of edges $\mathbf{t}^k$ (the edges within the current layer).

      3. Immediately after $\mathbf{t}^k$, it generates the conditions $\mathbf{c}^{k+1}$ for the next layer.

      4. This process continues recursively, using $\mathbf{c}^{k+1}$ to predict layer $k+1$, and so on, until all edges are generated or the conditions indicate completion (e.g., remaining depth is zero).

        Figure 2 from the paper visually illustrates this recursive generation process.

        Image description: a schematic of the layered, recursive generation of a microservice call graph, showing three levels of call nodes together with their edge conditions and subgraph relationships. Figure 2: A training data sample of a call graph with 3 edges for recursive generation as illustrated in Figure 10, detailing edge attributes and subgraph structure.

The figure shows "Layer 1" with Start Node: USER and Start Communication: 0 ms, from which Edge 1 (USER to MS_55040) is generated. Based on this, a Condition for Layer 2 (Subgraph 1) is generated for MS_55040, which then generates Edge 2 and Edge 3. After these, a Condition for Layer 3 (Subgraphs 2 and 3) is generated for the child nodes of MS_55040, and the process continues. This demonstrates how the model generates the edges for one layer and then the conditions for the next layer's subgraphs.
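On the training-data side, the layer partitioning and layer conditions described above can be sketched as follows. This is a minimal sketch under two assumptions: each service is called at most once (so grouping by source node yields one layer per parent), and the only condition tracked is the number of remaining edges. The child names `MS_A` and `MS_B` are hypothetical placeholders for the children of MS_55040.

```python
from collections import defaultdict, deque


def partition_into_layers(edges, root):
    """Group edges by their source node, visiting parents breadth-first from
    the root, so each layer holds the edges that share one parent."""
    by_source = defaultdict(list)
    for edge in edges:
        by_source[edge["source"]].append(edge)

    layers = []
    queue = deque([root])
    seen = {root}
    while queue:
        node = queue.popleft()
        layer = by_source.get(node, [])
        if not layer:
            continue  # leaf node: it makes no calls, so it has no layer
        layers.append(layer)
        for edge in layer:
            if edge["destination"] not in seen:
                seen.add(edge["destination"])
                queue.append(edge["destination"])
    return layers


def remaining_edge_conditions(layers):
    """Return the number of edges still to generate before each layer,
    plus a final 0 standing in for the empty condition after the last layer."""
    remaining = sum(len(layer) for layer in layers)
    conditions = []
    for layer in layers:
        conditions.append(remaining)
        remaining -= len(layer)
    conditions.append(remaining)
    return conditions


example_edges = [
    {"id": 1, "source": "USER", "destination": "MS_55040"},
    {"id": 2, "source": "MS_55040", "destination": "MS_A"},
    {"id": 3, "source": "MS_55040", "destination": "MS_B"},
]

example_layers = partition_into_layers(example_edges, root="USER")
example_conditions = remaining_edge_conditions(example_layers)
print([[edge["id"] for edge in layer] for layer in example_layers])
print(example_conditions)
```

For the three-edge example, this yields two layers (Edge 1, then Edges 2 and 3) and remaining-edge conditions of 3, 2, and 0, mirroring how each layer's condition describes only the part of the graph not yet generated.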

4.2.2. Instruction Tuning

After the initial pre-training, a supervised fine-tuning phase, referred to as instruction tuning, is performed. This stage is crucial for enhancing the model's ability to generate call graphs precisely based on user instructions and to enforce structural constraints more robustly.

  • Fixed Prompt: Unlike pre-training, during instruction tuning, the initial call graph attributes $\mathbf{c}$ (equivalent to the first-layer conditions $\mathbf{c}^1$) are excluded from the loss computation. They are treated as a fixed prompt that guides the generation.
  • Additional User Instructions: Users can provide additional instructions, which are incorporated into the prompt. The paper studies two types of instructions: High Latency (e.g., "Build a call graph with high latency") and Uncommon Communications (e.g., "Include an edge from (SRC) to (DEST) with (TYPE) communication type").
  • Programmatically Generated Prompts: To aid reasoning and align with user specifications, numerical and non-linguistic attributes (e.g., application IDs) are converted into natural language descriptions and added to the prompt. This helps the LLM "understand" the context better.
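Excluding the prompt from the loss is commonly implemented by masking the prompt positions in the label sequence. The sketch below assumes that convention, using -100 as the ignore label (the default `ignore_index` of PyTorch's cross-entropy loss); the whitespace tokenizer, the prompt layout, and all function names are hypothetical stand-ins rather than the paper's setup.

```python
IGNORE_INDEX = -100  # assumption: label value the loss function skips


def build_prompt(attributes, instructions):
    """Join the user instructions and the call graph attributes into one prompt."""
    conditions = ", ".join(f"{name} is {value}" for name, value in attributes.items())
    return " ".join(list(instructions) + [f"[{conditions}]"])


def toy_tokenize(text, vocab):
    """Toy whitespace tokenizer that assigns ids on first sight (illustration only)."""
    return [vocab.setdefault(token, len(vocab)) for token in text.split()]


def build_training_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt so only the response
    tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


vocab = {}
prompt = build_prompt(
    {"Service ID": "S_32647104", "Num Edges": 3},
    ["Build a call graph with high latency."],
)
response = "[Edge ID is 0, Source is Client, Destination is Front end]"

prompt_ids = toy_tokenize(prompt, vocab)
response_ids = toy_tokenize(response, vocab)
input_ids, labels = build_training_example(prompt_ids, response_ids)

print(prompt)
print(labels)
```

The model still attends to the prompt tokens as context; they are simply not targets, so training capacity goes to producing the call graph rather than reproducing the conditions.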

4.2.2.1. Intermediate Instructions

A critical component of instruction tuning is the introduction of intermediate instructions. These are natural language reasoning steps inserted into the generation sequence to reinforce constraint adherence, particularly when the model struggles to generate consistent next-layer conditions $\mathbf{c}^{k+1}$ based on the current layer's edges $\mathbf{t}^k$ and conditions $\mathbf{c}^k$.

  • Motivation: LLMs often benefit from explicit, step-by-step reasoning, similar to Chain-of-Thought prompting. Without these, the model might violate physical constraints (e.g., assigning a child layer higher latency than its parent or the overall call graph allows).
  • Mechanism: These intermediate instructions are inserted before the generation of ck+1 \mathbf{c}^{k+1} during the instruction fine-tuning phase. They act as explicit computational or logical checks that the model must "perform" in natural language.
    • Example (from paper):

      1. "compute remaining edges from the Num Edges attribute in $\mathbf{c}^k$ and edges in $\mathbf{t}^k$."
      2. "derive the Remaining Depth in $\mathbf{c}^{k+1}$ as Child's remaining depth = current remaining depth - 1 = ... ."
    • Figure 11 in the paper provides an example of instruction tuning data, showing how these intermediate instructions (called "Chain-of-Thought scratchpads") are embedded.

      Figure 11: A training data sample of a call graph layer for instruction-tuning.

      In this example, the "Output" section shows an <edges> block followed by "Thoughts:" where calculations like "num generated edges = the last edge id - the first edge id + 1" are explicitly stated. Then, before the <subgraph> block, more "Thoughts:" explain how remaining_depth is derived ("Child's remaining depth = current remaining depth - 1"). These explicit reasoning steps guide the LLM to learn and internalize the constraint-checking logic.
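
The arithmetic that these scratchpads spell out can be sketched as follows. This is an illustration only: the dictionary keys, the use of consecutive integer edge IDs, and the function name `scratchpad` are assumptions, and the exact wording of the paper's templates is not reproduced here.

```python
def scratchpad(conditions, edge_ids):
    """Return the 'Thoughts' lines and the derived values for the next layer.

    conditions: hypothetical stand-in for c^k, with keys 'num_edges' and 'remaining_depth'.
    edge_ids: hypothetical consecutive integer IDs of the edges in the current layer (t^k).
    """
    first, last = edge_ids[0], edge_ids[-1]
    budget = conditions["num_edges"]
    depth = conditions["remaining_depth"]

    num_generated = last - first + 1
    remaining_edges = budget - num_generated
    child_depth = depth - 1

    thoughts = [
        f"num generated edges = {last} - {first} + 1 = {num_generated}",
        f"remaining edges = {budget} - {num_generated} = {remaining_edges}",
        f"Child's remaining depth = {depth} - 1 = {child_depth}",
    ]
    return thoughts, {"num_edges": remaining_edges, "remaining_depth": child_depth}


thoughts, next_conditions = scratchpad({"num_edges": 5, "remaining_depth": 3}, [1, 2, 3])
```

During instruction tuning, text of this kind sits between the generated edges and the next-layer conditions, so the model is trained to state each derived quantity before it uses it.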

This two-stage process, combining domain-specific pre-training with recursive generation and constraint-reinforcing instruction tuning, allows the LLM to effectively learn and generate complex, valid, and user-conditioned microservice call graphs.

5. Experimental Setup

5.1. Datasets

The primary dataset used for training and evaluation is the Alibaba microservice v2022 dataset (Luo et al., 2022).

  • Source and Scale: The dataset consists of 1.36 million microservice call graph samples, which translate to 1.1 billion tokens after text encoding. These traces were collected from the first hour of Alibaba's operations, spanning more than 10 clusters and involving 6434 unique microservices.

  • Characteristics: The traces capture API calls, where each call provides communication information between two microservices. The service IDs are anonymized (e.g., S_058367691 for services, MS_55040 for microservices) to ensure privacy, meaning the model does not disclose sensitive information regarding actual service names.

  • Data Preprocessing:

    • Call graphs are constructed using the trace ID field, grouping API calls that belong to the same request.
    • Calls with missing information (e.g., unknown destination microservice IDs) are removed.
    • Disconnected call graphs (those with missing edges leading to an incomplete graph) are removed.
    • Redundant call graphs (those with identical structures and feature values across all API calls) are filtered out to ensure diversity in the training data.
  • Splitting: 10% of the preprocessed samples are reserved for validation.

  • Why this dataset was chosen: The Alibaba dataset is a realistic, large-scale, enterprise-level microservice trace dataset, making it highly suitable for validating the method's ability to generate traces reflecting real-world complexity and scale. The anonymization also addresses ethical concerns regarding sensitive data.

    The distribution of the training data after preprocessing is illustrated in Figure 8:

    Figure 8: Training data distribution after preprocessing.

The upper graph shows the distribution by maximum call graph depth, indicating that most call graphs have a depth of 1 or 2, with fewer graphs having greater depths. The lower graph shows the distribution by the number of edges, revealing that the majority of call graphs have a small number of edges (e.g., fewer than 10), with a long tail for graphs with more edges. This highlights the varying complexity of call graphs in the dataset.
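
The preprocessing steps listed above (grouping by trace ID, dropping incomplete calls, removing disconnected graphs, and filtering duplicates) might be sketched as below. The field names (`trace_id`, `src`, `dst`, `type`), the `"UNKNOWN"` marker, and the choice to compare graphs on these three fields only are assumptions made for illustration, not the dataset's actual schema.

```python
from collections import defaultdict


def is_connected(edges):
    """Check weak connectivity: every node is reachable when edge direction is ignored."""
    adj = defaultdict(set)
    for src, dst, _ in edges:
        adj[src].add(dst)
        adj[dst].add(src)
    start = next(iter(adj))
    stack, visited = [start], {start}
    while stack:
        for nxt in adj[stack.pop()] - visited:
            visited.add(nxt)
            stack.append(nxt)
    return len(visited) == len(adj)


def preprocess(calls):
    """Group calls into call graphs, then drop incomplete, disconnected, and duplicate graphs."""
    graphs = defaultdict(list)
    for call in calls:
        # Assumption: missing values appear as None, "", or "UNKNOWN".
        if any(call.get(k) in (None, "", "UNKNOWN") for k in ("src", "dst", "type")):
            continue
        graphs[call["trace_id"]].append((call["src"], call["dst"], call["type"]))

    seen, kept = set(), {}
    for trace_id, edges in graphs.items():
        if not is_connected(edges):
            continue
        key = tuple(sorted(edges))  # order-independent fingerprint of the graph
        if key in seen:
            continue
        seen.add(key)
        kept[trace_id] = edges
    return kept


calls = [
    {"trace_id": "t1", "src": "A", "dst": "B", "type": "rpc"},
    {"trace_id": "t1", "src": "B", "dst": "C", "type": "mq"},
    {"trace_id": "t2", "src": "A", "dst": "B", "type": "rpc"},  # t2 duplicates t1
    {"trace_id": "t2", "src": "B", "dst": "C", "type": "mq"},
    {"trace_id": "t3", "src": "A", "dst": "B", "type": "rpc"},  # t3 is disconnected
    {"trace_id": "t3", "src": "C", "dst": "D", "type": "rpc"},
    {"trace_id": "t4", "src": "A", "dst": None, "type": "rpc"},  # missing destination
]
kept = preprocess(calls)
```

In this toy input only `t1` survives: `t2` is a duplicate, `t3` is disconnected, and the single call of `t4` is dropped for its missing destination.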

5.2. Evaluation Metrics

The paper employs a variety of evaluation metrics to assess different aspects of the synthetic traces and the model's capabilities.

  • Structured Reasoning Accuracy:

    • Conceptual Definition: This metric quantifies the model's ability to generate structurally valid call graphs. A trace is considered accurate if it precisely matches specified user prompts (e.g., num_edges and depth) and adheres to all internal structural constraints of a call graph. These constraints include:
      1. Valid DAG Formations: The generated graph must be a directed acyclic graph.
      2. Appropriate Start/Finish Times: Communication start times must precede finish times for each edge.
      3. Parent-Child Time Consistency: A parent's start time must precede its child's start time, and its finish time must follow the child's finish time.
      4. Hierarchical ID Consistency: IDs (dot-decimal numbers) must follow hierarchical rules.
    • Mathematical Formula: This is a binary classification metric per generated trace: $\text{Accuracy} = \frac{\text{Number of structurally valid traces matching prompts}}{\text{Total number of traces generated}} \times 100\%$
    • Symbol Explanation:
      • Number of structurally valid traces matching prompts: The count of generated traces that pass all structural checks and correctly match the target num_edges and depth.
      • Total number of traces generated: The total count of synthetic traces attempted to be generated.
  • KL Divergence (Kullback-Leibler Divergence):

    • Conceptual Definition: Used to measure the similarity between the distribution of popular calls (defined by Source, Destination, Communication type) in synthetic traces and real traces. KL divergence quantifies how much one probability distribution diverges from another. A lower KL divergence indicates higher similarity between the two distributions.
    • Mathematical Formula: For two discrete probability distributions $P$ (real data) and $Q$ (synthetic data) over the same event space, the KL divergence is: $D_{KL}(P \| Q) = \sum_i P(i) \log\left(\frac{P(i)}{Q(i)}\right)$
    • Symbol Explanation:
      • $P(i)$: The probability of event $i$ occurring in the real data distribution.
      • $Q(i)$: The probability of event $i$ occurring in the synthetic data distribution.
      • $\sum_i$: Summation over all possible events $i$ (in this context, distinct call patterns).
      • $\log$: Natural logarithm. (Note: If $Q(i) = 0$ for some $i$ where $P(i) > 0$, $D_{KL}$ becomes infinite, indicating a severe mismatch. Practical implementations often handle this with smoothing.)
  • Heavy-hitter Prediction Accuracy:

    • Conceptual Definition: Measures the model's ability to generate traces containing the top-K heavy-hitter microservices (most frequently triggered microservices) that are present in corresponding real validation traces. This is crucial for resource optimization and anomaly detection.
    • Mathematical Formula: This is calculated as the proportion of intersection between the top-K sets of synthetic and real traces: $\text{Accuracy} = \frac{|\text{Top-K}(\text{Synthetic}) \cap \text{Top-K}(\text{Real})|}{K} \times 100\%$
    • Symbol Explanation:
      • $|\text{Top-K}(\text{Synthetic}) \cap \text{Top-K}(\text{Real})|$: The number of microservices that are present in both the top-K list from synthetic traces and the top-K list from real traces.
      • $K$: The number of top microservices being considered.
  • ROC AUC (Receiver Operating Characteristic Area Under the Curve):

    • Conceptual Definition: Used to evaluate the performance of anomaly detection models (e.g., TraceVAE) trained on synthetic data. AUC measures the ability of a binary classifier to distinguish between positive (anomalous) and negative (normal) classes. An AUC of 1 represents perfect classification, while 0.5 indicates performance no better than random guessing.
    • Mathematical Formula: The AUC is the area under the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds.
      • $\mathrm{TPR} = \text{Recall} = \frac{TP}{TP + FN}$
      • $\mathrm{FPR} = \frac{FP}{FP + TN}$
      • $\text{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}) \, d\mathrm{FPR}$
    • Symbol Explanation:
      • TP: True Positives (correctly identified anomalous traces).
      • FN: False Negatives (anomalous traces incorrectly identified as normal).
      • FP: False Positives (normal traces incorrectly identified as anomalous).
      • TN: True Negatives (correctly identified normal traces).
      • TPR: True Positive Rate (also known as Sensitivity or Recall).
      • FPR: False Positive Rate.
  • Instruction-following Accuracy:

    • Conceptual Definition: Measures how accurately the model generates call graphs that fulfill specific user instructions, such as generating "high latency" traces (above the 90th percentile of latency) or including "uncommon communications" (those occurring in less than 10% of training data).
    • Mathematical Formula: $\text{Accuracy} = \frac{\text{Number of generated traces meeting instruction criteria}}{\text{Total number of traces generated with instruction}} \times 100\%$
    • Symbol Explanation:
      • Number of generated traces meeting instruction criteria: The count of synthetic traces that satisfy the specific condition outlined in the instruction.
      • Total number of traces generated with instruction: The total count of synthetic traces generated when prompted with a specific instruction.
  • Normalized Earth Mover's Distance (EMD) / Wasserstein Distance:

    • Conceptual Definition: Used to compare the distribution similarities of microservice branching (in-degree and out-degree) and response times between real and synthetic traces. EMD is a metric that quantifies the minimum cost to transform one distribution into another. A lower EMD indicates greater similarity between the shapes and positions of the distributions. "Normalized" implies scaling the distance to a common range for comparability.
    • Mathematical Formula: For two probability distributions $P$ and $Q$ on a metric space $M$ with a distance function $d(x, y)$, the EMD (or 1-Wasserstein distance) is defined as: $W_1(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \mathbb{E}_{\gamma}[d(X, Y)] = \inf_{\gamma \in \Gamma(P, Q)} \int_{M \times M} d(x, y) \, d\gamma(x, y)$. For 1D distributions, it simplifies to: $W_1(P, Q) = \int_{-\infty}^{\infty} |F_P(x) - F_Q(x)| \, dx$, where $F_P$ and $F_Q$ are the cumulative distribution functions. Normalization might involve dividing by the range of values or maximum possible EMD.
    • Symbol Explanation:
      • $P, Q$: The two probability distributions being compared (e.g., real and synthetic distributions of in-degrees).
      • $M$: The metric space over which the distributions are defined.
      • $d(x, y)$: The ground distance between two points $x$ and $y$ in $M$.
      • $\Gamma(P, Q)$: The set of all joint distributions (or "couplings") $\gamma(x, y)$ whose marginal distributions are $P$ and $Q$.
      • $\mathbb{E}_{\gamma}[d(X, Y)]$: The expected value of the ground distance under the joint distribution $\gamma$.
      • $F_P(x), F_Q(x)$: Cumulative distribution functions of $P$ and $Q$.
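
To make the distribution-level metrics above concrete, here is a small pure-Python sketch of KL divergence, top-K heavy-hitter accuracy, and the 1D EMD computed from empirical samples. The additive smoothing constant `eps` and the function names are assumptions, the EMD is left un-normalized because the paper's normalization scheme is not specified here, and ties in the top-K ranking are broken by first occurrence.

```python
import bisect
import math
from collections import Counter


def kl_divergence(real, synthetic, eps=1e-9):
    """D_KL(P || Q) over observed events, with additive smoothing (an assumption)."""
    events = set(real) | set(synthetic)
    p_counts, q_counts = Counter(real), Counter(synthetic)
    p_total = len(real) + eps * len(events)
    q_total = len(synthetic) + eps * len(events)
    total = 0.0
    for event in events:
        p = (p_counts[event] + eps) / p_total
        q = (q_counts[event] + eps) / q_total
        total += p * math.log(p / q)
    return total


def top_k_accuracy(real, synthetic, k):
    """Percentage overlap between the k most frequent items of each sample."""
    top_real = {item for item, _ in Counter(real).most_common(k)}
    top_syn = {item for item, _ in Counter(synthetic).most_common(k)}
    return 100.0 * len(top_real & top_syn) / k


def emd_1d(xs, ys):
    """1-Wasserstein distance between two 1D samples: integral of |F_P(x) - F_Q(x)|."""
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        f_x = bisect.bisect_right(xs_sorted, left) / len(xs_sorted)
        f_y = bisect.bisect_right(ys_sorted, left) / len(ys_sorted)
        total += abs(f_x - f_y) * (right - left)
    return total


kl = kl_divergence(["x", "x", "y", "y"], ["x", "x", "x", "y"])
overlap = top_k_accuracy(["a"] * 3 + ["b"] * 2 + ["c"], ["a"] * 3 + ["c"] * 2 + ["b"], k=2)
distance = emd_1d([0, 0], [1, 1])
```

Both empirical CDFs are step functions that are constant between consecutive sample points, so summing over those intervals evaluates the 1D integral exactly for finite samples.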

5.3. Baselines

The paper compares its proposed method against several baselines depending on the specific evaluation task:

  • For Structured Reasoning Accuracy (§4.1):

    • Baseline (Llama-2 7B without recursion or instruction tuning): This model is a Llama-2 7B fine-tuned on text-encoded call graphs, following the GReaT method (Borisov et al., 2023) where call graph traces are represented as tabular data. It does not employ the recursive generation strategy or the intermediate instructions.
    • Recursive (Llama-2 7B with recursive generation but no instruction tuning): This model incorporates the recursive generation strategy but lacks the additional instruction tuning phase with intermediate reasoning steps.
  • For Similarity of Real and Synthetic Traces (§4.2) and ML-Driven Microservice Management (§4.3):

    • GReaT (Borisov et al., 2023) (Llama-2 7B + tabular format): This is the same baseline model described above, representing call graphs as a flat tabular format for a Llama-2 7B model.
    • Probabilistic model (Luo et al., 2021): A call graph generator from Alibaba that samples graph structures based on statistical distributions, such as communication types and the number of child nodes per depth. This model extends to time-related fields by sampling from training data statistics.
    • TVAE (Xu et al., 2019): A VAE-based tabular data generator. Since it cannot directly generate hierarchical traces, it is used only to compare edge distributions. Training is limited to 100K randomly selected samples.
  • For Adapting Models for Trace-Related Downstream Tasks (§4.5):

    • Standard Llama-2 7B: An unmodified Llama-2 7B model (without domain-specific pre-training on call graphs).
    • Llama-3.1 405B: A much larger, state-of-the-art LLM (Dubey et al., 2024). It is evaluated using in-context learning, where task descriptions and up to 16 examples are provided in the prompt, to assess its performance on domain-specific tasks without explicit fine-tuning on trace data.

6. Results & Analysis

The evaluation demonstrates the effectiveness of the proposed method across multiple dimensions: the quality of synthetic traces, their utility for machine learning tasks, and the model's ability to follow instructions and perform downstream trace-related tasks.

6.1. Core Results Analysis

6.1.1. Structured Reasoning Accuracy

This experiment evaluates how well the models generate structurally valid microservice call graphs that adhere to specified num_edges and depth constraints.
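
The validity checks behind this accuracy (the four constraints listed in Section 5.2, plus matching num_edges and depth) could look roughly like the sketch below. The edge representation is hypothetical: dictionaries with a dot-decimal `id`, `src`, `dst`, `start`, and `finish`. Treating depth as the number of components in the longest ID, using non-strict comparisons for parent and child times, and checking acyclicity over microservice nodes are all assumptions.

```python
from collections import defaultdict


def has_cycle(edges):
    """Kahn's algorithm over src -> dst links; leftover nodes indicate a cycle."""
    children, indegree, nodes = defaultdict(set), defaultdict(int), set()
    for e in edges:
        nodes.update((e["src"], e["dst"]))
        if e["dst"] not in children[e["src"]]:
            children[e["src"]].add(e["dst"])
            indegree[e["dst"]] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return visited != len(nodes)


def is_valid(edges, num_edges, depth):
    """Return True if the call graph matches the prompt and passes the structural checks."""
    if not edges or len(edges) != num_edges:
        return False
    by_id = {e["id"]: e for e in edges}
    if len(by_id) != len(edges):
        return False  # duplicate IDs
    for e in edges:
        if not e["start"] < e["finish"]:
            return False  # start must precede finish
        parts = e["id"].split(".")
        if len(parts) > 1:
            parent = by_id.get(".".join(parts[:-1]))
            if parent is None:
                return False  # hierarchical ID without a parent edge
            if not (parent["start"] <= e["start"] and e["finish"] <= parent["finish"]):
                return False  # child must run within the parent's time window
    if max(len(e["id"].split(".")) for e in edges) != depth:
        return False
    return not has_cycle(edges)


good = [
    {"id": "0", "src": "A", "dst": "B", "start": 0, "finish": 10},
    {"id": "0.1", "src": "B", "dst": "C", "start": 1, "finish": 5},
    {"id": "0.2", "src": "B", "dst": "D", "start": 6, "finish": 9},
]
late_child = [dict(good[0]), dict(good[1], finish=12), dict(good[2])]
looped = [dict(good[0]), dict(good[1], dst="A"), dict(good[2])]
```

Structured reasoning accuracy is then the percentage of generated traces for which a check of this kind returns True.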

  • Complexity Impact: The results (Figures 3a and 3b) clearly show that as the complexity of the desired call graph increases (more edges or greater depth), the baseline model's accuracy significantly drops, falling below 25% for graphs with over 15 edges and nearly to zero for depths exceeding 4. This highlights the baseline's struggle with complex structural constraints.

  • Recursive Generation Benefit: In contrast, the recursive generation model consistently maintains higher accuracies, showing an improvement of approximately 30% and 35% compared to the baseline for edge count and depth, respectively. This demonstrates the effectiveness of breaking down graph generation into simpler, recursive sub-tasks.

  • Instruction Tuning Enhancement: Further enhancement is observed with instruction tuning, which boosts model accuracy by 23% to 36%. This improvement comes from explicitly directing the model to adhere to specific generation instructions, such as the number of edges per layer, via intermediate reasoning steps.

    The following charts illustrate these findings:

    Figure 3: Line graphs showing the accuracy of different methods in generating microservice call graphs, analyzed with respect to the number of edges, depth, and temperature. The "Recursive + Instruction" method achieves the highest accuracy in every panel, reflecting the benefit of additional instruction tuning.

  • Decoding Temperature: Figure 3c illustrates the effect of decoding temperature on accuracy. Both models (baseline and recursive) show decreased performance as temperature increases (higher temperature leads to more creative/less deterministic output), but the recursive model consistently outperforms the baseline, maintaining about 10% higher accuracy even at a temperature of 1.

    Additional detailed analysis is provided in the appendix:

  • Heatmap Analysis (Figure 12): This heatmap visually confirms that recursive generation with instruction tuning (recursive + instruction) improves accuracy across most combinations of (# Edges, Depth), though some configurations may show lower accuracy due to training data distribution.

    Figure 12: Accuracy heatmap.

  • Intermediate Instructions Ablation (Figure 13): An ablation study shows that removing intermediate instructions during instruction tuning leads to an approximate 13% decrease in accuracy across all temperatures. This strongly supports the effectiveness of explicit reasoning steps in guiding the LLM to generate correct structures.

    Figure 13: Accuracy to generate correct call graph structures with and without intermediate instructions during instruction tuning.

  • Model Size Impact (Figure 14): Larger models generally achieve higher accuracy. For example, the Llama-2 13B model shows a 20 percentage point improvement over the 7B model for inputs with a depth greater than 4, indicating that more parameters help with handling deeper, more complex graphs.

    Figure 14: Accuracy of microservice call graph generation as a function of the number of edges (top), depth (middle), and sampling temperature (bottom) for Llama-3.2 1B, Llama-3.2 3B, Llama-2 7B, and Llama-2 13B. Larger models generally achieve higher accuracy, particularly as graph depth increases. The y-axis of each graph is accuracy (%).

  • Memorization (Figure 15): The proposed methods (Recursive and Recursive + Instruction) demonstrate significantly lower memorization (3% to 5%) of training data compared to the Baseline (16% to 24%). This indicates that the recursive approach not only generates more structured traces but also promotes greater diversity in the synthetic output.

    Figure 15: Proportion (%) of synthetic call graphs found in training data varying temperature parameters.
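
The memorization measurement, the proportion of synthetic call graphs that also appear in the training data, can be approximated as below. Which fields the paper compares when deciding that two graphs match is not stated here, so the edge tuples `(src, dst, type)` and the order-independent comparison are assumptions.

```python
def canonical(graph):
    """Order-independent fingerprint of a call graph given as a list of edge tuples."""
    return tuple(sorted(graph))


def memorization_rate(synthetic_graphs, training_graphs):
    """Percentage of synthetic graphs whose fingerprint occurs in the training set."""
    training_keys = {canonical(g) for g in training_graphs}
    hits = sum(canonical(g) in training_keys for g in synthetic_graphs)
    return 100.0 * hits / len(synthetic_graphs)


training = [
    [("A", "B", "rpc")],
    [("A", "B", "rpc"), ("B", "C", "mq")],
]
synthetic = [
    [("B", "C", "mq"), ("A", "B", "rpc")],  # same edges as a training graph, reordered
    [("A", "D", "rpc")],
    [("A", "B", "http")],
    [("C", "D", "mq"), ("D", "E", "rpc")],
]
rate = memorization_rate(synthetic, training)
```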

6.1.2. Similarity of Real and Synthetic Traces

This evaluation compares the quality of synthetic traces by measuring their distributional similarity to real traces from a validation dataset.

  • Distribution of Popular Calls (Figure 4a):

    • LLM-based approaches (ours and GReaT) show very low KL divergence (0.16 and 0.11, respectively) for the distribution of popular calls (Source, Destination, Communication type), indicating high similarity to real data.
    • The probabilistic model has a significantly higher KL divergence of 3.84, due to its random selection processes.
    • TVAE shows an intermediate divergence of 0.74, better than the probabilistic model but less accurate than LLM-based methods.
  • Heavy-hitter Prediction (Figure 4b): This assesses the generation of frequently triggered microservices.

    • Our method achieves over 90% accuracy for K ≤ 50 and 65% at K = 500.

    • The GReaT method also performs robustly, though slightly worse at larger K.

    • The probabilistic model starts at 59% accuracy for K=10 and drops to 23% at K=500, highlighting its inability to capture heavy-hitter dynamics.

      The following charts illustrate these findings:

      Figure 4: Distribution similarity between real and synthetic traces.

  • Branching and Latency Distributions (Figure 16): Using normalized Earth Mover's Distance (EMD), our method consistently achieves the closest results to the training data for in-degree (communications received), out-degree (communications initiated), and response time distributions. It achieves a 2.6x to 10x reduction in EMD compared to GReaT and the probabilistic model, demonstrating superior capture of microservice interaction complexity.

    Figure 16: Distribution similarities in microservice branching (in-degree and out-degree) and response times between real and synthetic traces.

6.1.3. Synthetic Data as Training Data for ML-Driven Microservice Management

This section demonstrates the practical utility of synthetic traces by using them to train ML models for critical microservice management tasks.

  • Critical Component Extraction (FIRM): For predicting critical microservices using SVMs (as in FIRM), SVMs trained on our synthetic data achieve near-identical accuracy (less than 1.5 percentage points difference) compared to those trained on real data. In contrast, SVMs trained on synthetic traces from baselines (GReaT, probabilistic) show a large performance gap (6 to 81 percentage points).

  • Anomaly Detection (TraceVAE): For detecting anomalous microservices using VAE (TraceVAE), models trained on our synthetic traces yield results comparable to those trained on real data, as measured by ROC AUC.

    The results are summarized in Figure 5:

    Figure 5: ML Model Performance (real vs. synthetic traces).

  • Other Classification Tasks (Table 2): Fine-tuned Llama-2 7B models used for high latency prediction and uncommon communications prediction (without latency info in input) show only a slight accuracy drop when trained on synthetic data (e.g., 67.1% vs 68.3% for high latency) compared to real data, confirming the effectiveness of synthetic traces for real-world tasks.

    The following are the results from Table 2 of the original paper:

    | Accuracy (%) | High Latency | Uncommon Communications |
    | --- | --- | --- |
    | Real | 68.3% | 65.3% |
    | Synthetic | 67.1% | 62.5% |
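
For the anomaly detection comparison in this subsection, ROC AUC can be computed directly from anomaly scores without tracing the curve, using its equivalent pairwise-ranking form. The labels and scores below are made-up illustrative values, not results from the paper, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random anomalous trace outscores a random normal one.

    Ties between an anomalous and a normal score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative data: 1 marks an anomalous trace, 0 a normal one.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of traces, which is fine for a sketch. A rank-based implementation would scale better on large evaluation sets.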

6.1.4. Instruction-following Capability

This evaluates the model's ability to generate traces with user-specified characteristics, including uncommon situations.

  • High Latency and Uncommon Communications (Figure 6): The instruction-tuned model demonstrates high accuracy in generating traces according to specific instructions like "high latency" or "uncommon communications." Crucially, it also performs well when both instructions are combined, even though it was not explicitly trained on such combinations. This showcases its generalization and reasoning capabilities.

    Figure 6: Instruction-following accuracy (%). Figure 7: Downstream task accuracy (%).
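
As an illustration of how the "high latency" criterion (above the 90th percentile of latency) might be scored, the sketch below derives the threshold from reference latencies and counts how many generated traces exceed it. The nearest-rank percentile definition, the use of training latencies as the reference, and both function names are assumptions.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct / 100 * n)
    return ordered[rank - 1]


def high_latency_accuracy(generated_latencies, reference_latencies, pct=90):
    """Percentage of generated traces whose latency exceeds the reference percentile."""
    threshold = nearest_rank_percentile(reference_latencies, pct)
    hits = sum(latency > threshold for latency in generated_latencies)
    return 100.0 * hits / len(generated_latencies)


reference = list(range(1, 101))  # illustrative latencies 1..100
generated = [95, 50, 120, 91]    # latencies of traces generated under the instruction
accuracy = high_latency_accuracy(generated, reference)
```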

6.1.5. Adapting Models for Trace-Related Downstream Tasks

This section highlights the utility of the pre-trained model for various trace-related downstream tasks, where partial information is available.

  • Predicting Uncommon Communications (Figure 7, left):
    • Our trace-pretrained model achieves 76.8% accuracy.
    • The original Llama-2 model (without trace training) only gets 60.6%.
    • Llama-3.1 405B with in-context learning performs even worse at 45.6%, underscoring the need for domain-specific training for such tasks.
  • Infilling Missing Data (Figure 7, middle and right):
    • Missing Attribute Infilling: Our model achieves over 70% accuracy in predicting missing attributes (e.g., communication type, destination microservice), significantly outperforming baselines by 30-40%.

    • Missing Call (Connecting Edge) Infilling: For the more complex task of infilling a missing edge between parent and child layers, our model maintains a high accuracy of 66%. In contrast, the original Llama-2 scores only 24%, and Llama-3.1 405B reaches 34%. This demonstrates the robustness of our model in complex infilling scenarios.

      The results are part of the composite Figure 7, which is shown alongside Figure 6 in the previous point. The right part of Figure 7 specifically illustrates the downstream task accuracy.

6.2. Data Presentation (Tables)

The paper presents experimental data primarily through figures. Two tables are reproduced here: Table 1 lists the training setup and hyperparameters, and Table 2, from the appendix, summarizes results for additional classification tasks.

The following are the results from Table 1 of the original paper:

| Model | Hyperparameter | Value |
| --- | --- | --- |
| Pre-Training & Instruction Tuning | Optimizer | AdamW (Loshchilov and Hutter, 2017) |
| | Learning rate | 3e-4 with cosine scheduler |
| | Batch size | 64 |
| | Gradient clipping | 1.0 |
| Downstream Task Fine-tuning | Optimizer | AdamW |
| | Learning rate | 1e-4 with cosine scheduler |
| | Batch size | 2 |
| | Gradient clipping | 1.0 |

Table 1: Training setup and hyperparameters.
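
Table 1 names a cosine learning-rate scheduler without giving further details. A minimal sketch of one common form, plain cosine decay from the peak rate, is shown below. The absence of warmup, the decay to a minimum of zero, and the `total_steps` value are assumptions rather than settings reported by the paper.

```python
import math


def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical run length of 1000 optimizer steps.
start_lr = cosine_lr(0, 1000)
mid_lr = cosine_lr(500, 1000)
end_lr = cosine_lr(1000, 1000)
```

For downstream task fine-tuning, the same function would be called with `peak_lr=1e-4`, the rate listed in the table.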

The following are the results from Table 2 of the original paper:

| Accuracy (%) | High Latency | Uncommon Communications |
| --- | --- | --- |
| Real | 68.3% | 65.3% |
| Synthetic | 67.1% | 62.5% |

Table 2: Accuracy of prediction tasks by fine-tuning Llama-2 7B with real and synthetic traces.

6.3. Ablation Studies / Parameter Analysis

  • Intermediate Instructions Ablation: As discussed in 6.1.1, the ablation study in Figure 13 (Appendix D.1) clearly shows that removing intermediate instructions during instruction tuning reduces call graph generation accuracy by approximately 13% across various sampling temperatures. This confirms that these explicit reasoning steps are crucial for the model to adhere to structural constraints.

  • Model Size Analysis: Section D.2 and Figure 14 demonstrate that increasing model size (from 1B to 13B parameters) generally improves generation accuracy, particularly for more complex traces (e.g., deeper graphs). This suggests that larger models have a greater capacity to learn and apply the complex rules required for accurate trace generation.

  • Decoding Temperature Analysis: Figures 3c and 13 show the impact of decoding temperature. Lower temperatures generally lead to higher accuracy (more deterministic generation), while higher temperatures decrease accuracy but can increase diversity. The recursive model consistently maintains higher accuracy than the baseline even at higher temperatures.

  • Memorization Analysis: Section D.3 and Figure 15 analyze memorization. The recursive generation methods (with and without instruction tuning) significantly reduce the proportion of generated traces that exactly match training data (3-5%) compared to the non-recursive baseline (16-24%). This indicates that the recursive approach promotes diversity rather than just replicating existing examples.

    These analyses effectively validate the design choices, particularly recursive generation and instruction tuning, and shed light on how architectural and hyper-parameter choices influence performance.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully demonstrates a novel and effective method for generating realistic microservice call graphs using large language models. By adapting pre-trained LLMs through recursive call graph generation and instruction tuning, the authors overcome the limitations of previous synthetic trace generation techniques, particularly their inability to handle complex hierarchical structures and enforce implicit constraints. The proposed approach significantly outperforms baselines in generating accurate, structurally valid, and diverse call graphs that exhibit high distributional similarity to real-world traces. Crucially, the synthetically generated traces are shown to be a viable substitute for real data in training machine learning models for critical microservice management tasks, such as identifying critical components and detecting anomalies. Furthermore, the fine-tuned LLM effectively adapts to downstream trace-related tasks like predicting key features and infilling missing data, demonstrating its versatile utility beyond mere generation.

7.2. Limitations & Future Work

The authors acknowledge several limitations of their current work and propose directions for future research:

  • Discarding Previously Generated Edges in Recursive Method: The current recursive method processes layers sequentially, but it discards previously generated edges, only passing conditioning information from the prior layer to the next. While this has minimal impact on microservice call graph generation (where direct neighbors are most influential), it might limit the capture of longer-range dependencies and temporal patterns.
    • Future Work: Incorporating past information, such as prior layers or a time series of call graph traces, to enhance the capture of longer-range dependencies. A key challenge here is efficiently compressing historical trace information while preserving critical details.
  • Manually Constructed Instruction Templates: The method relies on manually constructed instruction templates for instruction tuning. This manual process may lead to sub-optimal generation quality as it might not fully leverage the vast potential of LLMs pre-trained with trillions of tokens.
    • Future Work: Diversifying instructions using LLM-generated outputs (following methods from Liu et al., 2024; Gunasekar et al., 2023; Li et al., 2024). This would involve guiding LLMs to generate instructions for trace generation, integrating domain-specific knowledge to ensure the usefulness and diversity of these generated instructions for downstream tasks.
  • Scope Limited to Microservice Call Graphs: The current work primarily focuses on microservice call graphs. While other system traces, such as operating system (OS) call graphs, share a similar hierarchical structure, they often present greater depth and a wider diversity of node and edge types.
    • Future Work: Evaluating the applicability and effectiveness of the proposed approach to other types of hierarchical system traces, like OS call graphs, to assess its generalizability.

7.3. Personal Insights & Critique

This paper presents an elegant and practically significant application of LLMs to a challenging domain. The innovation of combining recursive generation with instruction tuning for structured data generation is particularly insightful.

  • Strengths and Innovations:

    • Novelty in Application: Applying LLMs to hierarchical system traces like microservice call graphs is a fresh and impactful direction. It moves beyond typical text or code generation, showcasing the versatility of LLMs.
    • Elegant Solution for Structure: The recursive generation strategy is a clever way to decompose a complex graph generation problem into manageable, sequential steps, directly addressing the challenge of arbitrary hierarchical structures.
    • Effective Constraint Enforcement: The use of intermediate instructions during instruction tuning is a powerful mechanism. It explicitly leverages the LLM's reasoning capabilities to enforce complex, implicit constraints that are often difficult for traditional generative models to learn implicitly, thus ensuring structural validity. This is a practical example of "teaching" the LLM to "think" in steps for structured output.
    • Demonstrated Practical Value: The rigorous evaluation, especially showing that synthetic traces can effectively replace real data for training ML-driven microservice management tasks, elevates the work from theoretical interest to immediate practical utility for system designers and operators. The low memorization rate also ensures that the generated data is genuinely novel and diverse.
  • Potential Issues, Unverified Assumptions, or Areas for Improvement:

    • Scalability for Extremely Deep Graphs: While the recursive approach handles depth better than baselines, it still operates sequentially. For extremely deep (e.g., hundreds or thousands of layers, common in some OS call graphs) and wide graphs, the sequence length might become a limiting factor for LLMs, even with techniques like LoRA. The "discarding previously generated edges" limitation might become more prominent here.
    • Generalizability of Instruction Templates: While the paper mentions manually constructed templates as a limitation, the quality and coverage of these templates are crucial. If the templates miss subtle domain-specific constraints or are not exhaustive, the model's generated traces might still lack realism in edge cases. The future work on LLM-generated instructions is critical here.
    • Computational Cost: Fine-tuning Llama-2 7B on 1.36 million samples (1.1B tokens) is computationally intensive, even with LoRA. While impressive for research, deploying and continuously fine-tuning such models for diverse real-world scenarios might be costly for smaller organizations.
    • Debugging Generated Traces: When a generated trace is "invalid," diagnosing why it's invalid can be challenging. The intermediate instructions offer some insight, but pinpointing the exact reasoning failure within a large LLM remains an open problem.
    • Ethical Considerations: Although the data is anonymized, the paper touches on the risk of exposing privacy-related information if sensitive data were used. This is a perpetual concern with generative models and large datasets.
  • Transferability and Applications:

    • The core methodology (textual encoding + recursive generation + instruction tuning for constraints) is highly transferable. It could be applied to:

      • Network Packet Traces: Generating more realistic network traffic with specific protocol hierarchies and temporal dependencies.
      • Operating System Call Graphs: As mentioned by authors, for security analysis or performance debugging.
      • Biological Pathway Models: Representing complex biochemical pathways with hierarchical interactions and temporal constraints.
      • Software Dependency Graphs: Generating realistic software module dependencies for testing build systems or analyzing architectural vulnerabilities.
      • Supply Chain Networks: Simulating complex, multi-layered supply chain interactions with various constraints.
    • The ability to generate "uncommon situations" is a powerful tool for stress testing, security vulnerability analysis, and creating diverse datasets for anomaly detection, which are typically hard to collect from real systems.

      This paper successfully showcases the potential of LLMs to not just generate human-like text, but to become sophisticated tools for modeling and simulating complex, structured data in critical engineering domains.
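
The recursive generation strategy discussed above can be illustrated with a minimal sketch. This is not the authors' implementation: the fine-tuned LLM is replaced by a hypothetical deterministic stub (`toy_layer`), and the names `LayerRequest`, `generate_call_graph`, and the edge/depth budget fields are assumptions chosen for illustration. The sketch only shows the control flow: each step generates one layer conditioned on the caller and the remaining budgets, and a simple check rejects layers that violate the edge budget.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Edge = Tuple[str, str]  # (caller, callee)


@dataclass
class LayerRequest:
    """Conditions for generating one layer (hypothetical fields)."""
    caller: str
    remaining_edges: int   # edge budget for the subtree rooted at caller
    remaining_depth: int   # how many more layers may be generated


# A layer generator returns (callee, edge budget handed to that callee) pairs.
# In the paper's setting this role is played by the fine-tuned LLM.
LayerGenerator = Callable[[LayerRequest], List[Tuple[str, int]]]


def generate_call_graph(root: str, num_edges: int, max_depth: int,
                        generate_layer: LayerGenerator) -> List[Edge]:
    """Build a call graph one layer at a time instead of in a single pass."""
    edges: List[Edge] = []
    stack = [LayerRequest(root, num_edges, max_depth)]
    while stack:
        req = stack.pop()
        if req.remaining_depth == 0 or req.remaining_edges == 0:
            continue  # nothing left to generate under this caller
        children = generate_layer(req)
        # Each child costs one edge plus whatever budget it is handed.
        used = sum(1 + budget for _, budget in children)
        if used > req.remaining_edges:
            raise ValueError(f"layer under {req.caller!r} exceeds edge budget")
        for callee, budget in children:
            edges.append((req.caller, callee))
            stack.append(LayerRequest(callee, budget, req.remaining_depth - 1))
    return edges


def toy_layer(req: LayerRequest) -> List[Tuple[str, int]]:
    """Deterministic stand-in for the model: at most two callees per caller,
    with the leftover edge budget split between them."""
    n = min(2, req.remaining_edges)
    share, extra = divmod(req.remaining_edges - n, n)
    return [(f"{req.caller}.{i}", share + (1 if i < extra else 0))
            for i in range(n)]


if __name__ == "__main__":
    for caller, callee in generate_call_graph("svc", 6, 3, toy_layer):
        print(f"{caller} -> {callee}")
```

Because every step sees only a small request, swapping the stub for a model call would keep each prompt short regardless of total graph size, which is the decomposition benefit the analysis attributes to recursive generation.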
