
Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning

Published: 03/28/2024

TL;DR Summary

TraFaultDia uses meta-learning for few-shot abnormal trace classification in microservices, enabling accurate fault categorization and root cause identification across systems, validated on two open datasets.

Abstract

Microservice-based systems (MSS) may fail with various fault types. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human efforts are still required for diagnosing specific fault types and failure causes. This paper presents TraFaultDia, a novel AIOps framework to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for a MSS. TraFaultDia leverages meta-learning to train on several abnormal trace classification tasks with a few labeled instances from a MSS, enabling quick adaptation to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia's use cases are scalable depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two MSS, TrainTicket and OnlineBoutique, with open datasets where each fault category is linked to faulty system components (service/pod) and a root cause. TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. TraFaultDia achieves 93.26% and 85.20% accuracy on 50 new classification tasks for TrainTicket and OnlineBoutique, respectively, when trained within the same MSS with 10 labeled instances per category. In the cross-system context, when TraFaultDia is applied to a MSS different from the one it is trained on, TraFaultDia gets an average accuracy of 92.19% and 84.77% for the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, also with 10 labeled instances provided for each fault category per task in each system.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning

1.2. Authors

  • YUQING WANG, University of Helsinki, Finland
  • MIKA V. MÄNTYLÄ, University of Helsinki, Finland and University of Oulu, Finland
  • SERGE DEMEYER, University of Antwerp, Belgium
  • MUTLU BEYAZIT, University of Antwerp, Belgium
  • JOANNA KISAAKYE, University of Antwerp, Belgium
  • JESSE NYYSSÖLÄ, University of Helsinki, Finland

1.3. Journal/Conference

This paper is published as "Proc. ACM Softw. Eng. 2, FSE, Article FSE027 (July 2025)", i.e., in the Proceedings of the ACM on Software Engineering (PACMSE), FSE issue. This corresponds to the ACM International Conference on the Foundations of Software Engineering (FSE), a top-tier venue that is highly reputable and influential in the software engineering and AIOps communities.

1.4. Publication Year

2025 (as indicated in the ACM Reference Format). The listed publication date of 2024-03-27 (UTC) suggests the paper was made available as a preprint or accepted for publication around that time, with the official proceedings year being 2025.

1.5. Abstract

Microservice-based systems (MSS) frequently experience failures of diverse types. While existing AIOps methods are adept at detecting abnormal traces and pinpointing responsible services, they still necessitate significant human intervention for diagnosing specific fault types and their underlying causes. This paper introduces TraFaultDia, an innovative AIOps framework designed to automatically classify abnormal traces into predefined fault categories within MSS. The framework frames this classification as a series of multi-class classification tasks. TraFaultDia employs a meta-learning approach, training on various abnormal trace classification tasks with a limited number of labeled instances from a single MSS. This enables it to rapidly adapt to new, unseen abnormal trace classification tasks across different MSS, requiring only a few labeled examples. Its utility is flexible, adapting to how fault categories are structured from anomalies within an MSS. Evaluated on two benchmark MSS, TrainTicket and OnlineBoutique, using open datasets where fault categories are linked to faulty components (service/pod) and root causes, TraFaultDia successfully automates the classification of abnormal traces. This facilitates the automatic identification of faulty system components and root causes, bypassing manual analysis. The framework achieves high accuracy: 93.26% for TrainTicket and 85.20% for OnlineBoutique on 50 new classification tasks when trained and tested within the same MSS (with 10 labeled instances per category). In a cross-system context, TraFaultDia maintains strong performance, achieving an average accuracy of 92.19% and 84.77% for TrainTicket and OnlineBoutique, respectively, across the same 50 new tasks, also with 10 labeled instances per fault category per task.

2. Executive Summary

2.1. Background & Motivation

The proliferation of microservice-based systems (MSS) has brought about increased complexity and dynamic behavior, making them prone to various fault types. Traditional Artificial Intelligence for IT Operations (AIOps) methods have made significant strides in detecting abnormal traces and identifying the specific services responsible for these anomalies. However, a critical gap persists: these methods typically do not extend to automatically diagnosing the specific fault types or conducting in-depth root cause analysis (RCA) to explain why a failure occurred. This deficiency necessitates substantial human effort from practitioners, who must manually analyze detected abnormal traces to determine fault categories and underlying causes. This manual process is time-consuming, requires deep expertise in software architecture and operational behaviors, and becomes increasingly infeasible as MSS grow in complexity and data volume. The inability to promptly and automatically categorize abnormal traces leads to delayed resolutions, increased downtime, and higher operational costs.

The core problem the paper aims to solve is this automation gap in fault diagnosis for abnormal traces in MSS. The existing AIOps methods can tell "what is wrong" (abnormal trace detected) and "where it is wrong" (responsible service located), but not "what kind of wrong" (fault category) or "why it is wrong" (root cause). The paper's innovative idea is to leverage meta-learning to enable an AIOps framework to automatically classify abnormal traces into specific fault categories, even in dynamic and heterogeneous MSS environments, and with limited labeled data.

2.2. Main Contributions / Findings

The paper introduces TraFaultDia, a novel AIOps framework with several key contributions:

  • Automatic Fault Categorization: TraFaultDia automatically classifies abnormal traces into specific fault categories for MSS, addressing a critical gap in existing AIOps methods that primarily focus on detection and localization. This enables automatic identification of faulty system components and root causes, reducing manual analysis.
  • Unsupervised Multi-Modal Trace Representation: The framework employs an unsupervised approach (AttenAE) to fuse high-dimensional, multi-modal trace-related data (including textual, time-based, and identity attributes from spans and logs) into compressed yet effective trace representations. This tackles the challenge of complex, diverse trace data (C2) and offers computational efficiency compared to graph-based methods.
  • Meta-Learning for Adaptability (Few-Shot & Cross-System): TraFaultDia utilizes a Transformer-Encoder based Model-Agnostic Meta-Learning (TEMAML) model. This enables few-shot learning, allowing the framework to classify abnormal traces from rare fault categories with only a few labeled instances (C3). Crucially, it also provides strong transfer learning capabilities, allowing quick adaptation to new, unseen abnormal trace classification tasks across different, heterogeneous MSS (C1).
  • Empirical Validation: Extensive evaluation on two representative benchmark MSS (TrainTicket and OnlineBoutique) with open datasets demonstrates the framework's effectiveness.
    • Within-system adaptability (RQ1): Achieves high average accuracy of 93.26% (TrainTicket) and 85.20% (OnlineBoutique) on 50 new tasks with only 10 labeled instances per category.

    • Cross-system adaptability (RQ2): Demonstrates robust performance with average accuracies of 92.19% (TrainTicket) and 84.77% (OnlineBoutique) when trained on one MSS and applied to another, also with 10 labeled instances per category.

      These findings highlight TraFaultDia's capability to significantly enhance the efficiency and effectiveness of RCA in complex MSS environments by automating a crucial diagnostic step.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the TraFaultDia framework, a foundational understanding of several key concepts is essential:

  • Microservice-Based Systems (MSS):
    • Conceptual Definition: An architectural style where a complex application is composed of small, independent services, each running in its own process and communicating with lightweight mechanisms, often HTTP APIs. These services are typically organized around business capabilities and can be deployed independently.
    • Relevance: The paper addresses the challenges inherent in MSS, such as their complex and dynamic nature, which leads to various fault types and makes traditional fault diagnosis difficult.
  • Artificial Intelligence for IT Operations (AIOps):
    • Conceptual Definition: The application of Artificial Intelligence (AI) and Machine Learning (ML) techniques to IT operations tasks, including monitoring, anomaly detection, root cause analysis, and automation.
    • Relevance: The paper aims to advance AIOps capabilities by addressing the gap in automated fault categorization within MSS, moving beyond simple anomaly detection and localization.
  • Traces, Spans, and Logs: These are fundamental concepts for understanding system behavior and diagnosing issues in distributed systems like MSS.
    • Traces: Represent the end-to-end path of a single user request or transaction as it propagates through various services in an MSS. They provide a holistic view of an operation.
    • Spans: The basic unit of work within a trace. Each span represents an individual operation performed by a particular service. Spans have a parent-child relationship, forming a hierarchical tree structure (as shown in Figure 1). Spans carry attributes like service name, operation name, start time, end time, and unique Span IDs (e.g., a480f2.0, a480f2.1 where .1 is a child of .0).
    • Logs: Textual records generated by service instances during their execution. They capture detailed events, status information, errors, and debugging messages at specific points in time. Logs are often associated with particular spans or traces (as shown in Figure 2).
    • Relevance: TraFaultDia leverages information from both spans and logs, recognizing their multi-modal nature, to construct comprehensive trace representations for fault categorization.
  • Root Cause Analysis (RCA):
    • Conceptual Definition: The process of identifying the fundamental reason for a problem or failure. In IT operations, RCA seeks to determine why an issue occurred, beyond just identifying the symptoms.
    • Relevance: The paper aims to automate a significant part of RCA by automatically classifying abnormal traces into specific fault categories, thereby enabling faster identification of faulty components and root causes without manual effort.
  • Meta-Learning (Learning to Learn):
    • Conceptual Definition: A field of machine learning where the goal is to train models that can learn new tasks or adapt to new environments quickly and efficiently, often with limited training data. It's about learning the inductive biases that allow for rapid generalization.
    • Key Concepts:
      • Few-shot Learning: A sub-field of meta-learning where models are trained to perform well on new tasks given only a very small number of labeled examples (the "shots") for each new class. This is crucial for handling rare fault categories (C3).
      • Transfer Learning: The process of taking knowledge gained from one task or domain and applying it to a different but related task or domain. In this paper, it refers to transferring learning from one MSS to another (C1).
      • N-Way K-Shot: A common setup in few-shot learning. $N$ refers to the number of distinct classes (fault categories) in a given task, and $K$ refers to the number of labeled examples ("shots") available per class in the support set for adaptation.
    • Relevance: TraFaultDia uses meta-learning to achieve few-shot and cross-system adaptability, addressing the challenges of imbalanced abnormal trace distribution (C3) and MSS heterogeneity (C1).
  • Autoencoders (AE):
    • Conceptual Definition: A type of artificial neural network used for unsupervised learning of efficient data codings (representations). It consists of two parts: an encoder that compresses the input data into a lower-dimensional latent space representation, and a decoder that reconstructs the original input from this compressed representation. The network is trained to minimize the reconstruction error.
    • Relevance: The AttenAE component in TraFaultDia is an autoencoder, specifically designed to fuse high-dimensional, multi-modal trace-related data (C2) into unified, low-dimensional representations in an unsupervised manner.
  • Attention Mechanism, Multi-Head Attention, and Self-Attention:
    • Conceptual Definition: A mechanism that allows a neural network to focus on specific parts of its input when processing information. Instead of processing the entire input equally, attention assigns weights to different input elements, highlighting the most relevant ones.
      • Attention Formula: The fundamental scaled dot-product attention mechanism is defined as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
        • $Q$ (Query), $K$ (Key), $V$ (Value): These are matrices derived from the input data. The query is used to attend to keys, and the attention weights are then applied to the values.
        • $QK^T$: Computes the dot product between the query and all keys, measuring their similarity.
        • $\sqrt{d_k}$: A scaling factor (square root of the dimension of the keys) used to prevent the dot products from becoming too large, which could push the softmax function into regions with tiny gradients.
        • softmax: Normalizes the scores into probability distributions, indicating how much attention to pay to each value.
        • $V$: The values are weighted by the softmax probabilities and summed up.
      • Multi-Head Attention: An extension where the attention mechanism is performed multiple times in parallel, each with different learned linear projections of Q, K, V. The outputs from these multiple attention heads are then concatenated and linearly transformed. This allows the model to jointly attend to information from different representation subspaces at different positions.
      • Self-Attention: A special case of attention where the query, key, and value all come from the same sequence of data. This allows the model to weigh the importance of different elements within the same input sequence to generate an output. (A minimal code sketch of these mechanisms appears after this concept list.)
    • Relevance: TraFaultDia uses Multi-Head Attention within AttenAE to fuse span and log data, and Self-Attention within the Transformer-Encoder (TE) to process trace representations, enabling it to recognize and integrate the most relevant features.
  • Transformer-Encoder:
    • Conceptual Definition: The encoder part of the Transformer architecture, a neural network model primarily known for its efficiency in handling sequential data, particularly in natural language processing (NLP). It relies entirely on self-attention mechanisms to draw global dependencies between input and output, rather than using recurrent or convolutional layers.
    • Relevance: The TEMAML component uses a Transformer-Encoder (TE) as its base model, chosen for its effectiveness in recognizing and integrating features from latent trace representations due to its self-attention mechanism.
  • Neural Representations (BERT, WordPiece Tokenization):
    • Conceptual Definition: Methods to convert textual data (like service operations or log messages) into dense numerical vectors (embeddings) that capture their semantic meaning.
      • BERT (Bidirectional Encoder Representations from Transformers): A powerful pre-trained NLP model that generates contextualized word embeddings, meaning the embedding for a word depends on its surrounding words.
      • WordPiece Tokenization: A subword tokenization algorithm used by BERT that breaks words into smaller, frequently occurring subword units. This helps handle out-of-vocabulary (OOV) words and reduces the vocabulary size.
    • Relevance: TraFaultDia uses these techniques to create robust vector representations for textual attributes from spans and logs, addressing the challenge of diverse and evolving textual data in MSS that may contain OOV words.
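
To make the attention formulas above concrete, here is a minimal NumPy sketch of scaled dot-product attention and its self-attention special case. The function names, shapes, and random inputs are illustrative assumptions, not code from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to every key
    weights = softmax(scores, axis=-1)  # attention distribution over the values
    return weights @ V                  # weighted sum of value vectors

# Self-attention: Q, K, and V all come from the same sequence X.
X = np.random.randn(6, 16)              # a sequence of 6 elements, dimension 16
out = attention(X, X, X)                # shape (6, 16)
```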

3.2. Previous Works

The paper contextualizes its contributions by discussing existing research in trace representation and trace classification.

  • Trace Representation:

    • Graph Neural Networks (GNNs) for Trace Graphs: Many existing AIOps methods (e.g., [Chen et al. 2023, 2022; Raeiszadeh et al. 2023; Zhang et al. 2022a,b]) construct trace graphs, where spans are nodes and interactions are edges. These graphs capture complex service interactions and are effective for detecting abnormalities and locating faulty services.
      • Differentiation from TraFaultDia: The paper argues GNNs are not suitable for its context due to high computational expense and scalability issues, especially with traces having hundreds of spans and dynamic MSS environments where graphs would need frequent, burdensome updates. TraFaultDia opts for unified, compressed trace representations without explicit graph construction, making it more scalable.
    • Sequence Models for Traces: Some anomaly detection studies (e.g., [Du et al. 2023; Kohyarnejadfard et al. 2022; Nedelkoski et al. 2019b]) treat traces as sequences of span attributes and use models like Long Short-Term Memory (LSTM) networks to model these sequences.
      • Differentiation from TraFaultDia: These approaches typically overlook logs, which the paper identifies as a crucial modality for recognizing many fault categories (e.g., code errors, configuration issues). TraFaultDia explicitly fuses both span and log data.
  • Trace Classification:

    • Binary Classification for Anomaly Detection: Many MSS anomaly detection studies (e.g., [Kohyarnejadfard et al. 2019; Kong et al. 2024; Zhang et al. 2022b]) use binary classification to distinguish between normal and abnormal traces.
      • Differentiation from TraFaultDia: This is distinct from TraFaultDia's goal, which is multi-class classification of already detected abnormal traces into specific fault categories.
    • Multi-Class Classification (Nedelkoski et al. 2019a): The study by Nedelkoski et al. (2019a) is cited as the only related work that performs multi-class classification of abnormal traces. It uses a Convolutional Neural Network (CNN) to classify traces into four time series-based fault categories (incremental, mean shift, gradual increase, cylinder) by characterizing traces as sequences of span call path attributes.
      • Differentiation from TraFaultDia: This approach is deemed insufficient for TraFaultDia's broader scope because:
        • It only considers time-series data on call path, ignoring other critical multi-modal data (logs, other span attributes) vital for diagnosing a wider range of fault types (C2).
        • It lacks mechanisms to address MSS heterogeneity (C1) or few-shot learning for rare fault categories (C3), which TraFaultDia explicitly tackles with meta-learning.

3.3. Technological Evolution

The field of IT operations has evolved from manual monitoring and rule-based systems to sophisticated AIOps solutions. Early AIOps focused on basic anomaly detection (identifying "if" something is wrong) and simple alerting. As distributed systems became more complex, particularly with the rise of MSS, the need for understanding the flow of requests led to the adoption of distributed tracing (traces, spans, logs). This enabled AIOps to move towards localizing issues (identifying "where" something is wrong), often using techniques like GNNs to map service dependencies or sequence models to identify anomalous patterns in execution flows.

The current paper represents an evolution beyond detection and localization towards automated RCA and diagnosis (identifying "what kind of wrong" and "why"). It addresses the limitations of prior approaches by:

  1. Comprehensive Data Utilization: Moving beyond single-modality (e.g., only spans) to multi-modal data fusion (spans + logs).
  2. Scalability for Dynamics: Offering an alternative to computationally heavy GNNs for dynamic MSS.
  3. Adaptability for Heterogeneity: Incorporating meta-learning to overcome the challenges of diverse MSS architectures and the dynamic appearance of new, rare fault types, a capability largely absent in prior multi-class classification attempts for MSS.

3.4. Differentiation Analysis

The core differences and innovations of TraFaultDia compared to existing methods lie in its comprehensive approach to fault categorization:

  • Problem Scope: While previous works primarily focused on binary anomaly detection (normal vs. abnormal) or localizing faulty services, TraFaultDia tackles the more complex multi-class classification of already abnormal traces into specific, human-understandable fault categories and root causes.

  • Trace Representation:

    • Multi-Modal Fusion: Unlike methods that only use spans or logs, TraFaultDia explicitly fuses textual, time-based, and identity attributes from both spans and logs using an AttenAE with a Multi-Head Attention mechanism. This addresses C2 (high-dimensional, multi-modal data) and ensures a richer, more effective representation for detailed fault diagnosis.
    • Efficiency over Graphs: It avoids the computational burden and update complexity of GNNs by creating compressed, unified representations without requiring explicit graph construction or frequent re-computation of dependencies.
  • Learning Paradigm:

    • Meta-Learning (MAML): This is the most significant innovation. TraFaultDia employs MAML with a Transformer-Encoder (TEMAML) to achieve few-shot learning and cross-system adaptability. This directly addresses C1 (MSS heterogeneity) and C3 (imbalanced abnormal trace distribution) by enabling the model to quickly adapt to new MSS and rare fault categories with minimal labeled data. Prior multi-class classification attempts for MSS (like the CNN approach) lacked this meta-learning capability, making them less robust and adaptable across diverse systems or to novel fault types.
  • Unsupervised Representation Learning: The use of AttenAE allows TraFaultDia to learn effective trace representations from unlabeled traces, which are abundant in MSS, making the approach practical and scalable.

    In essence, TraFaultDia fills a crucial gap by providing an AIOps framework that not only detects and locates but also diagnoses specific fault categories and root causes, doing so efficiently, with limited labeled data, and across diverse MSS.

4. Methodology

4.1. Principles

The core idea behind TraFaultDia is to combine robust, unsupervised trace representation learning with an adaptive meta-learning approach for multi-class fault categorization. The method is grounded in two main principles:

  1. Unsupervised Multi-modal Data Fusion for Effective Representation: Microservice traces are high-dimensional and multi-modal, containing diverse information in spans (textual, time-based, identity) and logs (textual, severity, identity). To make this complex data tractable and effective for analysis, TraFaultDia employs an Autoencoder architecture, specifically AttenAE, to learn compressed, unified representations. The unsupervised nature of Autoencoders is critical because unlabeled trace data is readily available, while labeled data for specific fault categories is scarce. The Multi-Head Attention mechanism within AttenAE is designed to intelligently fuse these diverse data modalities, focusing on the most relevant features.

  2. Meta-Learning for Few-Shot and Cross-System Adaptability: Microservice environments are inherently heterogeneous (C1) and constantly evolving, leading to imbalanced fault category distributions (C3) with many rare fault types. Traditional supervised learning models would struggle with few-shot learning (very few examples per category) and would require extensive retraining for each new MSS or fault type. TraFaultDia overcomes this by adopting Model-Agnostic Meta-Learning (MAML) trained with a Transformer-Encoder (TEMAML). MAML learns a good initialization of model parameters that can quickly adapt to new, unseen tasks (new fault categories or new MSS) with just a few gradient updates and a small number of labeled examples. This "learning to learn" principle allows TraFaultDia to be efficient and practical in dynamic AIOps scenarios.

4.2. Core Methodology In-depth (Layer by Layer)

The TraFaultDia framework consists of two main components: the Multi-Head Attention Autoencoder (AttenAE) and the Transformer-Encoder based Model-Agnostic Meta-Learning (TEMAML) model.

4.2.1. TraFaultDia Workflow Overview

The overall workflow of TraFaultDia is illustrated in Figure 3.

Fig. 3. Overview of our framework. The figure shows the interaction between AttenAE and TEMAML: AttenAE encodes and decodes abnormal traces of an MSS to construct trace representations, which TEMAML then classifies via meta-learning, in meta-training and meta-testing phases.

The workflow proceeds in two main stages:

  1. AttenAE for Trace Representation Construction: For a given MSS, AttenAE is initially trained on a large volume of unlabeled traces. Its purpose is to unsupervisedly learn how to effectively fuse the high-dimensional, multi-modal trace-related data into compressed yet effective trace representations. Once trained, the encoder part of AttenAE is used independently to generate these trace representations for any new, unseen traces within that specific MSS.

  2. TEMAML for Abnormal Trace Classification: This component uses a Transformer-Encoder (TE) as its base model, trained via meta-learning.

    • Meta-training Phase: TEMAML is trained on a series of abnormal trace classification tasks (called meta-training tasks) sampled from a source MSS. During this phase, TEMAML learns optimal initial parameters that enable rapid adaptation.
    • Meta-testing Phase: After meta-training, TEMAML is evaluated on new, unseen abnormal trace classification tasks (called meta-testing tasks). These meta-testing tasks can originate from the same MSS used during meta-training or from an entirely different MSS. For these tasks, TEMAML leverages the optimized AttenAE encoder of the respective MSS to convert abnormal traces into their trace representations before performing classification.

4.2.2. AttenAE for Constructing Trace Representations

The AttenAE architecture is depicted in Figure 4. For a given MSS, AttenAE processes a set of traces, denoted as $Tr = \{Tr_1, Tr_2, ..., Tr_n\}$, where each $Tr_i$ is an individual trace comprising a sequence of spans and logs. Thus, $Tr = (\mathrm{Span}, \mathrm{Log})$ collectively refers to the spans and logs across all traces.

Fig. 4. AttenAE architecture. The figure shows attributes extracted from a trace's spans and logs being projected and reconstructed through a Multi-Head Attention based encoder-decoder.

4.2.2.1. Span Preprocessing and Vector Generation

For each span, TraFaultDia extracts the following attributes, as identified in Section 2.2:

  • Textual Attributes: call component and path.

  • Time-based Attributes: span start time and span end time.

  • Identity Attributes: trace ID and span ID.

    The trace IDs are used to group spans correctly into their respective traces. The attributes are then processed into numerical vector representations:

  1. Time-based Attributes: These (in UNIX format) are normalized within the context of each span to account for individual characteristics and scales. The normalized values are concatenated to form a single vector $V_{\mathrm{numeric}}$ for each span.

  2. Span IDs: To capture hierarchical relationships, shared common prefixes in span IDs are abstracted away, retaining only hierarchical-level digits. For example, a480f2.0, a480f2.1, a480f2.2 might be reassigned as 1.0, 1.1, 1.2. These normalized span IDs within the trace context are converted into a vector $V_{\mathrm{span\_id}}$.

  3. Textual Attributes: The call component and path are concatenated to form a singular attribute called "service operation." To represent this textual information, a neural representation method (similar to [Le and Zhang 2021]) is used, leveraging the pre-trained BERT model with WordPiece tokenization to handle evolving and out-of-vocabulary (OOV) words. This process involves three steps:

    • Step 1: Preprocessing: Convert uppercase letters to lowercase, substitute specific variables (e.g., Prod1234 with ProductID), and remove non-alphabetic characters.

    • Step 2: Tokenization: Apply WordPiece tokenization to break "service operations" into subwords.

    • Step 3: Neural Representation: Feed the subwords into a BERT base model (specifically, the word embeddings generated by its last encoding layer) and calculate the sentence embedding of each "service operation" by averaging its word embeddings. This yields a vector representation $V_{\mathrm{operation}}$.

      Finally, for each span within a trace, the vector representations from these three types of attributes are concatenated to form a composite vector $V_{\mathrm{span}} \in \mathbb{R}^{d_{\mathrm{span}}}$, where $d_{\mathrm{span}}$ is the dimensionality of this combined vector space: $ V_{\mathrm{span}} = \mathrm{Concatenate}(V_{\mathrm{operation}}, V_{\mathrm{numeric}}, V_{\mathrm{span\_id}}) $
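
The sketch below illustrates, under stated assumptions, how these three span attributes could be turned into one composite vector. The helper names (normalize_span_id, embed_text), the bert-base-uncased checkpoint, and the toy numeric values are illustrative, not the paper's exact implementation.

```python
import re
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def normalize_span_id(span_id: str, trace_index: int = 1) -> str:
    """Abstract the shared prefix away: 'a480f2.1' -> '1.1' (trace_index is assumed)."""
    _, _, hierarchy = span_id.partition(".")
    return f"{trace_index}.{hierarchy}" if hierarchy else str(trace_index)

def embed_text(text: str) -> torch.Tensor:
    """Steps 1-3: preprocess, WordPiece-tokenize, average BERT word embeddings."""
    text = re.sub(r"[^a-z ]", " ", text.lower())   # Step 1: lowercase, strip non-alphabetic
    inputs = tokenizer(text, return_tensors="pt")  # Step 2: WordPiece tokenization
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # Step 3: last encoding layer
    return hidden.mean(dim=1).squeeze(0)           # sentence embedding by averaging

# V_span = Concatenate(V_operation, V_numeric, V_span_id) for one span.
v_operation = embed_text("ts-travel-service query info")  # textual "service operation"
v_numeric = torch.tensor([0.12, 0.87])                    # normalized start/end times (toy values)
v_span_id = torch.tensor([1.0, 1.0])                      # e.g., from normalize_span_id("a480f2.1")
v_span = torch.cat([v_operation, v_numeric, v_span_id])   # composite span vector
```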

4.2.2.2. Log Preprocessing and Vector Generation

From logs, TraFaultDia extracts textual (log component, message, severity level) and identity (trace ID, span ID) attributes. Trace IDs are used to group logs into their corresponding traces.

  1. Textual Attributes: The log component, message, and severity level are concatenated to form a singular attribute called "log event."
  2. Neural Representation: Similar to service operations, a neural representation method is applied to "log events" following the same three-step process (preprocessing, WordPiece tokenization, BERT embeddings). This generates a vector representation $V_{\mathrm{log}} \in \mathbb{R}^{d_{\mathrm{log}}}$, where $d_{\mathrm{log}}$ is the dimensionality of its vector space.

4.2.2.3. Trace Representation Construction

For a given MSS, AttenAE's encoder constructs trace representations $Z$ from the processed span and log vectors.

  1. Projection into Common Feature Space: The input vectors $V_{\mathrm{span}}$ and $V_{\mathrm{log}}$ are first projected into a common feature space $\mathbb{R}^{d'}$: $ V'_{\mathrm{span}} = g(W_{\mathrm{span}}V_{\mathrm{span}} + b_{\mathrm{span}}) $ $ V'_{\mathrm{log}} = g(W_{\mathrm{log}}V_{\mathrm{log}} + b_{\mathrm{log}}) $

    • $g$: Denotes the activation function (e.g., ReLU).
    • $W_{\mathrm{span}}$, $W_{\mathrm{log}}$: Are the respective weight matrices for the linear transformation.
    • $b_{\mathrm{span}}$, $b_{\mathrm{log}}$: Are the bias vectors.
  2. Multi-Head Attention Fusion: The projected vectors $V'_{\mathrm{span}}$ and $V'_{\mathrm{log}}$ are then fused using a Multi-Head Attention mechanism. This mechanism, as defined in Equation 2, captures complex interactions between spans and logs.

    $ \left\{ \begin{array}{l} \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \\ \mathrm{head}_i = \mathrm{Attention}(QW^Q, KW^K, VW^V) \\ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concatenate}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O \end{array} \right. $

    • $Q$, $K$, $V$: Refer to the query, key, and value matrices, respectively.

    • $QK^T$: Computes attention scores by taking the dot product of Query and Key.

    • $\sqrt{d_k}$: A scaling factor for numerical stability.

    • softmax: Applies a softmax function to produce the attention distribution (weights).

    • $\mathrm{head}_i$: Each attention head computes attention after transforming Q, K, V with separate learned weight matrices ($W^Q, W^K, W^V$).

    • MultiHead: Concatenates the outputs of all attention heads and applies a final linear transformation with weight matrix $W^O$.

      To fuse spans and logs into a trace representation $Z$, TraFaultDia sets $V'_{\mathrm{span}}$ as the Query ($Q$), and $V'_{\mathrm{log}}$ as both the Key ($K$) and Value ($V$): $ Z = \mathrm{MultiHead}(V'_{\mathrm{span}}, V'_{\mathrm{log}}, V'_{\mathrm{log}}) $ This specific assignment reflects the architectural understanding that spans define the trace structure and service communications, while logs provide detailed contextual event information. The output $Z$ is the trace representation for a given trace $Tr_i$. For a set of traces $Tr = \{Tr_1, ..., Tr_n\}$, this process generates corresponding trace representations $Z = \{Z_1, ..., Z_n\}$.

  3. Decoder for Reconstruction: The AttenAE's decoder aims to reconstruct the original span and log vectors from the trace representations $Z$, effectively inverting the encoder's process: $ \hat{V}_{\mathrm{span}} = g(W'_{\mathrm{span}}Z + b'_{\mathrm{span}}) $ $ \hat{V}_{\mathrm{log}} = g(W'_{\mathrm{log}}Z + b'_{\mathrm{log}}) $

    • $\hat{V}_{\mathrm{span}} \in \mathbb{R}^{d_{\mathrm{span}}}$ and $\hat{V}_{\mathrm{log}} \in \mathbb{R}^{d_{\mathrm{log}}}$: Are the reconstructed span and log vectors.
    • $g$: The activation function.
    • $W'_{\mathrm{span}}$, $W'_{\mathrm{log}}$: Respective weight matrices.
    • $b'_{\mathrm{span}}$, $b'_{\mathrm{log}}$: Respective bias vectors.
  4. Loss Function for AttenAE Training: AttenAE is trained by minimizing the overall loss $\mathcal{L}$, which is the sum of squared errors between the original and reconstructed vectors for both spans and logs: $ \min_{\Psi} \mathcal{L} = \|\hat{V}_{\mathrm{span}} - V_{\mathrm{span}}\|^2 + \|\hat{V}_{\mathrm{log}} - V_{\mathrm{log}}\|^2 $

    • $\Psi$: Represents all learnable parameters within the AttenAE (encoder and decoder weights and biases).
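
Putting the projection, Multi-Head Attention fusion, decoder, and reconstruction loss together, a minimal PyTorch sketch of AttenAE could look as follows. The dimensions, the ReLU choice for $g$, the use of nn.MultiheadAttention, and the assumption that span and log sequences are padded to the same length are all illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttenAE(nn.Module):
    def __init__(self, d_span=772, d_log=768, d_common=128, n_heads=4):
        super().__init__()
        # Encoder: project V_span and V_log into a common feature space R^{d'}
        self.proj_span = nn.Linear(d_span, d_common)
        self.proj_log = nn.Linear(d_log, d_common)
        # Fusion: Z = MultiHead(V'_span, V'_log, V'_log)
        self.fuse = nn.MultiheadAttention(d_common, n_heads, batch_first=True)
        # Decoder: reconstruct the original span and log vectors from Z
        self.dec_span = nn.Linear(d_common, d_span)
        self.dec_log = nn.Linear(d_common, d_log)

    def forward(self, v_span, v_log):
        q = torch.relu(self.proj_span(v_span))   # V'_span (query)
        kv = torch.relu(self.proj_log(v_log))    # V'_log (key and value)
        z, _ = self.fuse(q, kv, kv)              # trace representation Z
        return z, torch.relu(self.dec_span(z)), torch.relu(self.dec_log(z))

# Toy batch: 8 traces, span/log sequences padded to length 20 (an assumption).
model = AttenAE()
v_span, v_log = torch.randn(8, 20, 772), torch.randn(8, 20, 768)
z, span_hat, log_hat = model(v_span, v_log)
# Sum-of-squared-errors reconstruction loss, as in the formula above.
loss = ((span_hat - v_span) ** 2).sum() + ((log_hat - v_log) ** 2).sum()
loss.backward()
```

After training, only the encoder path (projections plus fusion) is kept to produce $Z$ for downstream classification.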

4.2.3. TEMAML for Few-Shot Abnormal Trace Classification Across MSS

The TEMAML learning process is depicted in Figure 5. It consists of meta-training and meta-testing phases.

Fig. 5. TEMAML learning process. The figure shows the inner and outer loops that iteratively optimize the parameters $\theta$ for abnormal trace classification.

4.2.3.1. Base Model for Abnormal Trace Classification

The base model within TEMAML is a Transformer-Encoder (TE), denoted as $f$. It performs multi-class classification for abnormal traces. The workflow for $f$ is:

  1. Input: Receives trace representations $Z$ (generated by AttenAE's encoder) for all abnormal traces in a given task.
  2. Self-Attention Mechanism: The input $Z$ is processed through TE's self-attention mechanism. This mechanism helps TE weigh the most relevant parts and capture dependencies within each trace representation $Z_i$ to recognize the fault type for each trace $Tr_i$. This adheres to the Multi-Head Attention mechanism (Equation 2), where the input $Z$ serves as Q, K, V: $ \mathrm{output} = \mathrm{MultiHead}(Z, Z, Z) $
  3. Classification Layers: The output from self-attention is passed through:
    • A pooling layer to highlight key features.
    • A dropout layer to prevent overfitting.
    • A fully connected layer to reshape the output.
    • A softmax classifier to compute probabilities for each fault category.
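
A minimal PyTorch sketch of this base model $f$ is given below; the layer count, the mean-pooling choice, and the dropout rate are assumptions for illustration, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class TEClassifier(nn.Module):
    """Transformer-Encoder base model f: self-attention, pooling, dropout, FC, softmax."""
    def __init__(self, d_model=128, n_heads=4, n_layers=2, n_classes=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # MultiHead(Z, Z, Z)
        self.dropout = nn.Dropout(0.1)            # prevent overfitting
        self.fc = nn.Linear(d_model, n_classes)   # one logit per fault category

    def forward(self, z):                         # z: (batch, seq_len, d_model)
        h = self.encoder(z).mean(dim=1)           # pooling over the sequence
        return self.fc(self.dropout(h))           # logits; softmax yields probabilities

# 5-way classification over trace representations Z from AttenAE's encoder.
probs = TEClassifier()(torch.randn(4, 20, 128)).softmax(dim=-1)
```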

4.2.3.2. Meta-training

The objective of meta-training is to train the TE ($f$) to find robust initial parameters that allow it to quickly adapt to diverse abnormal trace classification tasks from any MSS.

  • Tasks: TE is trained on several abnormal trace classification tasks (meta-training tasks), denoted as $T = (S, Q)$, sampled from a source MSS. Each task $T_i = (S_i, Q_i)$ is configured in an N-Way K-Shot setup.

    • $S_i = \{(z_{ij}^{\mathrm{spt}}, y_{ij}^{\mathrm{spt}})\}_{j=1}^{N \times K}$: The support set, containing $N$ distinct fault categories, each with $K$ labeled example abnormal traces.
    • $Q_i = \{(z_{ig}^{\mathrm{qry}}, y_{ig}^{\mathrm{qry}})\}_{g=1}^{N \times M}$: The query set, containing $N$ distinct fault categories, each with $M$ labeled trace instances. Here, $M > K$ to ensure robust optimization.
    • $z$: Represents the trace representations generated by the optimized AttenAE's encoder for the MSS.
    • $y$: Represents the corresponding fault category labels.
  • Two-Loop Optimization (MAML): TEMAML optimizes $f$'s parameters $\theta$ using two nested loops:

    • Inner Loop (Task-level Adaptation): For each meta-training task $T_i$, the base model $f_{\theta}$ (initialized with current meta-parameters $\theta$) is adapted to this specific task. A few gradient descent steps are performed on the support set $S_i$ to update $\theta$ to task-specific parameters $\theta'_i$: $ \theta'_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{T_i}(f_{\theta}(S_i)) $

      • $\alpha$: The learning rate for inner loop updates.
      • $\mathcal{L}_{T_i}(f_{\theta}(S_i))$: The loss of model $f_{\theta}$ on the support set $S_i$ of task $T_i$.
    • Outer Loop (Meta-level Optimization): After adapting to all meta-training tasks (obtaining $\theta'_i$ for each $T_i$), the meta-parameters $\theta$ are updated to minimize the aggregated loss on the query sets $Q_i$ across all meta-training tasks $T$: $ \min_{\theta} \mathcal{L}(\theta) = \sum_{T_i \in T} \mathcal{L}_{T_i}(f_{\theta'_i}(Q_i)) $

      • $\mathcal{L}_{T_i}(f_{\theta'_i}(Q_i))$: The loss of the adapted model $f_{\theta'_i}$ on the query set $Q_i$ of task $T_i$. The paper uses a first-order approximation for the outer loop update to simplify the computationally intensive process of computing second-order derivatives (gradients through gradients): $ \theta \gets \theta - \beta \nabla_{\theta} \sum_{T_i \in T} \mathcal{L}_{T_i}(f_{\theta'_i}(Q_i)) $
      • $\beta$: The learning rate for outer loop updates. This outer loop update is performed for a predetermined number of optimization steps, yielding the optimal meta-parameters $\theta^*$. The base model TE is then initialized with these $\theta^*$, becoming the optimized base model $f_{\theta^*}$, which possesses enhanced adaptability.
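
A compact sketch of this two-loop, first-order MAML procedure is shown below, reusing the TEClassifier sketch above as the base model. The learning rates, step counts, and task format are assumed hyperparameters, not the paper's reported settings.

```python
import copy
import torch
import torch.nn.functional as F

def inner_adapt(model, support_x, support_y, alpha=0.01, steps=3):
    """Inner loop: adapt a copy of f_theta to one task's support set S_i."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=alpha)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(adapted(support_x), support_y).backward()
        opt.step()
    return adapted                                   # f with task-specific theta'_i

def meta_step(model, meta_opt, tasks):
    """Outer loop (first-order): update theta from query-set losses of adapted models."""
    meta_opt.zero_grad()
    for support_x, support_y, query_x, query_y in tasks:
        adapted = inner_adapt(model, support_x, support_y)
        query_loss = F.cross_entropy(adapted(query_x), query_y)
        grads = torch.autograd.grad(query_loss, adapted.parameters())
        # First-order approximation: apply the adapted model's gradients to theta directly.
        for p, g in zip(model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()

# meta_model = TEClassifier(); meta_opt = torch.optim.Adam(meta_model.parameters(), lr=1e-3)  # beta
# tasks: a list of (support_x, support_y, query_x, query_y) tensors, one tuple per N-way K-shot task
```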

4.2.3.3. Meta-testing

In the meta-testing phase, the optimized TE ($f_{\theta^*}$) is used to adapt to new, unseen abnormal trace classification tasks (meta-testing tasks) from any MSS.

  • Task Configuration: Each meta-testing task $T_{ts} = (S_{ts}, Q_{ts})$ is configured similarly to meta-training tasks (e.g., N-Way K-Shot).
  • Adaptation and Evaluation: To adapt to a specific meta-testing task $T_{ts}$ (see the sketch below):
    1. The optimized base model $f_{\theta^*}$ is fine-tuned on the support set $S_{ts}$ of that task to obtain task-specific parameters $\theta'_{ts}$.
    2. The adapted model $f_{\theta'_{ts}}$ is then used to classify abnormal traces in the query set $Q_{ts}$, and its performance is evaluated.
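
Continuing the MAML sketch above, meta-testing reduces to the same inner-loop fine-tuning followed by query-set evaluation. The fragment below reuses inner_adapt from the previous block; meta_model and the *_ts tensors are assumed to be defined as in that sketch.

```python
# Fine-tune f_{theta*} on the support set S_ts, then classify the query set Q_ts.
adapted = inner_adapt(meta_model, support_x_ts, support_y_ts)  # yields theta'_ts
preds = adapted(query_x_ts).argmax(dim=-1)                     # predicted fault categories
accuracy = (preds == query_y_ts).float().mean()                # query-set accuracy
```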

5. Experimental Setup

5.1. Datasets

The study evaluates TraFaultDia using trace data from two benchmark MSS, TrainTicket and OnlineBoutique, sourced from open datasets:

  • DeepTraLog [FudanSELab 2024]: Provides normal and abnormal traces across 14 fault categories for TrainTicket. These categories represent various real-world anomaly scenarios (asynchronous interaction, multi-instance, configuration, monolithic faults).

  • Nezha [IntelligentDDS 2024]: Contains normal and abnormal traces for both TrainTicket and OnlineBoutique. It includes abnormal traces from five fault types (CPU contention, CPU consumption, network delay, error return, exception code defect) applied to various service pods. Each pod associated with a fault type forms a unique fault category.

    Fault Dataset Construction: The authors combined data from DeepTraLog and Nezha to create a comprehensive fault dataset:

  • TrainTicket: Includes abnormal traces from 30 fault categories (F1-F30).

  • OnlineBoutique: Includes abnormal traces from 32 fault categories (B1-B32).

    For experiments, these fault categories were further divided:

  • TrainTicket: 20 base fault categories and 10 novel fault categories.

  • OnlineBoutique: 22 base fault categories and 10 novel fault categories. The composition of base and novel categories for each system was designed to include a random mix of fault categories with varying numbers of spans and logs, ensuring a representative evaluation.

The following are the results from Table 1 of the original paper:

TrainTicket (DeepTraLog):
  • Asynchronous service invocations related faults: F1. Asynchronous message sequence error; F2. Unexpected order of data requests; F13. Unexpected order of price optimization steps
  • Multiple service instances related faults: F8. Key passing issues in requests; F11. BOM data is updated in an unexpected sequence; F12. Price status query ignores expected service outputs
  • Configuration faults: F3. JVM and Docker configuration mismatch; F4. SSL offloading issue; F5. High request load; F7. Overload of requests to a third-party service
  • Monolithic faults: F6. SQL error of a dependent service; F9. Bi-directional CSS display error; F10. API errors in BOM update; F14. Locked product incorrectly included in CPI calculation

TrainTicket (Nezha):
  • CPU contention on F23.travel, F25.contact, F26.food service pods
  • Network delay on F28.basic, F29.travel, F30.route, F27.security, F24.verification-code service pods
  • Message return errors on F16.basic, F15.contact, F18.food, F19.verification-code service pods
  • Exception code defects on F17.basic, F21.route, F22.price, F20.travel service pods

OnlineBoutique (Nezha):
  • CPU contention on B4.shipping, B14.cart, B18.currency, B19.email, B26.recommendation, B31.adservice, B9.payment, B11.frontend service pods
  • CPU consumption on B8.recommendation, B12.frontend, B24.productcatalog, B28.shipping, B17.checkout, B20.email, B32.adservice service pods
  • Network delay on B10.currency, B1.cart, B15.checkout, B22.productcatalog, B27.shipping, B21.payment, B25.recommendation, B29.adservice, B7.email service pods
  • Message return errors on B6.frontend, B23.productcatalog, B2.checkout, B30.adservice service pods
  • Exception code defects on B5.adservice, B13.frontend, B3.productcatalog, B16.checkout service pods

The following are the results from Table 2 of the original paper:

TrainTicket                          Mean    Min    Max
Unique traces per fault category:    1196    26     2546
Spans per trace:                     79      1      345
Logs per trace:                      44      1      340

OnlineBoutique                       Mean    Min    Max
Unique traces per fault category:    443     32     1018
Spans per trace:                     53      1      190
Logs per trace:                      51      4      184

Table 2 provides descriptive statistics on the traces in the fault dataset, highlighting the variability in the number of unique traces per fault category (e.g., 26 to 2546 for TrainTicket) and the complexity of individual traces (e.g., 1 to 345 spans, 1 to 340 logs per trace for TrainTicket). This data underscores the challenges C2 (high-dimensional, multi-modal data) and C3 (imbalanced distribution) that TraFaultDia aims to address.

Unlabeled Traces for AttenAE:

  • 3960 unlabeled traces were randomly selected from each MSS (3360 for training, 570 for validation) to train AttenAE.
  • These traces were ensured to primarily consist of normal traces and did not overlap with the fault datasets used for TEMAML training/evaluation.

5.2. Evaluation Metrics

The study employs several standard metrics to evaluate the effectiveness and efficiency of TraFaultDia and its baselines.

  • Accuracy:

    • Conceptual Definition: In classification tasks, accuracy measures the proportion of total predictions that were correct. It is a straightforward metric that indicates the overall correctness of a model. For multi-class classification, it represents the number of correctly classified instances across all classes divided by the total number of instances. The paper argues accuracy is suitable for its 5-way meta-learning setup, where always predicting a single class would yield only 20% accuracy.
    • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\mathrm{Number~of~Correct~Predictions}}{\mathrm{Total~Number~of~Predictions}} $
    • Symbol Explanation:
      • Number of Correct Predictions: The count of instances where the model's predicted fault category matches the true fault category.
      • Total Number of Predictions: The total number of abnormal traces that the model attempted to classify.
  • T-tests [Fisher 1992]:

    • Conceptual Definition: A statistical hypothesis test used to determine if there is a significant difference between the means of two groups (in this case, the accuracy scores of two different models). It helps ascertain if observed differences are likely due to the intervention (e.g., using TraFaultDia vs. a baseline) or merely random chance.
    • Relevance: Used to statistically compare the accuracy of TraFaultDia against effective baselines across 50 meta-testing tasks, determining if the observed performance difference is statistically significant.
  • Cohen's D [Cohen 2013]:

    • Conceptual Definition: A measure of effect size, used to quantify the magnitude of the difference between two groups (e.g., the difference in accuracy between two models). Unlike p-values from t-tests which indicate statistical significance, Cohen's D indicates practical significance or the strength of the observed effect.
    • Mathematical Formula (for two independent groups with equal variances): $ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p} $, where $ s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} $
    • Symbol Explanation:
      • $\bar{x}_1$, $\bar{x}_2$: The means of the two groups being compared (e.g., average accuracy of TraFaultDia and a baseline).
      • $s_1$, $s_2$: The standard deviations of the two groups.
      • $n_1$, $n_2$: The sample sizes of the two groups.
      • $s_p$: The pooled standard deviation.
    • Interpretation:
      • $d = 0.2$: Small effect size
      • $d = 0.5$: Medium effect size
      • $d = 0.8$ or higher: Large effect size
    • Relevance: Provides a quantitative measure of how much TraFaultDia's performance differs from baselines, indicating the practical importance of any observed differences. (A short computational sketch of these statistics follows this list.)
  • Training Time:

    • Conceptual Definition: The total time taken for a model to complete its training process. For TraFaultDia, this includes the training time for AttenAE on unlabeled traces and the meta-training time for TEMAML.
    • Relevance: Measures the efficiency of learning, particularly important for AIOps systems that need to be deployed and updated.
  • Testing Time:

    • Conceptual Definition: The time taken for a trained model to make predictions or adapt to new tasks during the evaluation phase. For TraFaultDia, this refers to the time for TEMAML to adapt to a meta-testing task and classify its query set.
    • Relevance: Crucial for real-time AIOps systems where rapid fault diagnosis is essential to minimize downtime.
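
As a hedged illustration of how these statistics combine, the snippet below computes a two-sample t-test and Cohen's d over per-task accuracies for 50 tasks; the arrays are dummy data, not the paper's results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
acc_a = rng.uniform(0.85, 1.00, size=50)  # e.g., one model's accuracy on 50 meta-testing tasks
acc_b = rng.uniform(0.80, 0.95, size=50)  # e.g., a baseline's accuracy on the same tasks

t_stat, p_value = stats.ttest_ind(acc_a, acc_b)  # two-sample t-test

def cohens_d(x1, x2):
    """Effect size with pooled standard deviation, as in the formula above."""
    n1, n2 = len(x1), len(x2)
    s_p = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2))
    return (x1.mean() - x2.mean()) / s_p

print(f"p = {p_value:.4g}, d = {cohens_d(acc_a, acc_b):.2f}")
```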

5.3. Baselines

Given the absence of directly comparable AIOps methods for cross-system few-shot abnormal trace categorization, the authors construct baselines by combining different trace representation and classification approaches. These baselines also serve as an ablation study to understand the contribution of TraFaultDia's components.

The neural representation method (using BERT with WordPiece tokenization for textual attributes) employed in TraFaultDia is also applied to all baselines to ensure a fair comparison and isolate the performance differences attributable to the core TraFaultDia architecture.

1. AttenAE Alternative (Trace Representation): These baselines modify how trace representations are constructed, while still using TEMAML for classification.

  • OnlySpan+TEMAML: This baseline follows related work ([Nedelkoski et al. 2019a]) by considering each trace solely as a sequence of spans. It uses only the span attributes identified in TraFaultDia (textual, time-based, identity) to construct trace representations, completely omitting log data.

  • LinearAE+TEMAML: This baseline uses a simplified Autoencoder (LinearAE) for multi-modal fusion. Instead of Multi-Head Attention, it employs linear projection to fuse span and log attributes, lacking the sophisticated gating mechanism of GluAE or the attention mechanism of AttenAE.

  • GluAE+TEMAML: This baseline uses a Gated Linear Unit (GLU)-based Autoencoder (GluAE) for multi-modal fusion. It constructs trace representations by fusing the same span and log attributes as TraFaultDia but uses GLUs, adapted from modality fusion methods for MSS ([Lee et al. 2023]).

2. Transformer Encoder Alternatives (Base Model): These baselines replace the Transformer-Encoder (TE) in TEMAML with different base models, while retaining AttenAE for trace representation and MAML for meta-learning.

  • AttenAE+LinearMAML: Uses a basic linear model as the base classifier within MAML. This serves as the simplest possible classification approach.

  • AttenAE+RnnMAML: Uses a Recurrent Neural Network (RNN) as the base model. RNNs are traditionally used for sequential data and have shown effectiveness in some binary anomaly detection tasks.

  • AttenAE+LstmMAML: Uses a Long Short-Term Memory (LSTM) network as the base model. LSTMs are a type of RNN particularly good at learning long-term dependencies in sequences.

  • AttenAE+CnnMAML: Uses a Convolutional Neural Network (CNN) as the base model. CNNs have been used in related work for multi-class abnormal trace classification ([Nedelkoski et al. 2019a]).

3. TEMAML Alternatives (Meta-Learning / Classification Strategy): These baselines replace the entire TEMAML component (or its meta-learning strategy) with other meta-learning methods or traditional machine learning models. For these, the AttenAE is still used to generate trace representations. Critically, these alternatives are only evaluated on E1 and E2 (within-system) because their underlying mechanisms are not designed for cross-system transfer learning in the same way MAML is.

  • AttenAE+TEMatchNet: Uses a Matching Network ([Vinyals et al. 2016]), another meta-learning approach for few-shot learning, as the classifier.

  • AttenAE+ProtoNet: Uses a Prototypical Network ([Snell et al. 2017]), another popular meta-learning algorithm that classifies by finding the closest "prototype" for each class in an embedding space.

  • AttenAE+NNeighbor: Uses a Nearest Neighbor classifier. This is a fundamental machine learning algorithm that classifies instances based on the majority class of their closest neighbors. Its performance provides insight into the inherent separability of the trace representations.

  • AttenAE+DTree: Uses a Decision Tree classifier. This is a non-parametric supervised learning method used for classification and regression, which can also serve as a basic classification baseline.

6. Results & Analysis

6.1. Core Results Analysis

The evaluation focuses on answering two research questions:

  • RQ1: Within-system adaptability: How effectively and efficiently can TraFaultDia adapt to new abnormal trace classification tasks within the same MSS?

  • RQ2: Cross-system adaptability: How effectively and efficiently can TraFaultDia adapt to new abnormal trace classification tasks in a different MSS?

    The experiments (E1-E4) test these aspects:

  • E1 (TrainTicket → TrainTicket): RQ1 for TrainTicket.

  • E2 (OnlineBoutique → OnlineBoutique): RQ1 for OnlineBoutique.

  • E3 (OnlineBoutique → TrainTicket): RQ2 (trained on OnlineBoutique, tested on TrainTicket).

  • E4 (TrainTicket → OnlineBoutique): RQ2 (trained on TrainTicket, tested on OnlineBoutique).

6.1.1. Effectiveness

The following are the results from Table 3 of the original paper:

Model E1 (TrainTicket→TrainTicket) E3 (OnlineBoutique→TrainTicket)
5-shot 10-shot 5-shot 10-shot
Our TraFaultDia 92.91±2.10 (74.67-100.0) 93.26±1.40 (76.00-100.0) 86.35±2.00 (70.67-100.0) 92.19±1.99 (74.67-100.0)
AttenAE alternative:
OnlySpan+TEMAML 80.64±2.84 (57.33-97.33) 78.77±2.80 (60.00-97.33) 79.25±2.89 (57.33-97.33) 80.19±3.10 (56.00-97.33)
Multihead attention fusion alternatives:
LinearAE+TEMAML 89.15±2.29 (73.33-100.0) 90.59±2.43 (70.67-100.0) 83.09±2.55 (62.67-97.33) 90.61±2.01 (72.00-100.0)
GluAE+TEMAML 92.21±1.73 (77.33-100.0) 93.07±1.64 (77.33-100.0) 85.07±2.38 (66.67-100.0) 94.40±2.19 (72.00-100.0)
Transformer encoder alternatives:
AttenAE+LinearMAML 45.84±2.21 (25.33-61.33) 45.36±2.16 (30.67-60.00) 43.81±1.99 (28.00-58.67) 43.87±1.93 (29.33-60.00)
AttenAE+RnnMAML 49.65±2.09 (37.33-64.00) 42.88±1.93 (24.00-58.67) 48.45±1.75 (38.67-65.33) 47.07±1.88 (34.67-58.67)
AttenAE+LstmMAML 41.39±2.20 (21.33-56.00) 42.67±1.91 (29.33-56.00) 40.32±1.68 (22.67-52.00) 42.29±1.79 (25.33-56.00)
AttenAE+CnnMAML 57.06±2.85 (41.33-81.00) 69.20±2.19 (56.00-88.00) 49.04±1.80 (38.67-64.00) 69.47±2.65 (48.00-89.33)
TEMAML alternatives:
AttenAE+TEMatchNet 76.56±2.80 (49.33-93.33) 76.05±2.37 (50.67-94.67)
AttenAE+ProtoNet 57.25±0.03 (40.00-74.67) 59.68±0.03 (44.00-76.00)
AttenAE+NNeighbor 88.19±0.02 (74.67-98.67) 92.56±0.02 (78.00-100.0)
AttenAE+DTree 66.80±0.03 (46.67-88.00) 77.09±0.03 (54.67-96.00)

The following are the results from Table 4 of the original paper:

Model E2 (OnlineBoutique→OnlineBoutique) E4 (TrainTicket→OnlineBoutique)
5-shot 10-shot 5-shot 10-shot
Our TraFaultDia 82.50±2.35 (65.33-98.67) 85.20±2.33 (66.67-98.67) 82.37±2.07 (64.00-97.33) 84.77±2.28 (68.00-98.67)
AttenAE alternative:
OnlySpan+TEMAML 72.83±2.40 (57.33-88.00) 73.15±2.81 (46.67-92.00) 71.81±2.25 (56.00-85.33) 73.60±2.21 (57.33-85.33)
Multihead attention fusion alternatives:
LinearAE+TEMAML 76.15±2.59 (60.00-95.00) 78.21±2.50 (64.00-96.00) 75.81±2.45 (52.00-89.33) 74.32±2.44 (53.33-88.00)
GluAE+TEMAML 80.61±2.96 (58.67-98.67) 77.49±2.67 (48.00-94.67) 74.96±2.76 (54.67-94.67) 77.57±2.70 (56.00-94.67)
Transformer encoder alternatives:
AttenAE+LinearMAML 42.59±3.63 (20.00-77.33) 40.75±3.53 (21.33-68.00) 47.01±3.59 (25.33-74.67) 44.35±4.10 (20.00-89.33)
AttenAE+RnnMAML 72.59±2.50 (54.67-94.67) 64.75±2.53 (46.67-80.00) 72.58±2.58 (54.70-94.70) 71.01±2.80 (56.00-89.33)
AttenAE+LstmMAML 54.80±2.11 (40.00-70.67) 55.97±2.25 (41.33-69.33) 56.19±1.92 (38.67-72.00) 59.71±2.05 (42.67-77.33)
AttenAE+CnnMAML 80.10±2.16 (60.00-94.67) 83.07±3.29 (68.00-97.33) 79.01±2.63 (56.00-97.33) 84.08±2.76 (65.33-100.0)
TEMAML alternatives:
AttenAE+TEMatchNet 76.29±3.00 (54.67-96.00) 73.11±2.94 (50.67-94.67)
AttenAE+ProtoNet 74.51±0.03 (53.33-92.00) 76.59±0.04 (58.66-94.67)
AttenAE+NNeighbor 80.96±0.03 (64.00-98.67) 84.75±0.03 (62.80-98.67)
AttenAE+DTree 66.99±0.02 (54.67-80.00) 73.79±0.03 (58.67-82.67)

Overall Performance of TraFaultDia:

  • TraFaultDia consistently achieves high average accuracy across all four experimental setups (E1-E4) and both 5-shot and 10-shot configurations.
  • Within-System (RQ1):
    • TrainTicket (E1): 92.91% (5-shot) and 93.26% (10-shot).
    • OnlineBoutique (E2): 82.50% (5-shot) and 85.20% (10-shot).
  • Cross-System (RQ2):
    • OnlineBoutique trained, TrainTicket tested (E3): 86.35% (5-shot) and 92.19% (10-shot).
    • TrainTicket trained, OnlineBoutique tested (E4): 82.37% (5-shot) and 84.77% (10-shot). These results strongly validate TraFaultDia's effectiveness in both within-system and cross-system contexts, demonstrating its robust few-shot learning capability.

Comparison with Baselines:

  • Most Effective Baselines: GluAE+TEMAML, LinearAE+TEMAML, AttenAE+CnnMAML, and AttenAE+NNeighbor are identified as the most effective baselines, achieving over 80% average accuracy in at least one setup.
  • Robustness: TraFaultDia demonstrates better robustness by maintaining consistently high accuracy across all experimental setups compared to these baselines.
    • GluAE+TEMAML and LinearAE+TEMAML perform similarly to TraFaultDia in E1 and E3 (TrainTicket tasks), but show a noticeable drop (2-8% and 7-10%, respectively) in E2 and E4 (OnlineBoutique tasks). This suggests that AttenAE's multi-head attention fusion is more robust across diverse MSS.
    • AttenAE+CnnMAML performs reasonably well in E2 and E4 (79-84%), but significantly worse in E1 and E3 (49-69%). This indicates that CNNs may struggle with the specific patterns and complexities of TrainTicket's traces and fault categories when combined with MAML.
    • AttenAE+NNeighbor achieves accuracy comparable to TraFaultDia in within-system setups (E1 and E2, only 1-4% lower), indicating good local classification capability on the learned AttenAE representations. However, its lack of cross-system adaptability makes it unsuitable for RQ2.
  • Sequence Models (RNN/LSTM): AttenAE+RnnMAML and AttenAE+LstmMAML generally perform poorly, often below 70%, highlighting the limitations of traditional sequence models for this multi-class fault categorization task, even when combined with MAML.
  • Simpler Models and Alternative Meta-Learners: AttenAE+LinearMAML, AttenAE+ProtoNet, AttenAE+TEMatchNet, and AttenAE+DTree show significantly lower accuracy, emphasizing the necessity of TraFaultDia's specific architectural choices. LinearMAML and ProtoNet consistently perform below 60%, indicating that simple base models or different meta-learning paradigms are insufficient for the complexity of the task.

Statistical Significance and Effect Size: The following are the results from Table 5 of the original paper:

| Our TraFaultDia vs. | E1 5-shot | E1 10-shot | E3 5-shot | E3 10-shot |
|---|---|---|---|---|
| GluAE+TEMAML | 0.0719 | 0.5347 | 0.0045*** | 7.74e-07*** |
| LinearAE+TEMAML | 1.62e-13*** | 1.14e-09*** | 1.87e-10*** | 0.0005*** |
| AttenAE+NNeighbor | 7.10e-29*** | 0.0006*** | — | — |

| Our TraFaultDia vs. | E2 5-shot | E2 10-shot | E4 5-shot | E4 10-shot |
|---|---|---|---|---|
| GluAE+TEMAML | 0.0006*** | 6.95e-28*** | 1.70e-27*** | 9.15e-27*** |
| AttenAE+CnnMAML | 6.65e-07*** | 0.0003*** | 2.01e-10*** | 0.176 |
| AttenAE+NNeighbor | 1.11e-05*** | 0.1752 | — | — |

***p < 0.001; **p < 0.01; *p < 0.05

The following are the results from Table 6 of the original paper:

| Our TraFaultDia vs. | E1 5-shot | E1 10-shot | E3 5-shot | E3 10-shot |
|---|---|---|---|---|
| GluAE+TEMAML | 0.36 | 0.13 | 0.58 | -1.06 |
| LinearAE+TEMAML | 1.71 | 1.35 | 1.42 | 0.79 |
| AttenAE+NNeighbor | 3.18 | 0.71 | — | — |

| Our TraFaultDia vs. | E2 5-shot | E2 10-shot | E4 5-shot | E4 10-shot |
|---|---|---|---|---|
| GluAE+TEMAML | 0.71 | 3.07 | 3.04 | 2.88 |
| AttenAE+CnnMAML | 1.06 | 0.75 | 1.42 | 0.27 |
| AttenAE+NNeighbor | 0.93 | 0.27 | — | — |
  • Statistical Significance (Table 5): TraFaultDia shows statistically significant differences (p < 0.001) from the most effective baselines across nearly all experimental setups, except for GluAE+TEMAML in E1 (5-shot and 10-shot), AttenAE+NNeighbor in E2 (10-shot), and AttenAE+CnnMAML in E4 (10-shot). This confirms that TraFaultDia's superior performance is not due to random chance.
  • Effect Size (Table 6): TraFaultDia demonstrates large positive effect sizes (Cohen's d > 0.8) against GluAE+TEMAML in E2 and E4, and against LinearAE+TEMAML and AttenAE+NNeighbor in 5-shot setups across all experiments. A negative Cohen's d for GluAE+TEMAML in E3 (10-shot) indicates that GluAE+TEMAML performed better in that specific case, which is also reflected in Table 3, where GluAE+TEMAML achieved 94.40% compared to TraFaultDia's 92.19%.
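As a reference for interpreting Table 6, Cohen's d compares two mean accuracies relative to their pooled standard deviation. Below is a minimal sketch of the pooled-variance form (one common variant; the paper's exact computation is not reproduced here), with made-up per-task accuracy arrays standing in for the 50 meta-testing tasks.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation (one common variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
acc_ours = rng.normal(93.3, 1.4, 50)       # illustrative per-task accuracies
acc_baseline = rng.normal(90.6, 2.4, 50)   # illustrative baseline accuracies
print(round(cohens_d(acc_ours, acc_baseline), 2))  # d > 0.8 reads as a large effect
```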

Exploration Results: The lowest accuracies for TraFaultDia and the effective baselines in OnlineBoutique's meta-testing tasks (E2 and E4) were observed for performance-related issues such as CPU contention, CPU consumption, and network delay on the same pods (ranging from 48% to 68%). The authors suggest that incorporating performance metrics (e.g., CPU, memory, network traffic) directly into trace representation construction could enhance categorization accuracy for these fault types.
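One simple way to act on that suggestion would be feature concatenation: append normalized pod-level metrics to the latent trace representation before classification. The sketch below is purely illustrative; the tensor shapes and the metric set are assumptions, not the paper's design.

```python
import torch

# Hypothetical latent trace representations from the autoencoder (batch of 8, dim 64)
z = torch.randn(8, 64)
# Hypothetical normalized pod-level metrics: CPU, memory, network utilization in [0, 1]
metrics = torch.rand(8, 3)
# Concatenate so the downstream classifier also sees performance signals
z_augmented = torch.cat([z, metrics], dim=-1)  # shape (8, 67)
```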

6.1.2. Efficiency

The following are the results from Table 7 of the original paper:

| Model | E1 5-shot | E1 10-shot | E3 5-shot | E3 10-shot |
|---|---|---|---|---|
| Our TraFaultDia | 36.3394 | 25.8387 | 46.0045 | 62.7578 |
| GluAE+TEMAML | 19.9185 | 20.6976 | 42.3725 | 78.0125 |
| LinearAE+TEMAML | 17.7576 | 45.1150 | 119.6440 | 30.5474 |
| AttenAE+NNeighbor* | — | — | — | — |

| Model | E2 5-shot | E2 10-shot | E4 5-shot | E4 10-shot |
|---|---|---|---|---|
| Our TraFaultDia | 40.9426 | 65.3120 | 32.4136 | 53.3256 |
| GluAE+TEMAML | 20.1432 | 58.8608 | 21.6602 | 47.0818 |
| AttenAE+CnnMAML | 23.9832 | 36.7725 | 15.6358 | 23.6530 |
| AttenAE+NNeighbor* | — | — | — | — |

Times are in seconds. *No training time is reported for AttenAE+NNeighbor, which stores labeled instances rather than training a model.

The following are the results from Table 8 of the original paper:

| Model | E1 5-shot | E1 10-shot | E3 5-shot | E3 10-shot |
|---|---|---|---|---|
| Our TraFaultDia | 0.0460 | 0.0680 | 0.0651 | 0.0953 |
| GluAE+TEMAML | 0.0640 | 0.0942 | 0.0651 | 0.0954 |
| LinearAE+TEMAML | 0.0748 | 0.0946 | 0.0640 | 0.0944 |
| AttenAE+NNeighbor | 0.1860 | 0.1890 | — | — |

| Model | E2 5-shot | E2 10-shot | E4 5-shot | E4 10-shot |
|---|---|---|---|---|
| Our TraFaultDia | 0.0848 | 0.0977 | 0.0875 | 0.0974 |
| GluAE+TEMAML | 0.0801 | 0.0978 | 0.0731 | 0.0969 |
| AttenAE+CnnMAML | 0.0470 | 0.0472 | 0.0417 | 0.0490 |
| AttenAE+NNeighbor | 0.5140 | 0.5307 | — | — |
  • Training Time (Table 7): TraFaultDia's training time is competitive. In some E3 setups, it is faster than baselines (6-73 seconds less), while in E1, E2, E4, it might take slightly longer (4-29 seconds more). This variability suggests that the meta-training process can be sensitive to the specific dataset characteristics of the source MSS.

  • Testing Time (Table 8): MAML-related approaches (including TraFaultDia, GluAE+TEMAMLGluAE+TEMAML, LinearAE+TEMAMLLinearAE+TEMAML, AttenAE+CnnMAMLAttenAE+CnnMAML) are significantly faster during meta-testing compared to AttenAE+NNeighborAttenAE+NNeighbor. AttenAE+NNeighborAttenAE+NNeighbor takes approximately 4-11 times longer per meta-testing task. This is because NNeighbor requires comparing new instances to all existing labeled instances, which becomes computationally expensive as the dataset grows. This highlights a crucial advantage of MAML's rapid adaptation for real-world dynamic MSS.
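The cost asymmetry is easy to see with a toy example. The sketch below (scikit-learn, with random data standing in for AttenAE latent representations) shows why nearest-neighbor "training" is cheap but prediction compares each query against every stored instance, so per-task testing time grows with the labeled pool, whereas a MAML-adapted model pays a roughly fixed forward-pass cost.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical latent trace representations (e.g., autoencoder outputs)
X_train = rng.normal(size=(10_000, 128))
y_train = rng.integers(0, 5, size=10_000)  # 5 fault categories
X_query = rng.normal(size=(50, 128))       # one meta-testing query set (size illustrative)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)     # lazy learner: "fit" essentially stores the data
preds = knn.predict(X_query)  # each query is matched against all 10,000 stored instances
```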

The following are the results from Table 9 of the original paper (training time in minutes):

| Model | TrainTicket | OnlineBoutique |
|---|---|---|
| Our AttenAE | 22.4112 | 18.7502 |
| GluAE | 20.1517 | 18.6413 |
| LinearAE | 20.5443 | 15.5549 |
| OnlySpan | 13.4135 | 10.3123 |

The following are the results from Table 10 of the original paper (trace construction time in seconds per task):

| Model | TrainTicket 5-shot | TrainTicket 10-shot | OnlineBoutique 5-shot | OnlineBoutique 10-shot |
|---|---|---|---|---|
| Our AttenAE | 1.8436 | 2.1524 | 4.1426 | 5.0189 |
| GluAE | 1.8395 | 2.1541 | 4.3121 | 5.1921 |
| LinearAE | 1.8286 | 2.1486 | 4.1223 | 5.2066 |
| OnlySpan | 1.1671 | 1.1818 | 1.9065 | 2.1822 |
  • AttenAE Training and Trace Construction Time (Tables 9 & 10): AttenAE, GluAE, and LinearAE show comparable training times (in minutes) and trace construction times (in seconds per task). OnlySpan is faster (roughly half the time) for both training and construction because it omits logs. However, this speed comes at a significant cost to accuracy (approximately 10% lower for OnlySpan+TEMAML), demonstrating the value of multi-modal data fusion.
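To illustrate what attention-based fusion of span and log modalities can look like, here is a minimal PyTorch sketch. The module, dimensions, and mean-pooling are illustrative assumptions, not the paper's exact AttenAE architecture; the point is how multi-head attention can weigh the two modalities before compressing them into a latent trace representation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative fusion of span and log embeddings via multi-head attention."""
    def __init__(self, dim=128, heads=4, latent_dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads,
                                          batch_first=True)
        self.proj = nn.Linear(dim, latent_dim)  # compress to a latent representation

    def forward(self, span_emb, log_emb):
        # Treat the two modalities as a 2-token sequence; attention weighs them
        x = torch.stack([span_emb, log_emb], dim=1)  # (batch, 2, dim)
        fused, _ = self.attn(x, x, x)                # (batch, 2, dim)
        return self.proj(fused.mean(dim=1))          # (batch, latent_dim)

fusion = AttentionFusion()
span_emb, log_emb = torch.randn(8, 128), torch.randn(8, 128)
z = fusion(span_emb, log_emb)  # compressed trace representations, shape (8, 64)
```

In a full autoencoder, a decoder would reconstruct the inputs from `z` so the fusion can be trained on unlabeled traces; that part is omitted here for brevity.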

6.1.3. Answers to Research Questions

Based on the evaluation, the paper concludes:

  • TraFaultDia demonstrates effective (high accuracy) and fast (efficient adaptation times) within-system and cross-system adaptability for classifying abnormal traces into precise fault categories for MSS. This confirms its ability to address RQ1 and RQ2.

6.2. Ablation Studies / Parameter Analysis

The authors conducted ablation studies to understand the contribution of each component of TraFaultDia and the effect of key parameters.

  • Contribution of AttenAE (Multi-modal Fusion):

    • The comparison between TraFaultDia and the AttenAE alternative OnlySpan+TEMAML (Tables 3 and 4) clearly highlights the significant contribution of AttenAE's multi-modal data fusion. TraFaultDia consistently outperforms OnlySpan+TEMAML by approximately 10% across all E1-E4 setups. This demonstrates that incorporating logs alongside spans, and effectively fusing them via AttenAE, is crucial to the framework's overall effectiveness.
  • Contribution of Transformer-Encoder and Multi-Head Attention in AttenAE:

    • The performance difference between TraFaultDia and LinearAE+TEMAML/GluAE+TEMAML (Tables 3 and 4) showcases the superiority of TraFaultDia's multi-head attention mechanism for fusion over simpler linear or GLU-based alternatives, particularly in the more challenging OnlineBoutique scenarios (E2 and E4).
    • The poor performance of AttenAE+LinearMAML, AttenAE+RnnMAML, and AttenAE+LstmMAML (Tables 3 and 4) confirms the importance of the Transformer-Encoder (TE) as the base model for TEMAML. TE's self-attention mechanism is better suited to processing the latent trace representations than simpler or traditional sequential models.
  • Ablation Study on Number of Meta-Training Tasks: The following are the results from Table 11 of the original paper:

| Number of tasks | E1 5-shot | E1 10-shot | E2 5-shot | E2 10-shot | E3 5-shot | E3 10-shot | E4 5-shot | E4 10-shot |
|---|---|---|---|---|---|---|---|---|
| Current: 4 | 92.91 | 93.26 | 82.50 | 85.20 | 86.35 | 92.19 | 82.37 | 84.77 |
| 3 | 87.67 | 90.00 | 80.53 | 82.13 | 85.03 | 89.07 | 81.92 | 82.67 |
| 2 | 88.17 | 89.33 | 78.67 | 78.10 | 83.11 | 85.78 | 74.67 | 77.33 |

An ablation study was conducted to assess the impact of the number of meta-training tasks on TEMAML's performance.

  • The results in Table 11 show that using 4 meta-training tasks (the "Current" setting) consistently yields the best average accuracy across all 50 meta-testing tasks in each experiment, compared to using 2 or 3 meta-training tasks.
  • This finding aligns with MAML's nature as a multi-task learning algorithm: a greater variety of meta-training tasks (up to a point) helps the algorithm learn robust, generalizable initial parameters, which in turn adapt more effectively to new, unseen tasks in diverse contexts. In short, increasing the diversity of learning experiences during meta-training improves TraFaultDia's robustness.
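For readers unfamiliar with the adaptation mechanics behind these results, the sketch below shows one first-order MAML meta-update in PyTorch (the paper uses a first-order approximation in the outer loop; see Section 7.3). The base model is generic and the hyperparameters are placeholders, so this illustrates the algorithm rather than TEMAML's exact implementation.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

def fomaml_step(model, tasks, inner_lr=0.01, outer_lr=0.001):
    """One first-order MAML meta-update over a batch of tasks (illustrative).

    Each task is ((xs, ys), (xq, yq)): a few-shot support set and a query set.
    """
    loss_fn = nn.CrossEntropyLoss()
    meta_grad = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for (xs, ys), (xq, yq) in tasks:
        params = {n: p.detach().clone().requires_grad_(True)
                  for n, p in model.named_parameters()}
        # Inner loop: one gradient step on the task's labeled support instances
        support_loss = loss_fn(functional_call(model, params, (xs,)), ys)
        grads = torch.autograd.grad(support_loss, list(params.values()))
        adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
        # Outer loop: query loss at the adapted weights; first-order means the
        # adaptation step is treated as constant (no second derivatives)
        query_loss = loss_fn(functional_call(model, adapted, (xq,)), yq)
        qgrads = torch.autograd.grad(query_loss, list(adapted.values()))
        for (n, _), g in zip(adapted.items(), qgrads):
            meta_grad[n] += g / len(tasks)
    with torch.no_grad():
        for n, p in model.named_parameters():
            p -= outer_lr * meta_grad[n]
```

At meta-testing time only the inner-loop step runs on the few labeled instances of a new task, which is consistent with the small per-task adaptation times in Table 8.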

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces TraFaultDia, an innovative AIOps framework designed for the automatic categorization of abnormal traces into specific fault types within Microservice-Based Systems (MSS). By addressing the critical gap where existing AIOps methods detect anomalies and locate services but fall short of diagnosing root causes, TraFaultDia significantly reduces human effort in Root Cause Analysis (RCA). The framework's core strength lies in its two main components:

  1. AttenAE (Multi-Head Attention Autoencoder): Learns, without supervision, to fuse high-dimensional, multi-modal trace-related data (from both spans and logs) into compressed yet effective trace representations. This ensures both efficiency and effectiveness in handling complex trace data.

  2. TEMAML (Transformer-Encoder based Model-Agnostic Meta-Learning): Leverages meta-learning to provide few-shot learning capabilities, allowing rapid adaptation to new and rare fault categories with only a few labeled instances. Crucially, it also enables robust cross-system adaptability, meaning TraFaultDia can effectively categorize faults in new, unseen MSS even if trained on a different system.

    Evaluated on two benchmark MSS, TrainTicket and OnlineBoutique, using open datasets, TraFaultDia demonstrated strong performance. It achieved high average accuracies (e.g., 93.26% for TrainTicket within-system and 92.19% for TrainTicket cross-system with 10 labeled instances), validating its effectiveness and rapid adaptability in both within-system and cross-system contexts. These findings underscore TraFaultDia's potential to automate a crucial diagnostic step in AIOps, leading to faster resolutions, reduced downtime, and improved operational efficiency in complex MSS.

7.2. Limitations & Future Work

The authors acknowledge several limitations in their current study, primarily related to the datasets and the scope of evaluation, and suggest future research directions:

  • Dataset Limitations:

    • Source and Applicability: The evaluation relies on open datasets (DeepTraLog and Nezha) from benchmark MSS. The authors note that using datasets from real-world industrial systems would significantly enhance the applicability and relevance of their findings, better reflecting actual operational environments. Limited resources prevented access to such comprehensive real-world data.
    • Trace-Related Metrics: The current trace representations do not include performance metrics (e.g., CPU, memory usage, network traffic). The authors suggest that incorporating such metrics could improve the classification accuracy for performance-related anomalies (e.g., CPU contention, network delay), which were identified as areas of lower accuracy in their exploration results. While Nezha provides some of these metrics, other open datasets are lacking.
    • Limited Fault Type and MSS Diversity: The study used abnormal traces from 30 and 32 fault categories for TrainTicket and OnlineBoutique, respectively, and adopted a 5-way setup. While MAML is known to be effective for more complex multi-class classification tasks (e.g., 20-way, 50-way), the limited diversity of available open datasets restricted the scope. The authors aim to expand the evaluation to a broader range of fault types and MSS from additional datasets, which they could not find for this study.
  • Impact of Limitations: These external threats (dataset limitations) pose the possibility that TraFaultDia's generalization capabilities might not be fully utilized or demonstrated. Addressing them could further enhance the precision and robustness of trace-level RCA in MSS.

  • Future Work:

    • Improving Generalizability, Scalability, and Interpretability: Future efforts will focus on enhancing these aspects of TraFaultDia.
    • Real-World Industrial MSS Evaluation: The authors are actively working on deploying and selecting real-world industrial MSS to generate new datasets for further testing and validation of TraFaultDia.

7.3. Personal Insights & Critique

This paper presents a highly relevant and innovative solution to a pressing problem in AIOps for MSS. The shift from mere anomaly detection/localization to automatic fault categorization is a significant step towards truly autonomous IT operations.

Strengths:

  • Addresses a Clear Gap: The paper effectively identifies and addresses a critical missing piece in the AIOps puzzle: automated RCA for fault types. This has immense practical value for organizations managing complex MSS.
  • Intelligent Use of Meta-Learning: Meta-learning is an excellent choice for MSS due to their inherent heterogeneity and dynamic nature. The ability to adapt quickly to new systems and rare fault categories with few labels is a strong selling point for real-world adoption.
  • Multi-Modal Data Fusion: The AttenAE's approach to fusing diverse span and log attributes is robust. Recognizing the importance of logs (often overlooked by sequence-based trace analysis) significantly enhances diagnostic power. The unsupervised nature of AttenAE also makes it practical, as unlabeled data is abundant.
  • Computational Efficiency: By avoiding computationally expensive GNN constructions and frequent updates, TraFaultDia offers a more scalable solution for dynamic MSS.
  • Rigorous Evaluation: The defined research questions (within-system and cross-system) and the comprehensive set of baselines (including ablation studies) provide a thorough assessment of the framework's performance.

Potential Issues, Unverified Assumptions, and Areas for Improvement:

  • Defining Fault Categories: While the paper states TraFaultDia's use cases are scalable depending on "how fault categories are built," the process of defining and maintaining these categories in a rapidly evolving MSS can still be a significant challenge. The paper assumes these categories are well-defined (e.g., tied to faulty components and root causes). In practice, this manual definition process for meta-training data might require substantial initial human effort, even if the classification becomes automated later. Future work could explore semi-supervised or unsupervised approaches to discover fault categories.
  • Interpretability of AttenAE Representations: The paper focuses on the effectiveness of the compressed trace representations. While the Multi-Head Attention provides some insight into feature relevance during fusion, the resulting latent space representations are not directly interpretable. For RCA, especially for human practitioners, interpretability of why a trace falls into a category could be crucial. Incorporating XAI (Explainable AI) techniques could be a valuable addition.
  • MAML First-Order Approximation: The paper uses a first-order approximation in the MAML outer loop for computational efficiency. While common, this can sometimes lead to slightly suboptimal performance compared to full second-order MAML (which computes Hessian-vector products). Exploring whether more advanced MAML variants or meta-optimizers could yield further gains, especially in complex cross-system scenarios, might be beneficial if computational resources permit.
  • Impact of BERT Model Choice: The neural representation relies on BERT. The choice of BERT variant (e.g., base, large, domain-specific) and its pre-training data could impact performance. Robustness to different NLP models or even lightweight alternatives for resource-constrained environments could be explored.
  • Generalizability of "Sufficient Unlabeled Traces": The AttenAE requires "sufficient unlabeled traces" for training. The definition of "sufficient" can vary widely. In real-world MSS, trace data volume can be immense, but also sparse for certain less-frequented paths. The impact of AttenAE training data volume and diversity on downstream TEMAML performance could be investigated further.
  • Memory Footprint: While training and testing time are evaluated for efficiency, the memory footprint of the AttenAE and TEMAML models, especially when dealing with high-dimensional trace representations and potentially large meta-learning support/query sets, could be a practical concern for deployment in production AIOps environments.

Transferability to Other Domains: The core methodology of TraFaultDia — combining unsupervised multi-modal data fusion with meta-learning for few-shot, cross-domain classification — is highly transferable.

  • Log Anomaly Categorization: Beyond traces, this framework could be adapted to categorize raw system logs into fault types across different software systems or versions.

  • IoT Device Fault Diagnosis: In diverse IoT environments, devices generate various sensor data and event logs. TraFaultDia could be used to categorize device malfunctions across different models or deployments with limited labeled fault data.

  • Healthcare Diagnostics: Adapting to new diseases or patient cohorts with few labeled examples from multi-modal patient data (e.g., medical images, electronic health records, lab results).

  • Manufacturing Quality Control: Classifying product defects based on multi-modal inspection data (e.g., visual, vibrational, acoustic) across different product lines or manufacturing batches.

    Overall, TraFaultDia represents a robust and forward-thinking approach that holds significant promise for enhancing the autonomy and efficiency of AIOps in complex, dynamic software systems.
