Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning
TL;DR Summary
TraFaultDia uses meta-learning for few-shot abnormal trace classification in microservices, enabling accurate fault categorization and root cause identification across systems, validated on two open datasets.
Abstract
Microservice-based systems (MSS) may fail with various fault types. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human efforts are still required for diagnosing specific fault types and failure causes. This paper presents TraFaultDia, a novel AIOps framework to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for a MSS. TraFaultDia leverages meta-learning to train on several abnormal trace classification tasks with a few labeled instances from a MSS, enabling quick adaptation to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia's use cases are scalable depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two MSS, TrainTicket and OnlineBoutique, with open datasets where each fault category is linked to faulty system components (service/pod) and a root cause. TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. TraFaultDia achieves 93.26% and 85.20% accuracy on 50 new classification tasks for TrainTicket and OnlineBoutique, respectively, when trained within the same MSS with 10 labeled instances per category. In the cross-system context, when TraFaultDia is applied to a MSS different from the one it is trained on, TraFaultDia gets an average accuracy of 92.19% and 84.77% for the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, also with 10 labeled instances provided for each fault category per task in each system.
1. Bibliographic Information
1.1. Title
Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning
1.2. Authors
- YUQING WANG, University of Helsinki, Finland
- MIKA V. MANTYLA, University of Helsinki, Finland and University of Oulu, Finland
- SERGE DEMEYER, University of Antwerp, Belgium
- MUTLU BEYAZIT, University of Antwerp, Belgium
- JOANNA KISAAKYE, University of Antwerp, Belgium
- JESSE NYSSOLA, University of Helsinki, Finland
1.3. Journal/Conference
This paper is published at "Proc. ACM Softw. Eng. 2, FSE, Article FSE027 (July 2025)". This indicates publication in the proceedings of a top-tier software engineering conference, likely the ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE), which is highly reputable and influential in the software engineering and AIOps communities.
1.4. Publication Year
2025 (as indicated in the ACM Reference Format). The abstract states a publication UTC date of 2024-03-27, which suggests it was made available as a preprint or accepted for publication around that time, with the official proceedings year being 2025.
1.5. Abstract
Microservice-based systems (MSS) frequently experience failures of diverse types. While existing AIOps methods are adept at detecting abnormal traces and pinpointing responsible services, they still necessitate significant human intervention for diagnosing specific fault types and their underlying causes. This paper introduces TraFaultDia, an innovative AIOps framework designed to automatically classify abnormal traces into predefined fault categories within MSS. The framework frames this classification as a series of multi-class classification tasks. TraFaultDia employs a meta-learning approach, training on various abnormal trace classification tasks with a limited number of labeled instances from a single MSS. This enables it to rapidly adapt to new, unseen abnormal trace classification tasks across different MSS, requiring only a few labeled examples. Its utility is flexible, adapting to how fault categories are structured from anomalies within an MSS. Evaluated on two benchmark MSS, TrainTicket and OnlineBoutique, using open datasets where fault categories are linked to faulty components (service/pod) and root causes, TraFaultDia successfully automates the classification of abnormal traces. This facilitates the automatic identification of faulty system components and root causes, bypassing manual analysis. The framework achieves high accuracy: 93.26% for TrainTicket and 85.20% for OnlineBoutique on 50 new classification tasks when trained and tested within the same MSS (with 10 labeled instances per category). In a cross-system context, TraFaultDia maintains strong performance, achieving an average accuracy of 92.19% and 84.77% for TrainTicket and OnlineBoutique, respectively, across the same 50 new tasks, also with 10 labeled instances per fault category per task.
1.6. Original Source Link
- Official Source/PDF Link: https://arxiv.org/pdf/2403.18998v4.pdf
- Original Source Link: https://arxiv.org/abs/2403.18998
- Publication Status: Preprint (arXiv) and accepted for publication (ACM FSE 2025).
2. Executive Summary
2.1. Background & Motivation
The proliferation of microservice-based systems (MSS) has brought about increased complexity and dynamic behavior, making them prone to various fault types. Traditional Artificial Intelligence for IT Operations (AIOps) methods have made significant strides in detecting abnormal traces and identifying the specific services responsible for these anomalies. However, a critical gap persists: these methods typically do not extend to automatically diagnosing the specific fault types or conducting in-depth root cause analysis (RCA) to explain why a failure occurred. This deficiency necessitates substantial human effort from practitioners, who must manually analyze detected abnormal traces to determine fault categories and underlying causes. This manual process is time-consuming, requires deep expertise in software architecture and operational behaviors, and becomes increasingly infeasible as MSS grow in complexity and data volume. The inability to promptly and automatically categorize abnormal traces leads to delayed resolutions, increased downtime, and higher operational costs.
The core problem the paper aims to solve is this automation gap in fault diagnosis for abnormal traces in MSS. The existing AIOps methods can tell "what is wrong" (abnormal trace detected) and "where it is wrong" (responsible service located), but not "what kind of wrong" (fault category) or "why it is wrong" (root cause). The paper's innovative idea is to leverage meta-learning to enable an AIOps framework to automatically classify abnormal traces into specific fault categories, even in dynamic and heterogeneous MSS environments, and with limited labeled data.
2.2. Main Contributions / Findings
The paper introduces TraFaultDia, a novel AIOps framework with several key contributions:
- Automatic Fault Categorization: TraFaultDia automatically classifies abnormal traces into specific fault categories for MSS, addressing a critical gap in existing AIOps methods that primarily focus on detection and localization. This enables automatic identification of faulty system components and root causes, reducing manual analysis.
- Unsupervised Multi-Modal Trace Representation: The framework employs an unsupervised approach (AttenAE) to fuse high-dimensional, multi-modal trace-related data (including textual, time-based, and identity attributes from spans and logs) into compressed yet effective trace representations. This tackles the challenge of complex, diverse trace data (C2) and offers computational efficiency compared to graph-based methods.
- Meta-Learning for Adaptability (Few-Shot & Cross-System): TraFaultDia utilizes a Transformer-Encoder based Model-Agnostic Meta-Learning (TEMAML) model. This enables few-shot learning, allowing the framework to classify abnormal traces from rare fault categories with only a few labeled instances (C3). Crucially, it also provides strong transfer learning capabilities, allowing quick adaptation to new, unseen abnormal trace classification tasks across different, heterogeneous MSS (C1).
- Empirical Validation: Extensive evaluation on two representative benchmark MSS (TrainTicket and OnlineBoutique) with open datasets demonstrates the framework's effectiveness.
  - Within-system adaptability (RQ1): Achieves high average accuracy of 93.26% (TrainTicket) and 85.20% (OnlineBoutique) on 50 new tasks with only 10 labeled instances per category.
  - Cross-system adaptability (RQ2): Demonstrates robust performance with average accuracies of 92.19% (TrainTicket) and 84.77% (OnlineBoutique) when trained on one MSS and applied to another, also with 10 labeled instances per category.
These findings highlight TraFaultDia's capability to significantly enhance the efficiency and effectiveness of RCA in complex MSS environments by automating a crucial diagnostic step.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the TraFaultDia framework, a foundational understanding of several key concepts is essential:
- Microservice-Based Systems (MSS):
  - Conceptual Definition: An architectural style where a complex application is composed of small, independent services, each running in its own process and communicating with lightweight mechanisms, often HTTP APIs. These services are typically organized around business capabilities and can be deployed independently.
  - Relevance: The paper addresses the challenges inherent in MSS, such as their complex and dynamic nature, which leads to various fault types and makes traditional fault diagnosis difficult.
- Artificial Intelligence for IT Operations (AIOps):
  - Conceptual Definition: The application of Artificial Intelligence (AI) and Machine Learning (ML) techniques to IT operations tasks, including monitoring, anomaly detection, root cause analysis, and automation.
  - Relevance: The paper aims to advance AIOps capabilities by addressing the gap in automated fault categorization within MSS, moving beyond simple anomaly detection and localization.
- Traces, Spans, and Logs: These are fundamental concepts for understanding system behavior and diagnosing issues in distributed systems like MSS.
  - Traces: Represent the end-to-end path of a single user request or transaction as it propagates through various services in an MSS. They provide a holistic view of an operation.
  - Spans: The basic unit of work within a trace. Each span represents an individual operation performed by a particular service. Spans have a parent-child relationship, forming a hierarchical tree structure (as shown in Figure 1). Spans carry attributes like service name, operation name, start time, end time, and unique span IDs (e.g., a480f2.0 and a480f2.1, where .1 is a child of .0).
  - Logs: Textual records generated by service instances during their execution. They capture detailed events, status information, errors, and debugging messages at specific points in time. Logs are often associated with particular spans or traces (as shown in Figure 2).
  - Relevance: TraFaultDia leverages information from both spans and logs, recognizing their multi-modal nature, to construct comprehensive trace representations for fault categorization.
- Root Cause Analysis (RCA):
  - Conceptual Definition: The process of identifying the fundamental reason for a problem or failure. In IT operations, RCA seeks to determine why an issue occurred, beyond just identifying the symptoms.
  - Relevance: The paper aims to automate a significant part of RCA by automatically classifying abnormal traces into specific fault categories, thereby enabling faster identification of faulty components and root causes without manual effort.
- Meta-Learning (Learning to Learn):
  - Conceptual Definition: A field of machine learning where the goal is to train models that can learn new tasks or adapt to new environments quickly and efficiently, often with limited training data. It is about learning the inductive biases that allow for rapid generalization.
  - Key Concepts:
    - Few-shot Learning: A sub-field of meta-learning where models are trained to perform well on new tasks given only a very small number of labeled examples (the "shots") for each new class. This is crucial for handling rare fault categories (C3).
    - Transfer Learning: The process of taking knowledge gained from one task or domain and applying it to a different but related task or domain. In this paper, it refers to transferring learning from one MSS to another (C1).
    - N-Way K-Shot: A common setup in few-shot learning. $N$ refers to the number of distinct classes (fault categories) in a given task, and $K$ refers to the number of labeled examples ("shots") available per class in the support set for adaptation.
  - Relevance: TraFaultDia uses meta-learning to achieve few-shot and cross-system adaptability, addressing the challenges of imbalanced abnormal trace distribution (C3) and MSS heterogeneity (C1).
- Autoencoders (AE):
  - Conceptual Definition: A type of artificial neural network used for unsupervised learning of efficient data codings (representations). It consists of two parts: an encoder that compresses the input data into a lower-dimensional latent space representation, and a decoder that reconstructs the original input from this compressed representation. The network is trained to minimize the reconstruction error.
  - Relevance: The AttenAE component in TraFaultDia is an autoencoder, specifically designed to fuse high-dimensional, multi-modal trace-related data (C2) into unified, low-dimensional representations in an unsupervised manner.
- Attention Mechanism, Multi-Head Attention, and Self-Attention:
  - Conceptual Definition: A mechanism that allows a neural network to focus on specific parts of its input when processing information. Instead of processing the entire input equally, attention assigns weights to different input elements, highlighting the most relevant ones.
  - Attention Formula: The fundamental scaled dot-product attention mechanism is defined as:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
    - $Q$ (Query), $K$ (Key), $V$ (Value): These are matrices derived from the input data. The query is used to attend to keys, and the attention weights are then applied to the values.
    - $QK^T$: Computes the dot product between the query and all keys, measuring their similarity.
    - $\sqrt{d_k}$: A scaling factor (square root of the dimension of the keys) used to prevent the dot products from becoming too large, which could push the softmax function into regions with tiny gradients.
    - softmax: Normalizes the scores into probability distributions, indicating how much attention to pay to each value.
    - $V$: The values are weighted by the softmax probabilities and summed up.
  - Multi-Head Attention: An extension where the attention mechanism is performed multiple times in parallel, each with different learned linear projections of $Q$, $K$, $V$. The outputs from these multiple attention heads are then concatenated and linearly transformed. This allows the model to jointly attend to information from different representation subspaces at different positions.
  - Self-Attention: A special case of attention where the query, key, and value all come from the same sequence of data. This allows the model to weigh the importance of different elements within the same input sequence to generate an output.
  - Relevance: TraFaultDia uses Multi-Head Attention within AttenAE to fuse span and log data, and Self-Attention within the Transformer-Encoder (TE) to process trace representations, enabling it to recognize and integrate the most relevant features.
- Transformer-Encoder:
  - Conceptual Definition: The encoder part of the Transformer architecture, a neural network model primarily known for its efficiency in handling sequential data, particularly in natural language processing (NLP). It relies entirely on self-attention mechanisms to draw global dependencies between input and output, rather than using recurrent or convolutional layers.
  - Relevance: The TEMAML component uses a Transformer-Encoder (TE) as its base model, chosen for its effectiveness in recognizing and integrating features from latent trace representations due to its self-attention mechanism.
- Neural Representations (BERT, WordPiece Tokenization):
  - Conceptual Definition: Methods to convert textual data (like service operations or log messages) into dense numerical vectors (embeddings) that capture their semantic meaning.
    - BERT (Bidirectional Encoder Representations from Transformers): A powerful pre-trained NLP model that generates contextualized word embeddings, meaning the embedding for a word depends on its surrounding words.
    - WordPiece Tokenization: A subword tokenization algorithm used by BERT that breaks words into smaller, frequently occurring subword units. This helps handle out-of-vocabulary (OOV) words and reduces the vocabulary size.
  - Relevance: TraFaultDia uses these techniques to create robust vector representations for textual attributes from spans and logs, addressing the challenge of diverse and evolving textual data in MSS that may contain OOV words.
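The scaled dot-product attention formula listed among the concepts above can be made concrete with a small sketch. The version below uses only the Python standard library and plain lists; the function names `softmax` and `attention` and the toy matrices are illustrative assumptions, not code from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on list-of-lists matrices.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (output, weights), where weights[i][j] is how much
    query i attends to key j.
    """
    scale = math.sqrt(len(K[0]))  # sqrt(d_k)
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    output = [
        [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights


# Toy example (hypothetical values): two queries attending over three keys.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, w = attention(Q, K, V)
```

Passing the same matrix as `Q`, `K`, and `V` gives self-attention; passing different matrices for the query and for the key/value pair gives the cross-modal form that the methodology section describes for fusing spans with logs.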
3.2. Previous Works
The paper contextualizes its contributions by discussing existing research in trace representation and trace classification.
- Trace Representation:
  - Graph Neural Networks (GNNs) for Trace Graphs: Many existing AIOps methods (e.g., [Chen et al. 2023, 2022; Raeiszadeh et al. 2023; Zhang et al. 2022a,b]) construct trace graphs, where spans are nodes and interactions are edges. These graphs capture complex service interactions and are effective for detecting abnormalities and locating faulty services.
    - Differentiation from TraFaultDia: The paper argues GNNs are not suitable for its context due to high computational expense and scalability issues, especially with traces having hundreds of spans and dynamic MSS environments where graphs would need frequent, burdensome updates. TraFaultDia opts for unified, compressed trace representations without explicit graph construction, making it more scalable.
  - Sequence Models for Traces: Some anomaly detection studies (e.g., [Du et al. 2023; Kohyarnejadfard et al. 2022; Nedelkoski et al. 2019b]) treat traces as sequences of span attributes and use models like Long Short-Term Memory (LSTM) networks to model these sequences.
    - Differentiation from TraFaultDia: These approaches typically overlook logs, which the paper identifies as a crucial modality for recognizing many fault categories (e.g., code errors, configuration issues). TraFaultDia explicitly fuses both span and log data.
- Trace Classification:
  - Binary Classification for Anomaly Detection: Many MSS anomaly detection studies (e.g., [Kohyarnejadfard et al. 2019; Kong et al. 2024; Zhang et al. 2022b]) use binary classification to distinguish between normal and abnormal traces.
    - Differentiation from TraFaultDia: This is distinct from TraFaultDia's goal, which is multi-class classification of already detected abnormal traces into specific fault categories.
  - Multi-Class Classification (Nedelkoski et al. 2019a): The study by Nedelkoski et al. (2019a) is cited as the only related work that performs multi-class classification of abnormal traces. It uses a Convolutional Neural Network (CNN) to classify traces into four time series-based fault categories (incremental, mean shift, gradual increase, cylinder) by characterizing traces as sequences of span call path attributes.
    - Differentiation from TraFaultDia: This approach is deemed insufficient for TraFaultDia's broader scope because:
      - It only considers time-series data on call path, ignoring other critical multi-modal data (logs, other span attributes) vital for diagnosing a wider range of fault types (C2).
      - It lacks mechanisms to address MSS heterogeneity (C1) or few-shot learning for rare fault categories (C3), which TraFaultDia explicitly tackles with meta-learning.
3.3. Technological Evolution
The field of IT operations has evolved from manual monitoring and rule-based systems to sophisticated AIOps solutions. Early AIOps focused on basic anomaly detection (identifying "if" something is wrong) and simple alerting. As distributed systems became more complex, particularly with the rise of MSS, the need for understanding the flow of requests led to the adoption of distributed tracing (traces, spans, logs). This enabled AIOps to move towards localizing issues (identifying "where" something is wrong), often using techniques like GNNs to map service dependencies or sequence models to identify anomalous patterns in execution flows.
The current paper represents an evolution beyond detection and localization towards automated RCA and diagnosis (identifying "what kind of wrong" and "why"). It addresses the limitations of prior approaches by:
- Comprehensive Data Utilization: Moving beyond single-modality (e.g., only spans) to multi-modal data fusion (spans + logs).
- Scalability for Dynamics: Offering an alternative to computationally heavy GNNs for dynamic MSS.
- Adaptability for Heterogeneity: Incorporating meta-learning to overcome the challenges of diverse MSS architectures and the dynamic appearance of new, rare fault types, a capability largely absent in prior multi-class classification attempts for MSS.
3.4. Differentiation Analysis
The core differences and innovations of TraFaultDia compared to existing methods lie in its comprehensive approach to fault categorization:
- Problem Scope: While previous works primarily focused on binary anomaly detection (normal vs. abnormal) or localizing faulty services, TraFaultDia tackles the more complex multi-class classification of already abnormal traces into specific, human-understandable fault categories and root causes.
- Trace Representation:
  - Multi-Modal Fusion: Unlike methods that only use spans or logs, TraFaultDia explicitly fuses textual, time-based, and identity attributes from both spans and logs using an AttenAE with a Multi-Head Attention mechanism. This addresses C2 (high-dimensional, multi-modal data) and ensures a richer, more effective representation for detailed fault diagnosis.
  - Efficiency over Graphs: It avoids the computational burden and update complexity of GNNs by creating compressed, unified representations without requiring explicit graph construction or frequent re-computation of dependencies.
- Learning Paradigm:
  - Meta-Learning (MAML): This is the most significant innovation. TraFaultDia employs MAML with a Transformer-Encoder (TEMAML) to achieve few-shot learning and cross-system adaptability. This directly addresses C1 (MSS heterogeneity) and C3 (imbalanced abnormal trace distribution) by enabling the model to quickly adapt to new MSS and rare fault categories with minimal labeled data. Prior multi-class classification attempts for MSS (like the CNN approach) lacked this meta-learning capability, making them less robust and adaptable across diverse systems or to novel fault types.
- Unsupervised Representation Learning: The use of AttenAE allows TraFaultDia to learn effective trace representations from unlabeled traces, which are abundant in MSS, making the approach practical and scalable.
In essence, TraFaultDia fills a crucial gap by providing an AIOps framework that not only detects and locates but also diagnoses specific fault categories and root causes, doing so efficiently, with limited labeled data, and across diverse MSS.
4. Methodology
4.1. Principles
The core idea behind TraFaultDia is to combine robust, unsupervised trace representation learning with an adaptive meta-learning approach for multi-class fault categorization. The method is grounded in two main principles:
- Unsupervised Multi-modal Data Fusion for Effective Representation: Microservice traces are high-dimensional and multi-modal, containing diverse information in spans (textual, time-based, identity) and logs (textual, severity, identity). To make this complex data tractable and effective for analysis, TraFaultDia employs an Autoencoder architecture, specifically AttenAE, to learn compressed, unified representations. The unsupervised nature of Autoencoders is critical because unlabeled trace data is readily available, while labeled data for specific fault categories is scarce. The Multi-Head Attention mechanism within AttenAE is designed to fuse these diverse data modalities, focusing on the most relevant features.
- Meta-Learning for Few-Shot and Cross-System Adaptability: Microservice environments are inherently heterogeneous (C1) and constantly evolving, leading to imbalanced fault category distributions (C3) with many rare fault types. Traditional supervised learning models would struggle with few-shot learning (very few examples per category) and would require extensive retraining for each new MSS or fault type. TraFaultDia overcomes this by adopting Model-Agnostic Meta-Learning (MAML) trained with a Transformer-Encoder (TEMAML). MAML learns a good initialization of model parameters that can quickly adapt to new, unseen tasks (new fault categories or new MSS) with just a few gradient updates and a small number of labeled examples. This "learning to learn" principle allows TraFaultDia to be efficient and practical in dynamic AIOps scenarios.
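To illustrate the MAML idea of learning an initialization that adapts in a few gradient steps, here is a deliberately tiny sketch. It is an assumption-laden toy, not the paper's TEMAML: the model is a single scalar weight fitting `y = a * x`, tasks differ only in the slope `a`, and all names (`sample_task`, `adapt`, `maml_train`) and hyperparameters are hypothetical. Because the model is this simple, the gradient through the inner step can be written in closed form.

```python
import random


def mse(w, xs, ys):
    """Mean squared error of the scalar model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w, xs, ys):
    """Derivative of mse with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def sample_task(rng, k=10):
    """Toy task: slope a drawn from [1, 3]; returns (support, query) sets."""
    a = rng.uniform(1.0, 3.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        return xs, [a * x for x in xs]

    return draw(), draw()


def adapt(w, xs, ys, inner_lr=0.1):
    """Inner loop: one gradient step on a task's support set."""
    return w - inner_lr * mse_grad(w, xs, ys)


def maml_train(steps=2000, inner_lr=0.1, outer_lr=0.05, seed=0):
    """Outer loop: update the shared initialization w across sampled tasks."""
    rng = random.Random(seed)
    w = 0.0  # meta-learned initialization
    for _ in range(steps):
        (xs_s, ys_s), (xs_q, ys_q) = sample_task(rng)
        w_adapted = adapt(w, xs_s, ys_s, inner_lr)
        # Chain rule through the inner step:
        # d(w_adapted)/dw = 1 - inner_lr * (second derivative of support loss),
        # and for this model the second derivative is 2 * mean(x^2).
        curvature = 2.0 * sum(x * x for x in xs_s) / len(xs_s)
        meta_grad = mse_grad(w_adapted, xs_q, ys_q) * (1.0 - inner_lr * curvature)
        w -= outer_lr * meta_grad
    return w


w0 = maml_train()
```

The outer update evaluates the loss on the query set after adaptation, which is what distinguishes MAML from ordinary training on pooled data. In TraFaultDia the adapted model is a Transformer-Encoder classifier over trace representations rather than a scalar regressor.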
4.2. Core Methodology In-depth (Layer by Layer)
The TraFaultDia framework consists of two main components: the Multi-Head Attention Autoencoder (AttenAE) and the Transformer-Encoder based Model-Agnostic Meta-Learning (TEMAML) model.
4.2.1. TraFaultDia Workflow Overview
The overall workflow of TraFaultDia is illustrated in Figure 3.
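Figure 3: The interaction between AttenAE and TEMAML in the TraFaultDia framework. AttenAE encodes and decodes the traces of a microservice system to construct trace representations, and TEMAML then performs meta-learning-based abnormal trace classification, split into meta-training and meta-testing phases.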
The workflow proceeds in two main stages:
- AttenAE for Trace Representation Construction: For a given MSS, AttenAE is initially trained on a large volume of unlabeled traces. Its purpose is to learn, in an unsupervised manner, how to fuse the high-dimensional, multi-modal trace-related data into compressed yet effective trace representations. Once trained, the encoder part of AttenAE is used independently to generate these trace representations for any new, unseen traces within that specific MSS.
- TEMAML for Abnormal Trace Classification: This component uses a Transformer-Encoder (TE) as its base model, trained via meta-learning.
  - Meta-training Phase: TEMAML is trained on a series of abnormal trace classification tasks (called meta-training tasks) sampled from a source MSS. During this phase, TEMAML learns initial parameters that enable rapid adaptation.
  - Meta-testing Phase: After meta-training, TEMAML is evaluated on new, unseen abnormal trace classification tasks (called meta-testing tasks). These meta-testing tasks can originate from the same MSS used during meta-training or from an entirely different MSS. For these tasks, TEMAML leverages the optimized AttenAE encoder of the respective MSS to convert abnormal traces into their trace representations before performing classification.
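The meta-training and meta-testing tasks described above are N-way K-shot classification tasks. A minimal sketch of how one such task could be sampled from labeled trace representations follows; the function name `sample_episode`, the `n_query` parameter, the fault category names, and the toy data are all assumptions made for illustration, and the paper's actual sampling procedure may differ.

```python
import random


def sample_episode(labeled, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task.

    labeled maps a fault category name to a list of trace representations.
    Returns (support, query), each a list of (representation, class_index).
    """
    eligible = [c for c, items in labeled.items() if len(items) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough categories with sufficient labeled traces")
    categories = rng.sample(eligible, n_way)
    support, query = [], []
    for class_index, category in enumerate(categories):
        # Draw disjoint support and query examples for this category.
        picked = rng.sample(labeled[category], k_shot + n_query)
        support += [(x, class_index) for x in picked[:k_shot]]
        query += [(x, class_index) for x in picked[k_shot:]]
    return support, query


# Hypothetical data: 8 fault categories with 20 two-dimensional vectors each.
toy = {f"fault_{i}": [[float(i), float(j)] for j in range(20)] for i in range(8)}
support, query = sample_episode(toy, n_way=5, k_shot=10, n_query=5, rng=random.Random(42))
```

The support set is what the model adapts on (10 labeled instances per category in the reported experiments), while the query set measures accuracy after adaptation. The value `n_way=5` here is only an example.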
4.2.2. AttenAE for Constructing Trace Representations
The AttenAE architecture is depicted in Figure 4. For a given MSS, AttenAE processes a set of traces, where each trace comprises a sequence of spans and logs. The trace set thus collectively refers to the spans and logs across all traces.
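Figure 4: Schematic of the AttenAE architecture, showing how attributes are extracted from the spans and logs of a trace and then projected and reconstructed by an encoder-decoder structure with multi-head attention.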
4.2.2.1. Span Preprocessing and Vector Generation
For each span, TraFaultDia extracts the following attributes, as identified in Section 2.2:
- Textual Attributes: call component and path.
- Time-based Attributes: span start time and span end time.
- Identity Attributes: trace ID and span ID.
The trace IDs are used to group spans correctly into their respective traces. The attributes are then processed into numerical vector representations:
- Time-based Attributes: These (in UNIX format) are normalized within the context of each span to account for individual characteristics and scales. The normalized values are concatenated to form a single vector $V_{\mathrm{numeric}}$ for each span.
- Span IDs: To capture hierarchical relationships, shared common prefixes in span IDs are abstracted away, retaining only hierarchical-level digits. For example, a480f2.0, a480f2.1, a480f2.2 might be reassigned as 1.0, 1.1, 1.2. These normalized span IDs within the trace context are converted into a vector $V_{\mathrm{span\_id}}$.
- Textual Attributes: The call component and path are concatenated to form a singular attribute called "service operation." To represent this textual information, a neural representation method (similar to [Le and Zhang 2021]) is used, leveraging the pre-trained BERT model with WordPiece tokenization to handle evolving and out-of-vocabulary (OOV) words. This process involves three steps:
  - Step 1: Preprocessing: Convert uppercase letters to lowercase, substitute specific variables (e.g., Prod1234 with ProductID), and remove non-alphabetic characters.
  - Step 2: Tokenization: Apply WordPiece tokenization to break "service operations" into subwords.
  - Step 3: Neural Representation: Feed the subwords into a BERT base model (specifically, the word embeddings generated by its last encoding layer) and calculate the sentence embedding of each "service operation" by averaging its word embeddings. This yields a vector representation $V_{\mathrm{operation}}$.
Finally, for each span within a trace, the vector representations from these three types of attributes are concatenated to form a composite vector $V_{\mathrm{span}}$, whose dimensionality is the sum of the dimensionalities of its three parts:
$
V_{\mathrm{span}} = \mathrm{Concatenate}(V_{\mathrm{operation}}, V_{\mathrm{numeric}}, V_{\mathrm{span\_id}})
$
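The span preprocessing described above (span ID abstraction, time normalization, and concatenation) can be sketched as follows. Several details are assumptions made for this illustration rather than facts from the paper: span IDs are split at the first dot and prefixes are numbered in order of first appearance, timestamps are min-max scaled against the whole trace, each ID is turned into a list of its numeric levels, and the dictionary keys `span_id`, `start`, and `end` are hypothetical.

```python
def normalize_span_ids(span_ids):
    """Replace each shared prefix (text before the first '.') with a small integer.

    Prefixes are numbered from 1 in order of first appearance, so
    'a480f2.0' and 'a480f2.1' become '1.0' and '1.1'.
    """
    prefix_index = {}
    normalized = []
    for span_id in span_ids:
        prefix, _, suffix = span_id.partition(".")
        index = prefix_index.setdefault(prefix, len(prefix_index) + 1)
        normalized.append(f"{index}.{suffix}" if suffix else str(index))
    return normalized


def normalize_times(spans):
    """Min-max scale start and end timestamps against the whole trace (assumption)."""
    t_min = min(s["start"] for s in spans)
    t_max = max(s["end"] for s in spans)
    width = (t_max - t_min) or 1.0  # guard against a zero-length trace
    return [[(s["start"] - t_min) / width, (s["end"] - t_min) / width] for s in spans]


def build_span_vectors(spans, operation_vectors):
    """Concatenate operation, time, and span ID parts into one vector per span.

    Assumes all span IDs have the same depth; deeper hierarchies would
    need padding to keep the vectors the same length.
    """
    ids = normalize_span_ids([s["span_id"] for s in spans])
    times = normalize_times(spans)
    vectors = []
    for op_vec, time_vec, span_id in zip(operation_vectors, times, ids):
        id_vec = [float(part) for part in span_id.split(".")]
        vectors.append(list(op_vec) + time_vec + id_vec)
    return vectors


# Toy trace with two spans and made-up 3-dimensional operation embeddings.
spans = [
    {"span_id": "a480f2.0", "start": 100.0, "end": 200.0},
    {"span_id": "a480f2.1", "start": 120.0, "end": 150.0},
]
op_vecs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
span_vectors = build_span_vectors(spans, op_vecs)
```

Each resulting vector has the layout operation part, then time part, then span ID part, mirroring the order of the Concatenate expression above.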
4.2.2.2. Log Preprocessing and Vector Generation
From logs, TraFaultDia extracts textual (log component, message, severity level) and identity (trace ID, span ID) attributes. Trace IDs are used to group logs into their corresponding traces.
- Textual Attributes: The log component, message, and severity level are concatenated to form a single attribute called "log event."
- Neural Representation: Similar to service operations, a neural representation method is applied to "log events" following the same three-step process (preprocessing, WordPiece tokenization, BERT embeddings). This generates a vector representation $V_{\mathrm{log}}$.
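The averaging step that turns per-subword embeddings into one vector per log event (or service operation) can be illustrated as follows. This is a minimal sketch: the three-dimensional toy vectors stand in for BERT's last-layer word embeddings, and the lookup table `toy_embeddings` is a hypothetical placeholder, not a real model.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Toy embeddings standing in for BERT outputs (assumption for illustration).
toy_embeddings = {"error": [1.0, 0.0, 2.0], "order": [3.0, 2.0, 0.0]}
tokens = ["error", "order"]
v_log = mean_pool([toy_embeddings[t] for t in tokens])
```

The result has the same dimensionality as a single token embedding regardless of how many subwords the log event contains, which is what lets events of different lengths share one vector space.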
4.2.2.3. Trace Representation Construction
For a given MSS, AttenAE's encoder constructs trace representations from the processed span and log vectors.
- Projection into Common Feature Space: The input vectors $V_{\mathrm{span}}$ and $V_{\mathrm{log}}$ are first projected into a common feature space: $ V'_{\mathrm{span}} = g(W_{\mathrm{span}}V_{\mathrm{span}} + b_{\mathrm{span}}) $ $ V'_{\mathrm{log}} = g(W_{\mathrm{log}}V_{\mathrm{log}} + b_{\mathrm{log}}) $
  - $g$: Denotes the activation function (e.g., ReLU).
  - $W_{\mathrm{span}}$, $W_{\mathrm{log}}$: The respective weight matrices for the linear transformation.
  - $b_{\mathrm{span}}$, $b_{\mathrm{log}}$: The bias vectors.
- Multi-Head Attention Fusion: The projected vectors $V'_{\mathrm{span}}$ and $V'_{\mathrm{log}}$ are then fused using a Multi-Head Attention mechanism. This mechanism, as defined in Equation 2, captures complex interactions between spans and logs. $ \left\{ \begin{array}{l} \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \\ \mathrm{head}_i = \mathrm{Attention}(QW^Q, KW^K, VW^V) \\ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concatenate}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O \end{array} \right. $
  - $Q$, $K$, $V$: Refer to the query, key, and value matrices, respectively.
  - $QK^T$: Computes attention scores by taking the dot product of Query and Key.
  - $\sqrt{d_k}$: A scaling factor for numerical stability, where $d_k$ is the dimensionality of the keys.
  - softmax: Applies a softmax function to produce the attention distribution (weights).
  - $\mathrm{head}_i$: Each attention head computes attention after transforming $Q$, $K$, $V$ with separate learned weight matrices ($W^Q$, $W^K$, $W^V$).
  - MultiHead: Concatenates the outputs of all attention heads and applies a final linear transformation with weight matrix $W^O$.

  To fuse spans and logs into a trace representation $Z$, TraFaultDia sets $V'_{\mathrm{span}}$ as the Query ($Q$), and $V'_{\mathrm{log}}$ as both the Key ($K$) and Value ($V$): $ Z = \mathrm{MultiHead}(V'_{\mathrm{span}}, V'_{\mathrm{log}}, V'_{\mathrm{log}}) $ This assignment reflects the architectural understanding that spans define the trace structure and service communications, while logs provide detailed contextual event information. The output $Z$ is the trace representation for a given trace. For a set of traces, this process generates the corresponding set of trace representations.
- Decoder for Reconstruction: The AttenAE's decoder aims to reconstruct the original span and log vectors from the trace representations, effectively inverting the encoder's process: $ \hat{V}_{\mathrm{span}} = g(W'_{\mathrm{span}}Z + b'_{\mathrm{span}}) $ $ \hat{V}_{\mathrm{log}} = g(W'_{\mathrm{log}}Z + b'_{\mathrm{log}}) $
  - $\hat{V}_{\mathrm{span}}$ and $\hat{V}_{\mathrm{log}}$: The reconstructed span and log vectors.
  - $g$: The activation function.
  - $W'_{\mathrm{span}}$, $W'_{\mathrm{log}}$: Respective weight matrices.
  - $b'_{\mathrm{span}}$, $b'_{\mathrm{log}}$: Respective bias vectors.
- Loss Function for AttenAE Training: AttenAE is trained by minimizing the overall loss $\mathcal{L}$, which is the sum of squared errors between the original and reconstructed vectors for both spans and logs: $ \min_{\Psi} \mathcal{L} = \|\hat{V}_{\mathrm{span}} - V_{\mathrm{span}}\|^2 + \|\hat{V}_{\mathrm{log}} - V_{\mathrm{log}}\|^2 $
  - $\Psi$: Represents all learnable parameters within the AttenAE (encoder and decoder weights and biases).
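The fusion and loss steps above can be made concrete with a small numeric sketch. It is deliberately simplified and is not the authors' code: it uses a single attention head with no learned projection matrices, two-dimensional toy vectors, and hypothetical reconstructed values, so it only illustrates how spans act as queries over logs and how the squared-error loss is summed over both modalities.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def squared_error(reconstructed, original):
    """Squared L2 distance between two vectors."""
    return sum((r - o) ** 2 for r, o in zip(reconstructed, original))

# Toy projected vectors (assumptions): 3 spans as queries, 2 logs as keys and values.
spans = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logs = [[1.0, 0.0], [0.0, 1.0]]
Z = attention(spans, logs, logs)

# Reconstruction loss for one hypothetical span vector and one hypothetical log vector.
loss = squared_error([0.5, 1.0], [1.0, 1.0]) + squared_error([0.0, 2.0], [0.0, 1.0])
```

`Z` has one row per span, each a weighted mix of the log vectors; the third span is equally similar to both logs and therefore receives equal weights. The full model would additionally apply per-head projections and the output matrix $W^O$.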
4.2.3. TEMAML for Few-Shot Abnormal Trace Classification Across MSS
The TEMAML learning process is depicted in Figure 5. It consists of meta-training and meta-testing phases.
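Figure 5: Schematic of the TEMAML learning process, which consists of an inner loop and an outer loop that iteratively optimize the parameters $\theta$ to classify abnormal traces.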
4.2.3.1. Base Model for Abnormal Trace Classification
The base model within TEMAML is a Transformer-Encoder (TE), denoted as $f_{\theta}$. It performs multi-class classification for abnormal traces. The workflow for $f_{\theta}$ is:
- Input: Receives the trace representations $Z$ (generated by AttenAE's encoder) for all abnormal traces in a given task.
- Self-Attention Mechanism: The input is processed through TE's self-attention mechanism. This mechanism helps TE weigh the most relevant parts and capture dependencies within each trace representation to recognize the fault type of each trace. It follows the Multi-Head Attention mechanism (Equation 2), where the input $Z$ serves as $Q$, $K$, and $V$: $ \mathrm{output} = \mathrm{MultiHead}(Z, Z, Z) $
- Classification Layers: The output from self-attention is passed through:
  - A pooling layer to highlight key features.
  - A dropout layer to prevent overfitting.
  - A fully connected layer to reshape the output.
  - A softmax classifier to compute probabilities for each fault category.
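The classification layers listed above can be sketched as a single function. This is a rough illustration under stated assumptions rather than the paper's configuration: mean pooling is assumed (the text does not name the pooling type), dropout is omitted because it is inactive at inference time, and the weights are hand-picked toy values for a three-category problem.

```python
import math

def classify(sequence, weights, bias):
    """Mean-pool a sequence of vectors, apply a linear layer, then softmax."""
    dim = len(sequence[0])
    pooled = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]
    logits = [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy self-attention output (2 positions, 2 features) and toy layer parameters.
attended = [[2.0, 0.0], [0.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one row per fault category
bias = [0.0, 0.0, 0.0]
probs = classify(attended, weights, bias)
predicted_category = probs.index(max(probs))
```

The output is a probability distribution over fault categories, and the predicted category is the index with the highest probability.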
4.2.3.2. Meta-training
The objective of meta-training is to train the TE ($f_{\theta}$) to find robust initial parameters $\theta$ that allow it to quickly adapt to diverse abnormal trace classification tasks from any MSS.
- Tasks: TE is trained on several abnormal trace classification tasks (meta-training tasks), denoted as $T = \{T_i\}$, sampled from a source MSS. Each task $T_i$ is configured in an N-Way K-Shot setup.
  - $S_i$: The support set, containing $N$ distinct fault categories, each with $K$ labeled example abnormal traces.
  - $Q_i$: The query set, containing $N$ distinct fault categories, each with $M$ labeled trace instances. Here, $M > K$ to ensure robust optimization.
  - Each instance pairs a trace representation, generated by the optimized AttenAE's encoder for the MSS, with its corresponding fault category label.
- Two-Loop Optimization (MAML): TEMAML optimizes $f_{\theta}$'s parameters using two nested loops:
  - Inner Loop (Task-level Adaptation): For each meta-training task $T_i$, the base model $f_{\theta}$ (initialized with the current meta-parameters $\theta$) is adapted to this specific task. A few gradient descent steps are performed on the support set $S_i$ to update $\theta$ to task-specific parameters $\theta'_i$: $ \theta'_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{T_i}(f_{\theta}(S_i)) $
    - $\alpha$: The learning rate for inner loop updates.
    - $\mathcal{L}_{T_i}(f_{\theta}(S_i))$: The loss of model $f_{\theta}$ on the support set $S_i$ of task $T_i$.
  - Outer Loop (Meta-level Optimization): After adapting to all meta-training tasks (obtaining $\theta'_i$ for each $T_i$), the meta-parameters $\theta$ are updated to minimize the aggregated loss on the query sets across all meta-training tasks: $ \min_{\theta} \mathcal{L}(\theta) = \sum_{T_i \in T} \mathcal{L}_{T_i}(f_{\theta'_i}(Q_i)) $
    - $\mathcal{L}_{T_i}(f_{\theta'_i}(Q_i))$: The loss of the adapted model $f_{\theta'_i}$ on the query set $Q_i$ of task $T_i$.

    The paper uses a first-order approximation for the outer loop update, which avoids the computationally intensive second-order derivatives (gradients through gradients): $ \theta \gets \theta - \beta \nabla_{\theta} \sum_{T_i \in T} \mathcal{L}_{T_i}(f_{\theta'_i}(Q_i)) $
    - $\beta$: The learning rate for outer loop updates.

    This outer loop update is performed for a predetermined number of optimization steps, yielding the optimal meta-parameters $\theta^*$. The base model TE is then initialized with $\theta^*$, becoming the optimized base model $f_{\theta^*}$, which possesses enhanced adaptability.
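The two nested loops can be illustrated on a deliberately tiny problem. The sketch below is not TEMAML itself: it swaps the Transformer-Encoder for a one-parameter model $y = \theta x$ with a mean-squared-error loss, uses two hypothetical regression tasks instead of trace classification tasks, and implements the first-order variant by evaluating the query gradient at the adapted parameters and applying it directly to $\theta$. All names and learning rates are assumptions chosen for illustration.

```python
def grad(theta, data):
    """Gradient of the mean squared error of y = theta * x with respect to theta."""
    return sum(2 * (theta * x - y) * x for x, y in data) / len(data)

def loss(theta, data):
    """Mean squared error of y = theta * x on the given (x, y) pairs."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def fomaml_step(theta, tasks, alpha, beta):
    """One outer-loop update using the first-order approximation."""
    meta_grad = 0.0
    for support, query in tasks:
        theta_i = theta - alpha * grad(theta, support)  # inner loop: adapt on the support set
        meta_grad += grad(theta_i, query)               # query gradient taken at theta_i
    return theta - beta * meta_grad                     # outer loop: update meta-parameters

def make_task(slope):
    """Toy task: points on the line y = slope * x, split into support and query."""
    support = [(1.0, slope), (2.0, 2 * slope)]
    query = [(3.0, 3 * slope)]
    return support, query

tasks = [make_task(2.0), make_task(4.0)]
theta = 0.0
for _ in range(200):
    theta = fomaml_step(theta, tasks, alpha=0.05, beta=0.01)

# Adapting the meta-initialization to one task with a single inner step.
support, query = make_task(2.0)
adapted = theta - 0.05 * grad(theta, support)
```

In this toy setting the meta-parameter settles between the two task optima, and one inner-loop step on a task's support set lowers that task's query loss, which is the behavior the meta-initialization is trained for.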
4.2.3.3. Meta-testing
In the meta-testing phase, the optimized TE ($f_{\theta^*}$) is used to adapt to new, unseen abnormal trace classification tasks (meta-testing tasks) from any MSS.
- Task Configuration: Each meta-testing task is configured similarly to meta-training tasks (e.g., N-Way K-Shot).
- Adaptation and Evaluation: To adapt to a specific meta-testing task:
  - The optimized base model is fine-tuned on the support set of that task to obtain task-specific parameters.
  - The adapted model is then used to classify abnormal traces in the query set, and its performance is evaluated.
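An N-Way K-Shot task, as used in both meta-training and meta-testing, can be assembled with a simple sampler. The sketch below is illustrative only: the pool of trace identifiers is synthetic, the function name is hypothetical, and the query size of 15 traces per category is an assumption made for the example, not a value stated in this section.

```python
import random

def sample_episode(pool, n_way, k_shot, m_query, rng):
    """Draw one task from a pool mapping fault category -> list of trace ids.
    Returns (support, query) lists of (trace, category) pairs with no overlap."""
    categories = rng.sample(sorted(pool), n_way)
    support, query = [], []
    for category in categories:
        traces = rng.sample(pool[category], k_shot + m_query)
        support += [(trace, category) for trace in traces[:k_shot]]
        query += [(trace, category) for trace in traces[k_shot:]]
    return support, query

rng = random.Random(0)  # fixed seed so the episode is reproducible
# Synthetic pool: 10 hypothetical novel fault categories with 30 traces each.
pool = {f"F{c}": [f"F{c}-trace{i}" for i in range(30)] for c in range(1, 11)}
support, query = sample_episode(pool, n_way=5, k_shot=10, m_query=15, rng=rng)
```

The support set is what the model is fine-tuned on during adaptation, and the disjoint query set is what accuracy is measured on.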
5. Experimental Setup
5.1. Datasets
The study evaluates TraFaultDia using trace data from two benchmark MSS, TrainTicket and OnlineBoutique, sourced from open datasets:
- DeepTraLog [FudanSELab 2024]: Provides normal and abnormal traces across 14 fault categories for TrainTicket. These categories represent various real-world anomaly scenarios (asynchronous interaction, multi-instance, configuration, monolithic faults).
- Nezha [IntelligentDDS 2024]: Contains normal and abnormal traces for both TrainTicket and OnlineBoutique. It includes abnormal traces from five fault types (CPU contention, CPU consumption, network delay, error return, exception code defect) applied to various service pods. Each pod associated with a fault type forms a unique fault category.

Fault Dataset Construction: The authors combined data from DeepTraLog and Nezha to create a comprehensive fault dataset:
- TrainTicket: Includes abnormal traces from 30 fault categories (F1-F30).
- OnlineBoutique: Includes abnormal traces from 32 fault categories (B1-B32).

For experiments, these fault categories were further divided:
- TrainTicket: 20 base fault categories and 10 novel fault categories.
- OnlineBoutique: 22 base fault categories and 10 novel fault categories.

The composition of base and novel categories for each system was designed to include a random mix of fault categories with varying numbers of spans and logs, ensuring a representative evaluation.
The following are the results from Table 1 of the original paper:
| System | Dataset | Fault group | Fault categories |
| --- | --- | --- | --- |
| TrainTicket | DeepTraLog | Asynchronous service invocations related faults | F1. Asynchronous message sequence error; F2. Unexpected order of data requests; F13. Unexpected order of price optimization steps |
| TrainTicket | DeepTraLog | Multiple service instances related faults | F8. Key passing issues in requests; F11. BOM data is updated in an unexpected sequence; F12. Price status query ignores expected service outputs |
| TrainTicket | DeepTraLog | Configuration faults | F3. JVM and Docker configuration mismatch; F4. SSL offloading issue; F5. High request load; F7. Overload of requests to a third-party service |
| TrainTicket | DeepTraLog | Monolithic faults | F6. SQL error of a dependent service; F9. Bi-directional CSS display error; F10. API errors in BOM update; F14. Locked product incorrectly included in CPI calculation |
| TrainTicket | Nezha | CPU contention | F23.travel, F25.contact, F26.food service pods |
| TrainTicket | Nezha | Network delay | F28.basic, F29.travel, F30.route, F27.security, F24.verification-code service pods |
| TrainTicket | Nezha | Message return errors | F16.basic, F15.contact, F18.food, F19.verification-code service pods |
| TrainTicket | Nezha | Exception code defects | F17.basic, F21.route, F22.price, F20.travel service pods |
| OnlineBoutique | Nezha | CPU contention | B4.shipping, B14.cart, B18.currency, B19.email, B26.recommendation, B31.adservice, B9.payment, B11.frontend service pods |
| OnlineBoutique | Nezha | CPU consumption | B8.recommendation, B12.frontend, B24.productcatalog, B28.shipping, B17.checkout, B20.email, B32.adservice service pods |
| OnlineBoutique | Nezha | Network delay | B10.currency, B1.cart, B15.checkout, B22.productcatalog, B27.shipping, B21.payment, B25.recommendation, B29.adservice, B7.email service pods |
| OnlineBoutique | Nezha | Message return errors | B6.frontend, B23.productcatalog, B2.checkout, B30.adservice service pods |
| OnlineBoutique | Nezha | Exception code defects | B5.adservice, B13.frontend, B3.productcatalog, B16.checkout service pods |
The following are the results from Table 2 of the original paper:
| System | Statistic | Mean | Min | Max |
| --- | --- | --- | --- | --- |
| TrainTicket | Unique traces per fault category | 1196 | 26 | 2546 |
| TrainTicket | Spans per trace | 79 | 1 | 345 |
| TrainTicket | Logs per trace | 44 | 1 | 340 |
| OnlineBoutique | Unique traces per fault category | 443 | 32 | 1018 |
| OnlineBoutique | Spans per trace | 53 | 1 | 190 |
| OnlineBoutique | Logs per trace | 51 | 4 | 184 |
Table 2 provides descriptive statistics on the traces in the fault dataset, highlighting the variability in the number of unique traces per fault category (e.g., 26 to 2546 for TrainTicket) and the complexity of individual traces (e.g., 1 to 345 spans, 1 to 340 logs per trace for TrainTicket). This data underscores the challenges C2 (high-dimensional, multi-modal data) and C3 (imbalanced distribution) that TraFaultDia aims to address.
Unlabeled Traces for AttenAE:
- 3960 unlabeled traces were randomly selected from each MSS (3360 for training, 570 for validation) to train AttenAE.
- These traces were ensured to primarily consist of normal traces and did not overlap with the fault datasets used for TEMAML training/evaluation.
5.2. Evaluation Metrics
The study employs several standard metrics to evaluate the effectiveness and efficiency of TraFaultDia and its baselines.
- Accuracy:
  - Conceptual Definition: In classification tasks, accuracy measures the proportion of total predictions that were correct. It is a straightforward metric that indicates the overall correctness of a model. For multi-class classification, it represents the number of correctly classified instances across all classes divided by the total number of instances. The paper argues its suitability for their 5-way meta-learning setup, where always predicting a single class would yield only 20% accuracy.
  - Mathematical Formula: $ \mathrm{Accuracy} = \frac{\mathrm{Number~of~Correct~Predictions}}{\mathrm{Total~Number~of~Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: The count of instances where the model's predicted fault category matches the true fault category.
    - Total Number of Predictions: The total number of abnormal traces that the model attempted to classify.
- T-tests [Fisher 1992]:
  - Conceptual Definition: A statistical hypothesis test used to determine if there is a significant difference between the means of two groups (in this case, the accuracy scores of two different models). It helps ascertain if observed differences are likely due to the intervention (e.g., using TraFaultDia vs. a baseline) or merely random chance.
  - Relevance: Used to statistically compare the accuracy of TraFaultDia against effective baselines across 50 meta-testing tasks, determining if the observed performance difference is statistically significant.
- Cohen's D [Cohen 2013]:
  - Conceptual Definition: A measure of effect size, used to quantify the magnitude of the difference between two groups (e.g., the difference in accuracy between two models). Unlike p-values from t-tests, which indicate statistical significance, Cohen's D indicates practical significance or the strength of the observed effect.
  - Mathematical Formula (for two independent groups with equal variances): $ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p} $ where $ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $
  - Symbol Explanation:
    - $\bar{x}_1$, $\bar{x}_2$: The means of the two groups being compared (e.g., average accuracy of TraFaultDia and a baseline).
    - $s_1$, $s_2$: The standard deviations of the two groups.
    - $n_1$, $n_2$: The sample sizes of the two groups.
    - $s_p$: The pooled standard deviation.
  - Interpretation:
    - $d \approx 0.2$: Small effect size
    - $d \approx 0.5$: Medium effect size
    - $d \approx 0.8$ or higher: Large effect size
  - Relevance: Provides a quantitative measure of how much TraFaultDia's performance differs from baselines, indicating the practical importance of any observed differences.
- Training Time:
  - Conceptual Definition: The total time taken for a model to complete its training process. For TraFaultDia, this includes the training time for AttenAE on unlabeled traces and the meta-training time for TEMAML.
  - Relevance: Measures the efficiency of learning, particularly important for AIOps systems that need to be deployed and updated.
- Testing Time:
  - Conceptual Definition: The time taken for a trained model to make predictions or adapt to new tasks during the evaluation phase. For TraFaultDia, this refers to the time for TEMAML to adapt to a meta-testing task and classify its query set.
  - Relevance: Crucial for real-time AIOps systems where rapid fault diagnosis is essential to minimize downtime.
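Two of the metrics above are easy to check by hand. The sketch below first reproduces the 20% chance level of a 5-way task, then computes Cohen's d from summary statistics. The inputs for the second part are the E1 5-shot figures reported in Table 3 for TraFaultDia (92.91±2.10) and LinearAE+TEMAML (89.15±2.29); treating the ± values as standard deviations over the 50 meta-testing tasks is an assumption made here, not something this section states.

```python
import math

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d with a pooled standard deviation (equal-variance form)."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# A balanced 5-way query set where the classifier always predicts one category.
actual = ["F1", "F2", "F3", "F4", "F5"] * 3
always_f1 = ["F1"] * len(actual)
chance_level = accuracy(always_f1, actual)

# Summary statistics taken from Table 3 (E1, 5-shot); n = 50 meta-testing tasks.
d = cohens_d(92.91, 2.10, 50, 89.15, 2.29, 50)
```

Under that assumption the computed effect size rounds to 1.71, which matches the value Table 6 reports for this pair of models.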
5.3. Baselines
Given the absence of directly comparable AIOps methods for cross-system few-shot abnormal trace categorization, the authors construct baselines by combining different trace representation and classification approaches. These baselines also serve as an ablation study to understand the contribution of TraFaultDia's components.
The neural representation method (using BERT with WordPiece tokenization for textual attributes) employed in TraFaultDia is also applied to all baselines to ensure a fair comparison and isolate the performance differences attributable to the core TraFaultDia architecture.
1. AttenAE Alternative (Trace Representation): These baselines modify how trace representations are constructed, while still using TEMAML for classification.
- OnlySpan+TEMAML: This baseline follows related work ([Nedelkoski et al. 2019a]) by treating each trace solely as a sequence of spans. It uses only the span attributes identified in TraFaultDia (textual, time-based, identity) to construct trace representations, omitting log data entirely.
- LinearAE+TEMAML: This baseline uses a simplified Autoencoder (LinearAE) for multi-modal fusion. Instead of Multi-Head Attention, it employs linear projection to fuse span and log attributes, with neither the gating mechanism of GluAE nor the attention mechanism of AttenAE.
- GluAE+TEMAML: This baseline uses a Gated Linear Unit (GLU)-based Autoencoder (GluAE) for multi-modal fusion. It constructs trace representations by fusing the same span and log attributes as TraFaultDia but uses GLUs, adapted from modality fusion methods for MSS ([Lee et al. 2023]).

2. Transformer Encoder Alternatives (Base Model): These baselines replace the Transformer-Encoder (TE) in TEMAML with different base models, while retaining AttenAE for trace representation and MAML for meta-learning.
- AttenAE+LinearMAML: Uses a basic linear model as the base classifier within MAML. This serves as the simplest possible classification approach.
- AttenAE+RnnMAML: Uses a Recurrent Neural Network (RNN) as the base model. RNNs are traditionally used for sequential data and have shown effectiveness in some binary anomaly detection tasks.
- AttenAE+LstmMAML: Uses a Long Short-Term Memory (LSTM) network as the base model. LSTMs are a type of RNN particularly good at learning long-term dependencies in sequences.
- AttenAE+CnnMAML: Uses a Convolutional Neural Network (CNN) as the base model. CNNs have been used in related work for multi-class abnormal trace classification ([Nedelkoski et al. 2019a]).

3. TEMAML Alternatives (Meta-Learning / Classification Strategy): These baselines replace the entire TEMAML component (or its meta-learning strategy) with other meta-learning methods or traditional machine learning models. For these, AttenAE is still used to generate trace representations. These alternatives are evaluated only on E1 and E2 (within-system) because their underlying mechanisms are not designed for cross-system transfer learning in the same way MAML is.
- AttenAE+TEMatchNet: Uses a Matching Network ([Vinyals et al. 2016]), another meta-learning approach for few-shot learning, as the classifier.
- AttenAE+ProtoNet: Uses a Prototypical Network ([Snell et al. 2017]), another popular meta-learning algorithm that classifies by finding the closest "prototype" for each class in an embedding space.
- AttenAE+NNeighbor: Uses a Nearest Neighbor classifier. This is a fundamental machine learning algorithm that classifies instances based on the majority class of their closest neighbors. Its performance provides insight into the inherent separability of the trace representations.
- AttenAE+DTree: Uses a Decision Tree classifier. This is a non-parametric supervised learning method used for classification and regression, which can also serve as a basic classification baseline.
6. Results & Analysis
6.1. Core Results Analysis
The evaluation focuses on answering two research questions:
- RQ1: Within-system adaptability: How effectively and efficiently can TraFaultDia adapt to new abnormal trace classification tasks within the same MSS?
- RQ2: Cross-system adaptability: How effectively and efficiently can TraFaultDia adapt to new abnormal trace classification tasks in a different MSS?

The experiments (E1-E4) test these aspects:
- E1 (TrainTicket → TrainTicket): RQ1 for TrainTicket.
- E2 (OnlineBoutique → OnlineBoutique): RQ1 for OnlineBoutique.
- E3 (OnlineBoutique → TrainTicket): RQ2 (trained on OnlineBoutique, tested on TrainTicket).
- E4 (TrainTicket → OnlineBoutique): RQ2 (trained on TrainTicket, tested on OnlineBoutique).
6.1.1. Effectiveness
The following are the results from Table 3 of the original paper:
| Model | E1 (TrainTicket→TrainTicket) 5-shot | E1 10-shot | E3 (OnlineBoutique→TrainTicket) 5-shot | E3 10-shot |
| --- | --- | --- | --- | --- |
| Our TraFaultDia | 92.91±2.10 (74.67-100.0) | 93.26±1.40 (76.00-100.0) | 86.35±2.00 (70.67-100.0) | 92.19±1.99 (74.67-100.0) |
| AttenAE alternative: | | | | |
| OnlySpan+TEMAML | 80.64±2.84 (57.33-97.33) | 78.77±2.80 (60.00-97.33) | 79.25±2.89 (57.33-97.33) | 80.19±3.10 (56.00-97.33) |
| Multihead attention fusion alternatives: | | | | |
| LinearAE+TEMAML | 89.15±2.29 (73.33-100.0) | 90.59±2.43 (70.67-100.0) | 83.09±2.55 (62.67-97.33) | 90.61±2.01 (72.00-100.0) |
| GluAE+TEMAML | 92.21±1.73 (77.33-100.0) | 93.07±1.64 (77.33-100.0) | 85.07±2.38 (66.67-100.0) | 94.40±2.19 (72.00-100.0) |
| Transformer encoder alternatives: | | | | |
| AttenAE+LinearMAML | 45.84±2.21 (25.33-61.33) | 45.36±2.16 (30.67-60.00) | 43.81±1.99 (28.00-58.67) | 43.87±1.93 (29.33-60.00) |
| AttenAE+RnnMAML | 49.65±2.09 (37.33-64.00) | 42.88±1.93 (24.00-58.67) | 48.45±1.75 (38.67-65.33) | 47.07±1.88 (34.67-58.67) |
| AttenAE+LstmMAML | 41.39±2.20 (21.33-56.00) | 42.67±1.91 (29.33-56.00) | 40.32±1.68 (22.67-52.00) | 42.29±1.79 (25.33-56.00) |
| AttenAE+CnnMAML | 57.06±2.85 (41.33-81.00) | 69.20±2.19 (56.00-88.00) | 49.04±1.80 (38.67-64.00) | 69.47±2.65 (48.00-89.33) |
| TEMAML alternatives: | | | | |
| AttenAE+TEMatchNet | 76.56±2.80 (49.33-93.33) | 76.05±2.37 (50.67-94.67) | − | − |
| AttenAE+ProtoNet | 57.25±0.03 (40.00-74.67) | 59.68±0.03 (44.00-76.00) | − | − |
| AttenAE+NNeighbor | 88.19±0.02 (74.67-98.67) | 92.56±0.02 (78.00-100.0) | − | − |
| AttenAE+DTree | 66.80±0.03 (46.67-88.00) | 77.09±0.03 (54.67-96.00) | − | − |
The following are the results from Table 4 of the original paper:
| Model | E2 (OnlineBoutique→OnlineBoutique) 5-shot | E2 10-shot | E4 (TrainTicket→OnlineBoutique) 5-shot | E4 10-shot |
| --- | --- | --- | --- | --- |
| Our TraFaultDia | 82.50±2.35 (65.33-98.67) | 85.20±2.33 (66.67-98.67) | 82.37±2.07 (64.00-97.33) | 84.77±2.28 (68.00-98.67) |
| AttenAE alternative: | | | | |
| OnlySpan+TEMAML | 72.83±2.40 (57.33-88.00) | 73.15±2.81 (46.67-92.00) | 71.81±2.25 (56.00-85.33) | 73.60±2.21 (57.33-85.33) |
| Multihead attention fusion alternatives: | | | | |
| LinearAE+TEMAML | 76.15±2.59 (60.00-95.00) | 78.21±2.50 (64.00-96.00) | 75.81±2.45 (52.00-89.33) | 74.32±2.44 (53.33-88.00) |
| GluAE+TEMAML | 80.61±2.96 (58.67-98.67) | 77.49±2.67 (48.00-94.67) | 74.96±2.76 (54.67-94.67) | 77.57±2.70 (56.00-94.67) |
| Transformer encoder alternatives: | | | | |
| AttenAE+LinearMAML | 42.59±3.63 (20.00-77.33) | 40.75±3.53 (21.33-68.00) | 47.01±3.59 (25.33-74.67) | 44.35±4.10 (20.00-89.33) |
| AttenAE+RnnMAML | 72.59±2.50 (54.67-94.67) | 64.75±2.53 (46.67-80.00) | 72.58±2.58 (54.70-94.70) | 71.01±2.80 (56.00-89.33) |
| AttenAE+LstmMAML | 54.80±2.11 (40.00-70.67) | 55.97±2.25 (41.33-69.33) | 56.19±1.92 (38.67-72.00) | 59.71±2.05 (42.67-77.33) |
| AttenAE+CnnMAML | 80.10±2.16 (60.00-94.67) | 83.07±3.29 (68.00-97.33) | 79.01±2.63 (56.00-97.33) | 84.08±2.76 (65.33-100.0) |
| TEMAML alternatives: | | | | |
| AttenAE+TEMatchNet | 76.29±3.00 (54.67-96.00) | 73.11±2.94 (50.67-94.67) | − | − |
| AttenAE+ProtoNet | 74.51±0.03 (53.33-92.00) | 76.59±0.04 (58.66-94.67) | − | − |
| AttenAE+NNeighbor | 80.96±0.03 (64.00-98.67) | 84.75±0.03 (62.80-98.67) | − | − |
| AttenAE+DTree | 66.99±0.02 (54.67-80.00) | 73.79±0.03 (58.67-82.67) | − | − |
Overall Performance of TraFaultDia: TraFaultDia achieves high average accuracy across all four experimental setups (E1-E4) in both 5-shot and 10-shot configurations.
- Within-System (RQ1):
  - TrainTicket (E1): 92.91% (5-shot) and 93.26% (10-shot).
  - OnlineBoutique (E2): 82.50% (5-shot) and 85.20% (10-shot).
- Cross-System (RQ2):
  - OnlineBoutique trained, TrainTicket tested (E3): 86.35% (5-shot) and 92.19% (10-shot).
  - TrainTicket trained, OnlineBoutique tested (E4): 82.37% (5-shot) and 84.77% (10-shot).

These results support TraFaultDia's effectiveness in both within-system and cross-system contexts and demonstrate its few-shot learning capability.

Comparison with Baselines:
- Most Effective Baselines: GluAE+TEMAML, LinearAE+TEMAML, AttenAE+CnnMAML, and AttenAE+NNeighbor are identified as the most effective baselines, achieving over 80% average accuracy in at least one setup.
- Robustness: TraFaultDia maintains high accuracy more consistently across the experimental setups than these baselines.
  - GluAE+TEMAML and LinearAE+TEMAML perform similarly to TraFaultDia in E1 and E3 (TrainTicket tasks), but fall behind in E2 and E4 (OnlineBoutique tasks) by roughly 2-8 and 6-10 percentage points, respectively. This suggests that AttenAE's multi-head attention fusion is more robust across diverse MSS.
  - AttenAE+CnnMAML performs reasonably well in E2 and E4 (79-84%), but significantly worse in E1 and E3 (49-69%). This indicates that CNNs might struggle more with the specific patterns or complexities of TrainTicket's traces or its fault categories when combined with MAML.
  - AttenAE+NNeighbor achieves accuracy comparable to TraFaultDia in within-system setups (E1 and E2, roughly 0.5-5 percentage points lower), indicating good local classification capability on the learned AttenAE representations. However, its lack of cross-system adaptability makes it unsuitable for RQ2.
- Sequence Models (RNN/LSTM): AttenAE+RnnMAML and AttenAE+LstmMAML generally perform poorly, mostly below 70% (AttenAE+RnnMAML exceeds 70% only in some OnlineBoutique setups), highlighting the limitations of traditional sequence models for this multi-class fault categorization task, even when combined with MAML.
- Basic Models: The remaining baselines, such as AttenAE+LinearMAML, AttenAE+ProtoNet, and AttenAE+DTree, show significantly lower accuracy, emphasizing the necessity of TraFaultDia's specific architectural choices. AttenAE+LinearMAML stays below 50% in every setup, and AttenAE+ProtoNet stays below 60% on TrainTicket (E1), although it reaches 74-77% on OnlineBoutique (E2). This indicates that simple models or different meta-learning paradigms are insufficient for the complexity of the task.
Statistical Significance and Effect Size: The following are the results from Table 5 of the original paper:
| Our TraFaultDia vs. | E1 5-shot | E1 10-shot | E3 5-shot | E3 10-shot |
| --- | --- | --- | --- | --- |
| GluAE+TEMAML | 0.0719 | 0.5347 | 0.0045*** | 7.74e-07*** |
| LinearAE+TEMAML | 1.62e-13*** | 1.14e-09*** | 1.87e-10*** | 0.0005*** |
| AttenAE+NNeighbor | 7.10e-29*** | 0.0006*** | − | − |

| Our TraFaultDia vs. | E2 5-shot | E2 10-shot | E4 5-shot | E4 10-shot |
| --- | --- | --- | --- | --- |
| GluAE+TEMAML | 0.0006*** | 6.95e-28*** | 1.70e-27*** | 9.15e-27*** |
| AttenAE+CnnMAML | 6.65e-07*** | 0.0003*** | 2.01e-10*** | 0.176 |
| AttenAE+NNeighbor | 1.11e-05*** | 0.1752 | − | − |
***p < 0.001; **p < 0.01; *p < 0.05
The following are the results from Table 6 of the original paper:
| Our TraFaultDia vs. | E1 5-shot | E1 10-shot | E3 5-shot | E3 10-shot |
| --- | --- | --- | --- | --- |
| GluAE+TEMAML | 0.36 | 0.13 | 0.58 | -1.06 |
| LinearAE+TEMAML | 1.71 | 1.35 | 1.42 | 0.79 |
| AttenAE+NNeighbor | 3.18 | 0.71 | − | − |

| Our TraFaultDia vs. | E2 5-shot | E2 10-shot | E4 5-shot | E4 10-shot |
| --- | --- | --- | --- | --- |
| GluAE+TEMAML | 0.71 | 3.07 | 3.04 | 2.88 |
| AttenAE+CnnMAML | 1.06 | 0.75 | 1.42 | 0.27 |
| AttenAE+NNeighbor | 0.93 | 0.27 | − | − |
- Statistical Significance (Table 5): TraFaultDia shows statistically significant differences ($p < 0.05$) from the most effective baselines in nearly all experimental setups. The exceptions are GluAE+TEMAML in E1 (5-shot, 10-shot), AttenAE+NNeighbor in E2 (10-shot), and AttenAE+CnnMAML in E4 (10-shot). In the significant cases, the observed accuracy differences are unlikely to be due to random chance.
- Effect Size (Table 6): TraFaultDia shows large positive effect sizes ($d \geq 0.8$) against GluAE+TEMAML in E2 (10-shot) and E4, with a medium-to-large effect (0.71) in E2 5-shot, and against LinearAE+TEMAML and AttenAE+NNeighbor in every 5-shot setup where they were compared. The negative Cohen's d for GluAE+TEMAML in E3 10-shot (-1.06) means that GluAE+TEMAML performed better in that specific case, which is also reflected in Table 3, where GluAE+TEMAML achieved 94.40% compared to TraFaultDia's 92.19%.
Exploration Results:
The lowest accuracies for TraFaultDia and effective baselines in OnlineBoutique's meta-testing tasks (E2 and E4) were observed for performance-related issues like CPU contention, CPU consumption, and network delay on the same pods (ranging 48-68%). The authors suggest that incorporating performance metrics (e.g., CPU, memory, network traffic) directly into trace representation construction could enhance the categorization accuracy for these types of faults.
6.1.2. Efficiency
The following are the results from Table 7 of the original paper:
| Model | E1 5-shot | E1 10-shot | E3 5-shot | E3 10-shot |
| --- | --- | --- | --- | --- |
| Our TraFaultDia | 36.3394 | 25.8387 | 46.0045 | 62.7578 |
| GluAE+TEMAML | 19.9185 | 20.6976 | 42.3725 | 78.0125 |
| LinearAE+TEMAML | 17.7576 | 45.1150 | 119.6440 | 30.5474 |
| AttenAE+NNeighbor* | − | − | − | − |

| Model | E2 5-shot | E2 10-shot | E4 5-shot | E4 10-shot |
| --- | --- | --- | --- | --- |
| Our TraFaultDia | 40.9426 | 65.3120 | 32.4136 | 53.3256 |
| GluAE+TEMAML | 20.1432 | 58.8608 | 21.6602 | 47.0818 |
| AttenAE+CnnMAML | 23.9832 | 36.7725 | 15.6358 | 23.6530 |
| AttenAE+NNeighbor* | − | − | − | − |
The following are the results from Table 8 of the original paper:
| Model | E1 5-shot | E1 10-shot | E3 5-shot | E3 10-shot |
| --- | --- | --- | --- | --- |
| Our TraFaultDia | 0.0460 | 0.0680 | 0.0651 | 0.0953 |
| GluAE+TEMAML | 0.0640 | 0.0942 | 0.0651 | 0.0954 |
| LinearAE+TEMAML | 0.0748 | 0.0946 | 0.0640 | 0.0944 |
| AttenAE+NNeighbor | 0.1860 | 0.1890 | − | − |

| Model | E2 5-shot | E2 10-shot | E4 5-shot | E4 10-shot |
| --- | --- | --- | --- | --- |
| Our TraFaultDia | 0.0848 | 0.0977 | 0.0875 | 0.0974 |
| GluAE+TEMAML | 0.0801 | 0.0978 | 0.0731 | 0.0969 |
| AttenAE+CnnMAML | 0.0470 | 0.0472 | 0.0417 | 0.0490 |
| AttenAE+NNeighbor | 0.5140 | 0.5307 | − | − |
- Training Time (Table 7): TraFaultDia's training time is competitive. In some E3 setups, it is faster than baselines (6-73 seconds less), while in E1, E2, and E4 it can take slightly longer (4-29 seconds more). This variability suggests that the meta-training process can be sensitive to the specific dataset characteristics of the source MSS.
- Testing Time (Table 8): MAML-related approaches (including TraFaultDia, GluAE+TEMAML, LinearAE+TEMAML, and AttenAE+CnnMAML) are significantly faster during meta-testing than AttenAE+NNeighbor, which takes roughly 2-11 times longer per meta-testing task. This is because NNeighbor requires comparing new instances to all existing labeled instances, which becomes computationally expensive as the dataset grows. This highlights a crucial advantage of MAML's rapid adaptation for real-world dynamic MSS.

The following are the results from Table 9 of the original paper:
| Model | TrainTicket | OnlineBoutique |
| --- | --- | --- |
| Our AttenAE | 22.4112 | 18.7502 |
| GluAE | 20.1517 | 18.6413 |
| LinearAE | 20.5443 | 15.5549 |
| OnlySpan | 13.4135 | 10.3123 |
The following are the results from Table 10 of the original paper:
| Model | TrainTicket 5-shot | TrainTicket 10-shot | OnlineBoutique 5-shot | OnlineBoutique 10-shot |
| --- | --- | --- | --- | --- |
| Our AttenAE | 1.8436 | 2.1524 | 4.1426 | 5.0189 |
| GluAE | 1.8395 | 2.1541 | 4.3121 | 5.1921 |
| LinearAE | 1.8286 | 2.1486 | 4.1223 | 5.2066 |
| OnlySpan | 1.1671 | 1.1818 | 1.9065 | 2.1822 |

- AttenAE Training and Trace Construction Time (Tables 9 and 10): AttenAE, GluAE, and LinearAE show comparable training times (in minutes) and trace construction times (in seconds per task). OnlySpan is faster (roughly half the time) for both training and construction because it omits logs. However, this speed comes at a significant cost to accuracy (roughly 7-14 percentage points lower for OnlySpan+TEMAML in Tables 3 and 4), demonstrating the value of multi-modal data fusion.
6.1.3. Answers to Research Questions
Based on the evaluation, the paper concludes:
- TraFaultDia demonstrates effective (high accuracy) and fast (efficient adaptation times) within-system and cross-system adaptability for classifying abnormal traces into precise fault categories for MSS. This confirms its ability to address RQ1 and RQ2.
6.2. Ablation Studies / Parameter Analysis
The authors conducted ablation studies to understand the contribution of each component of TraFaultDia and the effect of key parameters.
- Contribution of AttenAE (Multi-modal Fusion):
  - The comparison between TraFaultDia and the span-only AttenAE alternative, OnlySpan (Tables 3 and 4), highlights the significant contribution of AttenAE's multi-modal data fusion. TraFaultDia consistently outperforms OnlySpan by approximately 10% across all E1-E4 setups. This demonstrates that incorporating logs alongside spans and effectively fusing them via AttenAE is crucial for the overall effectiveness of the framework.
- Contribution of Transformer-Encoder and Multi-Head Attention in AttenAE:
  - The performance difference between TraFaultDia and GluAE / LinearAE (Tables 3 and 4) showcases the superiority of TraFaultDia's Multi-Head Attention mechanism for fusion over simpler linear or GLU-based alternatives, particularly in the more challenging OnlineBoutique (E2, E4) scenarios.
  - The poor performance of the three alternatives that replace the Transformer-Encoder with other base models (Tables 3 and 4) confirms the importance of the Transformer-Encoder (TE) as the base model for TEMAML. TE's self-attention mechanism is better suited for processing the latent trace representations than simpler models or traditional sequential models.
- Ablation Study on Number of Meta-Training Tasks: The following are the results from Table 11 of the original paper:

| Number of tasks | E1 5-shot | E1 10-shot | E2 5-shot | E2 10-shot | E3 5-shot | E3 10-shot | E4 5-shot | E4 10-shot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Current: 4 | 92.91 | 93.26 | 82.50 | 85.20 | 86.35 | 92.19 | 82.37 | 84.77 |
| 3 | 87.67 | 90.00 | 80.53 | 82.13 | 85.03 | 89.07 | 81.92 | 82.67 |
| 2 | 88.17 | 89.33 | 78.67 | 78.10 | 83.11 | 85.78 | 74.67 | 77.33 |
An ablation study was conducted to assess the impact of the number of meta-training tasks on TEMAML's performance.
- The results in Table 11 show that using 4 meta-training tasks (the "Current" setting) consistently yields the best average accuracy across all 50 meta-testing tasks in each experiment, compared to using 2 or 3 meta-training tasks.
- This finding aligns with the nature of MAML as a multi-task learning algorithm: providing a greater variety of meta-training tasks (up to a certain point) enhances the algorithm's ability to learn robust, generalizable parameters. This improved meta-learning capability allows for more effective adaptation to new, unseen tasks in diverse contexts, confirming that increasing the diversity of learning experiences during meta-training can improve TraFaultDia's robustness.
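The reading of Table 11 can be verified with a few lines of arithmetic. The sketch below uses only the accuracies from the table; the names `TABLE_11`, `COLUMNS`, and the helper functions are illustrative and do not come from the paper.

```python
# Accuracies (%) copied from Table 11, keyed by number of meta-training tasks.
COLUMNS = [
    "E1 5-shot", "E1 10-shot", "E2 5-shot", "E2 10-shot",
    "E3 5-shot", "E3 10-shot", "E4 5-shot", "E4 10-shot",
]
TABLE_11 = {
    4: [92.91, 93.26, 82.50, 85.20, 86.35, 92.19, 82.37, 84.77],
    3: [87.67, 90.00, 80.53, 82.13, 85.03, 89.07, 81.92, 82.67],
    2: [88.17, 89.33, 78.67, 78.10, 83.11, 85.78, 74.67, 77.33],
}


def best_task_count_per_column(table=TABLE_11):
    """For each experiment column, return the task count with the highest accuracy."""
    return {
        col: max(table, key=lambda n: table[n][i])
        for i, col in enumerate(COLUMNS)
    }


def mean_accuracy(table=TABLE_11):
    """Average each row over the eight experiment columns."""
    return {n: sum(row) / len(row) for n, row in table.items()}


def non_monotonic_columns(table=TABLE_11):
    """Columns where 2 meta-training tasks score higher than 3."""
    return [col for i, col in enumerate(COLUMNS) if table[2][i] > table[3][i]]


if __name__ == "__main__":
    print(best_task_count_per_column())
    print(mean_accuracy())
    print(non_monotonic_columns())
```

Four tasks win in every column, and the row averages fall from about 87.4 (4 tasks) to 84.9 (3 tasks) to 81.9 (2 tasks). The trend is not strictly monotonic in every column: in E1 5-shot, 2 tasks (88.17) slightly outperform 3 tasks (87.67).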
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces TraFaultDia, an innovative AIOps framework designed for the automatic categorization of abnormal traces into specific fault types within Microservice-Based Systems (MSS). By addressing the critical gap where existing AIOps methods detect anomalies and locate services but fall short of diagnosing root causes, TraFaultDia significantly reduces human effort in Root Cause Analysis (RCA). The framework's core strength lies in its two main components:
- AttenAE (Multi-Head Attention Autoencoder): Learns, without supervision, to fuse high-dimensional, multi-modal trace-related data (from both spans and logs) into compressed yet effective trace representations. This ensures both efficiency and effectiveness in handling complex trace data.
- TEMAML (Transformer-Encoder based Model-Agnostic Meta-Learning): Leverages meta-learning to provide few-shot learning capabilities, allowing rapid adaptation to new and rare fault categories with only a few labeled instances. Crucially, it also enables robust cross-system adaptability, meaning TraFaultDia can effectively categorize faults in new, unseen MSS even if trained on a different system.

Evaluated on two benchmark MSS, TrainTicket and OnlineBoutique, using open datasets, TraFaultDia demonstrated strong performance. It achieved high average accuracies (e.g., 93.26% for TrainTicket within-system and 92.19% for TrainTicket cross-system with 10 labeled instances), validating its effectiveness and rapid adaptability in both within-system and cross-system contexts. These findings underscore TraFaultDia's potential to automate a crucial diagnostic step in AIOps, leading to faster resolutions, reduced downtime, and improved operational efficiency in complex MSS.
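To make the few-shot setting concrete, here is a minimal sketch of how one N-way K-shot classification task might be sampled from labeled abnormal traces. It is an illustration under stated assumptions, not the authors' pipeline: the function `sample_episode`, the `n_query` size, and the toy trace identifiers are all hypothetical.

```python
import random


def sample_episode(traces_by_category, n_way=5, k_shot=10, n_query=5, rng=None):
    """Sample one N-way K-shot task: a support set and a disjoint query set.

    traces_by_category maps a fault-category name to a list of trace items.
    n_query (query instances per category) is an assumed value for illustration.
    """
    rng = rng or random.Random()
    # Pick the fault categories that make up this task.
    categories = rng.sample(sorted(traces_by_category), n_way)
    support, query = [], []
    for label, category in enumerate(categories):
        # Draw K support and n_query query traces without replacement.
        picked = rng.sample(traces_by_category[category], k_shot + n_query)
        support += [(trace, label) for trace in picked[:k_shot]]
        query += [(trace, label) for trace in picked[k_shot:]]
    return categories, support, query


if __name__ == "__main__":
    # Toy data: 30 hypothetical categories with 20 trace ids each.
    data = {f"fault_{c:02d}": [f"trace_{c:02d}_{i}" for i in range(20)] for c in range(30)}
    cats, support, query = sample_episode(data, rng=random.Random(0))
    print(cats, len(support), len(query))
```

In this setup the support set is what the model adapts on and the query set is what accuracy is measured on, so the two must not share traces.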
7.2. Limitations & Future Work
The authors acknowledge several limitations in their current study, primarily related to the datasets and the scope of evaluation, and suggest future research directions:
- Dataset Limitations:
  - Source and Applicability: The evaluation relies on open datasets (DeepTraLog and Nezha) from benchmark MSS. The authors note that using datasets from real-world industrial systems would significantly enhance the applicability and relevance of their findings, better reflecting actual operational environments. Limited resources prevented access to such comprehensive real-world data.
  - Trace-Related Metrics: The current trace representations do not include performance metrics (e.g., CPU, memory usage, network traffic). The authors suggest that incorporating such metrics could improve the classification accuracy for performance-related anomalies (e.g., CPU contention, network delay), which were identified as areas of lower accuracy in their exploration results. While Nezha provides some of these metrics, other open datasets are lacking.
  - Limited Fault Type and MSS Diversity: The study used abnormal traces from 30 and 32 fault categories for TrainTicket and OnlineBoutique, respectively, and adopted a 5-way setup. While MAML is known to be effective for more complex multi-class classification tasks (e.g., 20-way, 50-way), the limited diversity of available open datasets restricted the scope. The authors aim to expand evaluation to include a broader range of fault types and MSS from additional datasets, which they were unable to find for this study.
- Impact of Limitations: These external threats (dataset limitations) pose the possibility that TraFaultDia's generalization capabilities might not be fully utilized or demonstrated. Addressing them could further enhance the precision and robustness of trace-level RCA in MSS.
- Future Work:
  - Improving Generalizability, Scalability, and Interpretability: Future efforts will focus on enhancing these aspects of TraFaultDia.
  - Real-World Industrial MSS Evaluation: The authors are actively working on selecting and deploying real-world industrial MSS to generate new datasets for further testing and validation of TraFaultDia.
7.3. Personal Insights & Critique
This paper presents a highly relevant and innovative solution to a pressing problem in AIOps for MSS. The shift from mere anomaly detection/localization to automatic fault categorization is a significant step towards truly autonomous IT operations.
Strengths:
- Addresses a Clear Gap: The paper effectively identifies and addresses a critical missing piece in the AIOps puzzle: automated RCA for fault types. This has immense practical value for organizations managing complex MSS.
- Intelligent Use of Meta-Learning: Meta-learning is an excellent choice for MSS due to their inherent heterogeneity and dynamic nature. The ability to adapt quickly to new systems and rare fault categories with few labels is a strong selling point for real-world adoption.
- Multi-Modal Data Fusion: The AttenAE's approach to fusing diverse span and log attributes is robust. Recognizing the importance of logs (often overlooked by sequence-based trace analysis) significantly enhances diagnostic power. The unsupervised nature of AttenAE also makes it practical, as unlabeled data is abundant.
- Computational Efficiency: By avoiding computationally expensive GNN constructions and frequent updates, TraFaultDia offers a more scalable solution for dynamic MSS.
- Rigorous Evaluation: The defined research questions (within-system and cross-system) and the comprehensive set of baselines (including ablation studies) provide a thorough assessment of the framework's performance.
Potential Issues, Unverified Assumptions, and Areas for Improvement:
- Defining Fault Categories: While the paper states TraFaultDia's use cases are scalable depending on "how fault categories are built," the process of defining and maintaining these categories in a rapidly evolving MSS can still be a significant challenge. The paper assumes these categories are well-defined (e.g., tied to faulty components and root causes). In practice, this manual definition process for meta-training data might require substantial initial human effort, even if the classification becomes automated later. Future work could explore semi-supervised or unsupervised approaches to discover fault categories.
- Interpretability of AttenAE Representations: The paper focuses on the effectiveness of the compressed trace representations. While the Multi-Head Attention provides some insight into feature relevance during fusion, the resulting latent space representations are not directly interpretable. For RCA, especially for human practitioners, interpretability of why a trace falls into a category could be crucial. Incorporating XAI (Explainable AI) techniques could be a valuable addition.
- MAML First-Order Approximation: The paper uses a first-order approximation in the MAML outer loop for computational efficiency. While common, this can sometimes lead to slightly suboptimal performance compared to full second-order MAML (which computes Hessian-vector products). Exploring whether more advanced MAML variants or meta-optimizers could yield further gains, especially in complex cross-system scenarios, might be beneficial if computational resources permit.
- Impact of BERT Model Choice: The neural representation relies on BERT. The choice of BERT variant (e.g., base, large, domain-specific) and its pre-training data could impact performance. Robustness to different NLP models or even lightweight alternatives for resource-constrained environments could be explored.
- Generalizability of "Sufficient Unlabeled Traces": The AttenAE requires "sufficient unlabeled traces" for training. The definition of "sufficient" can vary widely. In real-world MSS, trace data volume can be immense, but also sparse for certain less-frequented paths. The impact of AttenAE training data volume and diversity on downstream TEMAML performance could be investigated further.
- Memory Footprint: While training and testing time are evaluated for efficiency, the memory footprint of the AttenAE and TEMAML models, especially when dealing with high-dimensional trace representations and potentially large meta-learning support/query sets, could be a practical concern for deployment in production AIOps environments.
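The first-order approximation raised in the list above can be illustrated on a toy problem. The sketch below is not TEMAML: it meta-learns a single scalar weight for 1-D linear regression tasks, and every name, learning rate, and task definition in it is an assumption chosen for illustration. It shows the one step that distinguishes first-order MAML: the query-set gradient is evaluated at the adapted weight and applied directly to the meta-parameter, without differentiating through the inner update.

```python
import random


def mse_grad(w, data):
    """Gradient of mean squared error for the scalar model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def adapt(w, support, inner_lr=0.05, steps=1):
    """Inner loop: a few gradient steps on the task's support set."""
    for _ in range(steps):
        w = w - inner_lr * mse_grad(w, support)
    return w


def fomaml_step(w, tasks, inner_lr=0.05, outer_lr=0.05):
    """One outer-loop update using the first-order approximation."""
    meta_grad = 0.0
    for support, query in tasks:
        w_adapted = adapt(w, support, inner_lr)
        # First-order shortcut: use the query gradient at w_adapted as-is,
        # ignoring the second-order term d(w_adapted)/dw.
        meta_grad += mse_grad(w_adapted, query)
    return w - outer_lr * meta_grad / len(tasks)


def make_task(slope, rng, k=5):
    """Toy task: noiseless points on y = slope * x, split into support/query."""
    def sample(n):
        return [(x, slope * x) for x in (rng.uniform(-1, 1) for _ in range(n))]
    return sample(k), sample(k)


def train(slopes=(1.0, 2.0, 3.0, 4.0), iterations=500, seed=0):
    """Meta-train on four toy tasks (mirroring only the task count used in the paper)."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(iterations):
        tasks = [make_task(s, rng) for s in slopes]
        w = fomaml_step(w, tasks)
    return w


if __name__ == "__main__":
    w_meta = train()
    support, _ = make_task(4.0, random.Random(1))
    print(w_meta, adapt(w_meta, support, steps=20))
```

Full second-order MAML would additionally backpropagate through `adapt`, which is where the Hessian-vector products and the extra compute come from.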
Transferability to Other Domains:
The core methodology of TraFaultDia — combining unsupervised multi-modal data fusion with meta-learning for few-shot, cross-domain classification — is highly transferable.
- Log Anomaly Categorization: Beyond traces, this framework could be adapted to categorize raw system logs into fault types across different software systems or versions.
- IoT Device Fault Diagnosis: In diverse IoT environments, devices generate various sensor data and event logs. TraFaultDia could be used to categorize device malfunctions across different models or deployments with limited labeled fault data.
- Healthcare Diagnostics: Adapting to new diseases or patient cohorts with few labeled examples from multi-modal patient data (e.g., medical images, electronic health records, lab results).
- Manufacturing Quality Control: Classifying product defects based on multi-modal inspection data (e.g., visual, vibrational, acoustic) across different product lines or manufacturing batches.

Overall, TraFaultDia represents a robust and forward-thinking approach that holds significant promise for enhancing the autonomy and efficiency of AIOps in complex, dynamic software systems.