# TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State
## TL;DR Summary
TraStrainer adaptively samples distributed traces by integrating system runtime state and trace diversity, improving sampling quality and root cause analysis accuracy by over 30% while reducing overhead.
## Abstract
Distributed tracing has been widely adopted in many microservice systems and plays an important role in monitoring and analyzing the system. However, trace data often come in large volumes, incurring substantial computational and storage costs. To reduce the quantity of traces, trace sampling has become a prominent topic of discussion, and several methods have been proposed in prior work. To attain higher-quality sampling outcomes, biased sampling has gained more attention compared to random sampling. Previous biased sampling methods primarily considered the importance of traces based on diversity, aiming to sample more edge-case traces and fewer common-case traces. However, we contend that relying solely on trace diversity for sampling is insufficient; system runtime state is another crucial factor that needs to be considered, especially in cases of system failures. In this study, we introduce TraStrainer…
# 1. Bibliographic Information
## 1.1. Title
TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State
## 1.2. Authors
Haiyu Huang (Sun Yat-sen University, China); Xiaoyu Zhang (Huawei, China); Pengfei Chen, Zilong He, Zhiming Chen, Guangba Yu, and Hongyang Chen (Sun Yat-sen University, China); Chen Sun (Huawei, China)
## 1.3. Journal/Conference
Proc. ACM Softw. Eng. 1, FSE, Article 22 (July 2024), 21 pages. The publication venue, FSE (ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering), is a highly reputable and influential conference in the field of software engineering. Its selective nature ensures that accepted papers represent significant contributions and high-quality research.
## 1.4. Publication Year
2024
## 1.5. Abstract
Distributed tracing is essential for monitoring microservice systems but generates large volumes of data, causing high computational and storage costs. Existing sampling methods often rely on trace diversity to prioritize sampling, which may overlook important system runtime conditions, especially during failures. We propose TraStrainer, an online sampling framework integrating both system runtime state and trace diversity. TraStrainer encodes traces into vectors and adaptively adjusts sampling preferences by analyzing real-time system metrics. It combines system-state bias and diversity bias through a dynamic voting mechanism to select high-value traces. Experimental evaluation shows TraStrainer significantly improves sampling quality and downstream root cause analysis, achieving a 32.63% average increase in Top-1 accuracy over four baselines on two datasets, demonstrating its effectiveness in producing actionable trace data while reducing overhead.
## 1.6. Original Source Link
/files/papers/6901a5fc84ecf5fffe471766/paper.pdf (local PDF copy of the published paper)
# 2. Executive Summary
## 2.1. Background & Motivation
The core problem the paper aims to solve is the high computational and storage costs associated with distributed tracing in microservice systems. Distributed tracing, using frameworks like OpenTelemetry or SkyWalking, provides invaluable insights into system behavior, aids in anomaly detection, and assists in root cause analysis (RCA). However, the sheer volume of trace data generated by large-scale microservice systems (millions to billions daily) makes storing and analyzing all of it impractical and expensive.
This problem is important because effective monitoring and troubleshooting are critical for maintaining the reliability and performance of complex, dynamic microservice environments. While trace sampling techniques exist to reduce data volume by selectively capturing traces, existing methods face significant challenges:
* **Limited Scope of Existing Biased Sampling:** Previous `biased sampling` approaches primarily focus on `trace diversity`, aiming to capture `edge-case` (uncommon) traces while discarding `common-case` ones. This is based on the intuition that `edge-case` traces are more informative.
* **Neglect of System Runtime State:** A critical gap is that these methods often disregard the `system runtime state`. The paper argues that the definition of a "valuable trace" can change dramatically depending on whether the system is operating normally or experiencing a failure. For instance, `common-case` traces accessing a database might become highly valuable if the database server experiences an `exception` or `high CPU utilization`.
* **Inadequacy for Downstream Analysis:** Solely favoring `edge-case` traces is insufficient because `common-case` traces can also be related to root causes, or are necessary for `RCA algorithms` to learn normal patterns or perform `spectrum analysis`. When `anomaly rates` are high, existing methods might still miss the most `significant abnormal traces` due to budget constraints, lacking a mechanism to prioritize among numerous anomalies.

The paper's entry point and innovative idea is to propose `TraStrainer`, an online biased `trace sampler` that explicitly considers both system runtime state and trace diversity to adaptively determine `sampling preferences`. This allows for a more comprehensive and context-aware selection of valuable traces.
## 2.2. Main Contributions / Findings
The paper makes several primary contributions:
* **A Novel Online Biased Trace Sampler:** Proposes `TraStrainer`, an online `biased trace sampler` that innovatively integrates `system runtime state` with `trace diversity` to adaptively determine `sampling preferences`. This provides a more comprehensive approach than existing methods.
* **Interpretable and Automated Trace Representation:** Introduces an `interpretable` and `automated method` for `trace encoding`. The generated `trace vector` captures both `structural` and `status` information, with each `system metric` corresponding to a specific dimension, allowing for dynamic expansion and reduction.
* **Impact on Downstream Root Cause Analysis:** Systematically investigates the impact of different `sampling methods` on `downstream root cause analysis (RCA) tasks` by combining `TraStrainer` with several `state-of-the-art trace-based RCA approaches`.
* **Empirical Validation and Superior Performance:** Implements `TraStrainer` and constructs two diverse `datasets` (one from 13 real-world production `microservice systems`, another from the `OnlineBoutique` and `TrainTicket` benchmarks) to validate its `sampling quality`, `effectiveness` in improving `downstream analysis tasks`, and `efficiency`.

The key conclusions and findings are:

* `TraStrainer` significantly improves `sampling quality`, identifying more valuable traces within the same budget compared to four baseline methods.
* It substantially enhances the performance of `downstream root cause analysis (RCA)` tasks, achieving an average increase of 32.63% in Top-1 RCA accuracy compared to four baselines across two datasets.
* The combination of `system runtime state` and `trace diversity` is crucial, with `TraStrainer` outperforming variants that consider only one factor. Prioritizing `problem-related traces` (via system state) is particularly impactful for analysis tasks.
* `TraStrainer` demonstrates higher `efficiency` than other `online biased sampling methods`, with low sampling latencies suitable for online deployment and effective `dimension reduction strategies`.

These findings directly address the problem of high data volume by providing a smarter sampling mechanism that retains more actionable data for analysis.
# 3. Prerequisite Knowledge & Related Work
## 3.1. Foundational Concepts
To fully grasp the TraStrainer paper, a novice reader should understand the following fundamental concepts:
* **Microservice Systems:** A software architecture where complex applications are composed of small, independent processes communicating with each other. Each service is self-contained and can be developed, deployed, and scaled independently. This distributed nature makes monitoring challenging.
* **Distributed Tracing:** A method used to monitor requests as they flow through various services in a `microservice system`.
    * **Trace:** Represents the end-to-end path of a single request through multiple services.
    * **Span:** A logical unit of work within a trace, representing an operation or a call to a service. Each `span` has a unique ID, a parent `span` ID (for hierarchical relationships), start/end timestamps, a service name, an operation name, and can include `event annotations` (key-value pairs with additional information like errors or business-specific data).
    * **Trace Context:** Information (like `TraceID` and `SpanID`) that is propagated across service boundaries to link `spans` together into a complete `trace`.
    * **OpenTelemetry/SkyWalking:** Open-source frameworks that provide APIs, libraries, and agents to instrument, generate, collect, and export `telemetry data` (traces, metrics, logs) from cloud-native applications. They are implementations of `distributed tracing`.
* **System Metrics:** Time-series data reflecting the runtime state and performance of system components (e.g., CPU utilization, memory usage, disk I/O, network latency, request rates, error rates). These `metrics` are crucial for identifying `anomalies` and understanding system health.
    * **Time-series data:** A sequence of data points indexed in time order.
    * **Anomaly Detection:** The process of identifying data points, events, or observations that deviate significantly from the majority of the data, indicating unusual behavior.
* **Trace Sampling:** The process of selectively capturing a subset of `traces` to reduce the volume of data stored and processed, while still aiming to retain enough information for monitoring and analysis. (A short sketch contrasting the two sampling styles follows this list.)
    * **Head-based Sampling:** The decision to sample a `trace` is made at the very beginning of the request (at the "head" of the trace). This is simple but cannot use information from later `spans` (e.g., whether an error occurred).
    * **Tail-based Sampling (Biased Sampling):** The decision to sample a `trace` is made after the entire `trace` has completed (at the "tail"). This allows the sampling logic to use all available information within the `trace` (e.g., latency, error status, specific operations) to determine its "value" or "importance," enabling `biased sampling` towards `valuable traces`.
    * **Budget Sampling Rate:** The target percentage or proportion of `traces` that should be sampled and retained.
* **Root Cause Analysis (RCA):** The process of identifying the underlying reasons for a problem or an `anomaly` in a system. In `microservice systems`, this often involves pinpointing which service, host, or component is responsible for a detected issue based on `trace data` and `metrics`.
* **Machine Learning/Statistical Concepts:**
    * **Vector Representation/Embedding:** Transforming complex data (like a `trace`) into a numerical vector in a high-dimensional space. This allows computational models to process and compare `traces` mathematically.
    * **Clustering:** An `unsupervised machine learning` task that groups a set of objects such that objects in the same group (a `cluster`) are more similar to each other than to those in other groups.
    * **Jaccard Similarity:** A statistic used for comparing the similarity and diversity of sample sets. It measures the size of the intersection divided by the size of the union of the sample sets.
    * **Time Series Forecasting (TSF):** The use of models to predict future values based on previously observed values of a `time series`.
    * **Neural Network Models:** A type of `machine learning` model inspired by the structure of the human brain, capable of learning complex patterns from data. `LSTM` (Long Short-Term Memory) and `Transformer` are advanced `neural network` architectures used for `sequence data`, while `linear models` are simpler forms.
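To make the head-based vs. tail-based distinction concrete, here is a minimal sketch (not from the paper; the field names, thresholds, and boost factor are illustrative assumptions). A head-based sampler must decide blindly, while a tail-based sampler can inspect the completed trace:

```python
# Illustrative contrast between head-based and tail-based sampling decisions.
# The span field names ("status", "duration_ms") and the x10 boost are
# hypothetical; real tail-based samplers use richer value definitions.
import random

def head_based_decision(rate: float) -> bool:
    # Decided at the head of the request: no span information exists yet,
    # so the only possible policy is a fixed probability.
    return random.random() < rate

def tail_based_decision(trace: list[dict], rate: float) -> bool:
    # Decided after the trace completes: latency and error status can bias
    # the decision towards "valuable" traces.
    has_error = any(span.get("status") == "ERROR" for span in trace)
    total_latency = sum(span.get("duration_ms", 0) for span in trace)
    boosted = rate * 10 if (has_error or total_latency > 1000) else rate
    return random.random() < min(boosted, 1.0)
```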
## 3.2. Previous Works
The paper primarily focuses on trace sampling and its impact on root cause analysis (RCA).
**Trace Sampling Approaches:**

* **Uniform Random Sampling:** Used by early tracing systems like `Dapper` [37], `Jaeger` [17], and `Zipkin` [2]. The decision is made at the trace head with equal probability.
    * **Limitation:** Fails to guarantee the quality of sampled traces, often missing `valuable` or `anomalous traces`.
* **Biased Sampling (Tail-based):** Aims to select more "important" traces after their completion. Most prior works define importance based on `trace diversity` or `uncommonness`.
    * **HC (Hierarchical Clustering) [20]:** An `offline tail-based sampling` approach that uses `hierarchical clustering` to group traces by `label counting` (e.g., error status, latency bucket) to maximize diversity.
        * **Differentiation from TraStrainer:** `Offline` (not suitable for real-time), relies on `label counting` (potentially limited features), and focuses only on `diversity`.
    * **Sifter [21]:** An `online tail-based sampling` method that approximates the `common-case behavior` of a distributed system and `biases sampling` towards `new traces` that are `poorly represented` by this model (i.e., `uncommon`).
        * **Differentiation from TraStrainer:** `Online`, but focuses solely on `diversity`/`uncommonness` and doesn't consider `system runtime state`.
    * **Sieve [16]:** An `online tail-based sampling` approach that uses `Robust Random Cut Forest (RRCF)` to detect `uncommon traces` and samples them with a high probability. It considers both `structure` and `latency`.
        * **Differentiation from TraStrainer:** `Online` and considers more features than `Sifter`, but still primarily `diversity-biased` and does not integrate `system runtime state`.
    * **TraceCRL [51]:** Utilizes `contrastive learning` and `Graph Neural Network (GNN)` techniques to encode `trace data` into `vectors`, enabling `sampling` based on the `diversity` of these `vector representations`.
    * **SampleHST [11]:** `Biases sampling` towards `edge-cases` based on the distribution of `mass scores` obtained from a `forest of Half Space Trees (HST)`.
    * **Hindsight [52]:** Introduces `retroactive sampling`, a technique that `lazily retrieves trace data` to perform `biased sampling` at earlier stages of the `trace lifecycle`. This aims to improve the `efficiency` of `biased sampling` by not waiting for full trace completion before making preliminary decisions, though the final decision might still be `tail-based`.
        * **Differentiation from TraStrainer:** While `Hindsight` improves `efficiency`, it still primarily focuses on `edge-cases` (diversity) as its `sampling preference`, similar to other `biased sampling` methods.

**Trace-based Analysis Approaches (RCA):**

* **TraceAnomaly [26]:** Leverages `deep learning` (specifically `Variational Autoencoders`, VAE) to learn `normal trace patterns` offline, then `detects abnormal traces` and `identifies root causes` online.
    * **Differentiation from TraStrainer (as an RCA method):** `TraceAnomaly` is an `RCA method`, while `TraStrainer` is a `sampling method` that feeds `RCA methods`. `TraceAnomaly` needs `normal traces` to learn patterns, which `diversity-biased samplers` might undersample.
* **TraceRCA [23]:** Utilizes `spectrum analysis` to identify `root cause services` by analyzing the `proportion of normal and abnormal invocations`.
    * **Differentiation from TraStrainer:** An `RCA method`. Its reliance on `spectrum analysis` (comparing normal vs. abnormal) means it benefits from a balanced view of both `common` and `uncommon traces`.
* **MicroRank [45]:** Identifies and ranks `root causes` by combining a `personalized PageRank method` and `spectrum analysis`.
    * **Differentiation from TraStrainer:** An `RCA method`. Similar to `TraceRCA`, it relies on a nuanced view of `trace patterns`.
* **Sage [10]:** Utilizes `causal Bayesian networks` and `graphical variational autoencoders` to pinpoint `root cause microservices`.
* **FSF [35]:** Utilizes knowledge of `failure propagation` and the `client-server communication model` to deduce `root causes`.
## 3.3. Technological Evolution
The evolution of trace sampling has progressed from simple uniform random sampling to more sophisticated biased sampling techniques. Initially, the primary concern was just reducing data volume. As microservice systems grew in complexity and the analytical value of traces became clearer, researchers sought ways to intelligently select traces that were most "interesting" or "informative." This led to the development of tail-based sampling, which could leverage the full context of a trace.
Early biased sampling methods (like HC, Sifter, Sieve) focused heavily on trace diversity and uncommonness. The rationale was that edge cases and anomalies are often the most valuable for debugging and understanding system behavior. Techniques like clustering or random cut forests were employed to identify these rare patterns.
However, a limitation in this evolution was the implicit assumption that "valuable" always meant "diverse" or "uncommon." The TraStrainer paper marks a significant step by introducing the concept that the "value" of a trace is not static but dynamically influenced by the broader system runtime state. It integrates real-time system metrics as a crucial context, suggesting that common-case traces can become highly relevant during specific system failures if they interact with an affected component. This represents a shift from purely trace-centric importance to a holistic system-aware importance definition for sampling. The idea of retroactive sampling (Hindsight) also shows an ongoing effort to improve the efficiency of biased sampling without sacrificing quality.
## 3.4. Differentiation Analysis
Compared to the main methods in related work, TraStrainer's core differences and innovations are:
* **Multi-Modal Sampling Preference (System State + Trace Diversity):**
    * **Previous methods (HC, Sifter, Sieve, TraceCRL, SampleHST):** Primarily rely on `trace diversity` or `uncommonness` as the sole criterion for `sampling preference`. They analyze the `structure` or `latency` within the `trace` itself to identify `edge-cases`.
    * **TraStrainer:** Uniquely combines `system runtime state` (derived from `real-time system metrics`) with `trace diversity`, so it can dynamically adjust its `sampling preferences`. For example, it can prioritize `common-case traces` interacting with a `database` if `database metrics` show an `anomaly`.
* **Adaptive and Dynamic Preference Adjustment:**
    * **Previous methods:** `Sampling preferences` for `uncommon traces` are relatively static or based on `historical trace patterns`. They lack the ability to react to immediate `system-wide issues` that might make otherwise `common traces` suddenly critical.
    * **TraStrainer:** Uses a `System Bias Extractor` to `adaptively determine sampling preferences` by analyzing `real-time fluctuations` in `system metrics`. This allows it to dynamically shift its `bias` based on what the system is currently struggling with.
* **Interpretable and Automated Trace Encoding:**
    * **Previous methods:** `Trace encoding` often involves manual feature engineering or less `interpretable vector representations`.
    * **TraStrainer:** Proposes an `automated trace encoding method` guided by `system metrics`, ensuring `interpretability` of `vector dimensions`. Each dimension corresponds to a specific `system metric`, making it easier to understand why a `trace` is considered `valuable`. It also includes `dimension scalability` and `reduction strategies`.
* **Dynamic Voting Mechanism for Combined Decision:**
    * **Previous methods:** Do not have a mechanism to combine different types of `sampling biases` under a `sampling budget`.
    * **TraStrainer:** Introduces a `Composite Sampler` with a `dynamic voting mechanism` (`AND`/`OR` gate) that adjusts based on the `current sampling frequency` relative to the `budget`. This allows for a flexible balance between different `biases` while adhering to budget constraints.
* **Improved Downstream RCA Performance:**
    * **Previous methods:** While aiming for `quality traces`, their `diversity-only bias` can sometimes hinder `RCA methods` that require `normal traces` (e.g., for `spectrum analysis`) or a specific type of `problem-related trace` that might not be `uncommon` in its structure.
    * **TraStrainer:** By incorporating `system state`, it ensures that `traces related to ongoing problems` are captured, leading to significantly better performance in `downstream RCA tasks` by providing more `actionable data`.

In essence, `TraStrainer` shifts the paradigm from merely identifying `diverse` or `uncommon traces` to identifying `contextually valuable traces` by actively incorporating `system health` information, thus providing a more intelligent and effective `sampling strategy` for complex `microservice environments`.
# 4. Methodology
## 4.1. Principles
The core idea behind TraStrainer is to implement biased trace sampling in a more comprehensive way than previous approaches. It addresses the limitation of existing methods that primarily focus on trace diversity by explicitly incorporating the system runtime state. The theoretical basis is that the "value" or "importance" of a trace is not static; it dynamically changes based on real-time system conditions and its structural and behavioral characteristics. Therefore, to capture higher-quality traces that are most useful for monitoring and root cause analysis, a sampling preference must consider both system anomalies and trace uniqueness.
TraStrainer is an adaptive online biased trace sampler. This means it operates continuously, making sampling decisions in real-time as traces arrive, and it can adjust its behavior based on observed system dynamics and trace patterns.
## 4.2. Core Methodology In-depth (Layer by Layer)
The overall architecture of TraStrainer is illustrated in Figure 4 of the original paper, comprising two main phases: runtime data preprocessing and comprehensive sampling.
The following figure (Figure 4 from the original paper) shows an overview of TraStrainer:
### 4.2.1. Runtime Data Preprocessing Phase
This phase prepares the incoming trace and system metrics for sampling decisions.
#### 4.2.1.1. Trace Encoder
The Trace Encoder module is responsible for converting raw distributed traces into a machine-readable vector representation. This is crucial because raw traces are complex, tree-like structures that cannot be directly processed by numerical sampling algorithms. A key design goal is interpretability and automation, meaning the encoding should be understandable and adapt to new metrics or trace structures without manual intervention.
The encoding process is divided into two parts: status-related and structure-related encoding.
The following figure (Figure 5 from the original paper) shows the process of encoding traces in two parts: status-related and structure-related:
* **Encoding the Status-Related Part of the Trace:** When the `trace structure` is disregarded, a trace $t$ can be viewed as a "bag of spans" $SB(t)$, where each `span` represents a unit of work. Each `span` contains information like `duration`, `status code`, `associated node or pod`, and `event annotations`. Instead of relying on manual feature selection from `event annotations`, `TraStrainer` leverages `system metrics` to automatically determine relevant features. (A minimal computational sketch follows this list.)

    **Definition 1 (Related Sub Span Bag $SB_m(t)$):** For a metric $m$ (with attributes $m.node$ and $m.type$) and a trace $t$, let $SB(t)$ be the `span bag` of $t$. We use $SB_m(t)$ to denote the `related sub span bag` of $t$, where every span $s_{mi} \in SB_m(t)$ must satisfy $s_{mi}.node = m.node$ and that $m.type$ is related to $s_{mi}$.

    * **Explanation:** This definition identifies all `spans` within a `trace` that are relevant to a specific `system metric`. For example, if $m$ is "CPU usage on Node A," then $SB_m(t)$ would contain all `spans` from trace $t$ that executed on `Node A` and whose `annotations` (e.g., a `SQL query`) are related to the `CPU usage` metric type.

    **Definition 2 (Feature Value $fv_m(t)$):** For the `related sub span bag` $SB_m(t)$ of metric $m$, we use $AS_m(t)$ to denote the abnormal spans in $SB_m(t)$, i.e., every span in $AS_m(t)$ is `abnormal`. The `feature value` of $t$ for $m$ is calculated as follows:

    $$fv_m(t) = \left( |AS_m(t)| + 1 \right) \times \sum_{s \in SB_m(t)} s.duration$$

    * **Explanation:** $fv_m(t)$ quantifies the "impact" of a `trace` on a specific `metric`.
        * $|AS_m(t)|$: the number of `abnormal spans` within the `related sub span bag`, capturing the severity of errors. Adding 1 ensures that even if there are no `abnormal spans`, this factor is at least 1, preventing the total duration from being nullified.
        * $\sum_{s \in SB_m(t)} s.duration$: the sum of `durations` of all `spans` in $SB_m(t)$, representing the total time the `trace` spends interacting with the component relevant to metric $m$.
        * The product of these two terms gives a combined measure of the `trace's` significance with respect to metric $m$. This value becomes one dimension of the `trace's` vector representation.

* **Encoding the Structure-Related Part of the Trace:** The `trace structure`, an `invocation tree`, shows the `order` and `hierarchy` of `service operations`. Since `asynchronous calls` can change the order of `spans` at the same level, the focus is on `invocation hierarchy`.
    * Each layer of the `invocation tree` is encoded as a feature.
    * Each `span` in a layer is represented by its `parent span`, `method name`, and `argument` (pma).
    * If a `trace` doesn't have a `span` at a certain `depth`, that position is filled with `null`.

* **Dimension Scalability and Reduction:**
    * **Scalability:** The `Trace Encoder` automatically expands `vector dimensions` when new `metrics` are added or new `trace structures` occur, by identifying `related sub span bags` and performing statistics.
    * **Reduction:** To prevent `dimension explosion`:
        * For `status-related` dimensions: if two `metrics` consistently have the same values for their dimensions over a period, those dimensions are `merged`.
        * For `structure-related` dimensions: if `dimensions` corresponding to deeper `call hierarchies` no longer occur in recent `traces`, those `dimensions` are `removed`.
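The following is a minimal sketch of the status-related encoding defined above. The span/metric data model and the `is_related` rule are simplifying assumptions for illustration, not the paper's exact implementation:

```python
# Minimal sketch of the status-related trace encoding (Definitions 1 and 2).
# The Span/Metric fields and the relatedness rule are assumed for illustration.
from dataclasses import dataclass

@dataclass
class Span:
    node: str          # node/pod the span ran on
    kind: str          # e.g., "sql", "http", "cpu-bound"
    duration_ms: float
    is_abnormal: bool  # e.g., error status code

@dataclass
class Metric:
    node: str          # m.node
    type: str          # m.type, e.g., "cpu_usage"

def is_related(span: Span, metric: Metric) -> bool:
    # Assumed relatedness rule: same node, and the span's kind maps to the metric type.
    related_kinds = {"cpu_usage": {"sql", "cpu-bound"}, "net_latency": {"http"}}
    return span.node == metric.node and span.kind in related_kinds.get(metric.type, set())

def feature_value(trace: list[Span], metric: Metric) -> float:
    """fv_m(t) = (|abnormal related spans| + 1) * sum of related span durations."""
    sub_bag = [s for s in trace if is_related(s, metric)]   # related sub span bag SB_m(t)
    abnormal = sum(1 for s in sub_bag if s.is_abnormal)     # |AS_m(t)|
    total_duration = sum(s.duration_ms for s in sub_bag)
    return (abnormal + 1) * total_duration

def encode_status(trace: list[Span], metrics: list[Metric]) -> list[float]:
    # One interpretable dimension per system metric.
    return [feature_value(trace, m) for m in metrics]
```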
#### 4.2.1.2. System Bias Extractor
The System Bias Extractor module continuously monitors system metrics to identify which ones are currently exhibiting anomalous behavior and thus deserve more attention. It adaptively determines the preference weight for each metric based on its anomaly degree.
* **Assess the Anomaly Degree of Metrics:** Each `system metric` is `time-series data`. The goal is to calculate the `anomaly degree` $\alpha$ of each `metric` online at the current time $c$. This is done by comparing the current actual value with an `expected value` derived from historical data within a `look-back window`.
    * The paper opts for a `neural network model` approach over `statistical methods` (like `boxplot` or `3-sigma detection`) because `neural networks` can better learn `historical waveform patterns` in `time-series data`, which is important for `business operations` with `periodic patterns`.
    * `TraStrainer` uses a modified `DLinear` algorithm [49], a `linear forecasting model` inspired by recent advancements in `time series forecasting`. This model takes a `historical time series data window` as input and outputs the `expected value` for the current time point.
    * The `anomaly degree` is then calculated as the normalized difference between the `actual value` $v_{actual}$ and the `expected value` $v_{expected}$:

    $$\alpha = \frac{\left| v_{actual} - v_{expected} \right|}{\max\left( v_{actual},\, v_{expected} \right)}$$

    * **Explanation:** This formula calculates the `relative deviation` of the `actual metric value` from its `expected value`.
        * $v_{expected}$: the `expected value` of the `metric` at time $c$, predicted by the `DLinear` model.
        * $v_{actual}$: the `actual observed value` of the `metric` at time $c$.
        * $\left| v_{actual} - v_{expected} \right|$: the `absolute difference` between the actual and expected values, indicating the magnitude of the `anomaly`.
        * $\max(v_{actual}, v_{expected})$: normalizes the difference by the larger of the two values, providing a `relative anomaly degree` between 0 and 1. A higher $\alpha$ indicates a greater `anomaly`.

    The following figure (Figure 6 from the original paper) shows an example of assessing the anomaly degree of metrics.

* **Form the Preference Vector:** After obtaining the current `anomaly degree` $\alpha_i$ for each `metric` $i$, this value is directly taken as the `preference score` for that `metric`. All `preference scores` for all $n$ `metrics` at the current moment form the `preference vector`:

    $$\mathcal{P} = \left[ \alpha_1, \alpha_2, \dots, \alpha_n \right]$$

    * **Explanation:** This vector quantifies which `system metrics` are currently `anomalous` and, by extension, which `trace dimensions` (which correspond to these metrics) should be given higher `sampling preference`. A small sketch of this extractor follows this subsection.
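Below is a minimal sketch of the extractor's per-metric computation. A plain moving average stands in for the paper's modified DLinear forecaster (an assumption made for brevity), and the window bookkeeping is simplified:

```python
# Minimal sketch of the System Bias Extractor (Section 4.2.1.2).
# A moving-average forecast stands in for the paper's modified DLinear model.
import numpy as np

def expected_value(history: np.ndarray) -> float:
    # Stand-in forecaster: DLinear in the paper, a moving average here.
    return float(history.mean())

def anomaly_degree(history: np.ndarray, actual: float) -> float:
    """alpha = |actual - expected| / max(actual, expected), in [0, 1]."""
    expected = expected_value(history)
    denom = max(abs(actual), abs(expected), 1e-9)  # guard against div-by-zero
    return abs(actual - expected) / denom

def preference_vector(metric_windows: dict[str, np.ndarray],
                      current: dict[str, float]) -> np.ndarray:
    # One preference score (= anomaly degree) per monitored metric.
    return np.array([anomaly_degree(metric_windows[m], current[m])
                     for m in sorted(metric_windows)])
```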
### 4.2.2. Online Comprehensive Sampling Phase
This phase utilizes the trace vector and preference vector to make actual sampling decisions.
#### 4.2.2.1. System-Biased Sampler
The System-Biased Sampler aims to prioritize traces that are more relevant to the detected system fluctuations (i.e., anomalous metrics). It calculates a sampling probability based on how much a coming trace interacts with the currently anomalous system components.
* **Calculate the Attention Score Vector:** The sampler maintains a `dynamic look-back window` of the $k$ most recent `trace vectors`. Only the `status-related part` of each `trace vector` is considered here. For each dimension $i$ (corresponding to a `system metric`), the `mean` $\mu_i$ and `standard deviation` $\sigma_i$ of the `feature values` in that dimension are calculated over the `previous k trace vectors`. When a new `trace` $t_{k+1}$ arrives with `feature value` $fv_i(t_{k+1})$ for dimension $i$, its `attention score` for that dimension is calculated as:

    $$a_i = \frac{fv_i(t_{k+1}) - \mu_i}{\sigma_i}$$

    * **Explanation:** This formula measures how much the `coming trace` deviates from the `historical average` in `resource utilization` or `impact` for dimension $i$.
        * $fv_i(t_{k+1})$: the `feature value` of the `coming trace` for metric $i$.
        * $\mu_i$: the `mean` of `feature values` for metric $i$ from `traces` in the `look-back window`.
        * $\sigma_i$: the `standard deviation` of `feature values` for metric $i$ from `traces` in the `look-back window`.
    * A higher $a_i$ means the `coming trace` has a more unusual `impact` or `resource utilization` for metric $i$ compared to recent history. These `attention scores` for all dimensions form the `attention score vector` $\mathcal{A} = [a_1, \dots, a_n]$.

* **Calculate the System-Biased Sampling Probability:** A `coming trace` is considered more `valuable` if it has `higher attention scores` in `dimensions` that also have `higher preference scores` (from the `System Bias Extractor`). The `system-biased sampling probability` for the `coming trace` is calculated by taking the `dot product` of its `attention score vector` and the current `preference vector`, and then applying a `tanh` function to map the result into the `[0, 1]` range:

    $$p_s(t_{k+1}) = \tanh\left( \mathcal{P} \cdot \mathcal{A} \right)$$

    * **Explanation:**
        * $\mathcal{P} \cdot \mathcal{A}$: the `dot product` between the `preference vector` and the `attention score vector`. It sums the products of `metric anomaly degrees` and the `trace's unusual impact` on those `metrics`, so a higher dot product means the `trace` is unusually active or impactful in currently `anomalous system areas`.
        * The `tanh` function squashes this unbounded dot product into a probability-like range: $\tanh(0) = 0$ and $\tanh(x) \to 1$ as $x \to \infty$. Strictly, $\tanh$ ranges over $(-1, 1)$; the paper interprets the output as a `probability` in `[0, 1]`, which holds whenever the dot product is non-negative (a trace that is *less* active than average in anomalous dimensions yields a negative value, which can simply be treated as probability 0). A sketch follows this subsection.
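A minimal sketch of the system-biased probability, assuming the status-related vectors produced by the encoder sketch above; clamping negative tanh outputs to 0 is our reading of the probability interpretation, not an explicit step from the paper:

```python
# Minimal sketch of the System-Biased Sampler (Section 4.2.2.1).
import numpy as np

def attention_scores(window: np.ndarray, coming: np.ndarray) -> np.ndarray:
    """Per-dimension z-score of the coming trace vs. the look-back window."""
    mu = window.mean(axis=0)
    sigma = window.std(axis=0) + 1e-9  # avoid division by zero
    return (coming - mu) / sigma

def system_biased_probability(preference: np.ndarray,
                              window: np.ndarray,
                              coming: np.ndarray) -> float:
    """p_s = tanh(P . A), clamped to [0, 1]."""
    a = attention_scores(window, coming)       # attention score vector A
    return max(0.0, float(np.tanh(preference @ a)))
```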
#### 4.2.2.2. Diversity-Biased Sampler
The `Diversity-Biased Sampler` aims to identify `edge-case traces` (uncommon or unique) and assign them a `higher sampling probability`. This is achieved by comparing a `coming trace` to `clusters` of previously observed `traces`.
* **Cluster Traces within the Look-Back Window:**
    To understand the `uncommon degree` of a `coming trace`, `patterns` from `previous traces` are established. `Trace vectors` within the `look-back window` are `clustered`.
    * Assuming clusters $C_1, \dots, C_j$ are obtained, the `mass` of each `cluster` (the `number of traces` it includes) is calculated, denoted as $ma_1, \dots, ma_j$.
* **Calculate the Diversity-Biased Sampling Probability:**
    For the `coming trace` $t_{k+1}$, its `Jaccard similarity` [29] with each `cluster` is calculated. The `cluster` with the highest `similarity` is identified as the `closest cluster`; its `mass` is denoted $ma'_{k+1}$ and the corresponding similarity $si(t_{k+1})$.
    * A smaller $ma'_{k+1}$ (meaning the `closest cluster` is less common) and a smaller `Jaccard similarity` $si(t_{k+1})$ (meaning the `trace` is less similar to even its `closest cluster`) indicate a higher `uncommonness` for $t_{k+1}$.
    * The `diversity-biased sampling probability` is calculated as (a sketch follows this subsection):

    $$p_d(t_{k+1}) = \frac{\dfrac{1}{ma'_{k+1} \cdot si(t_{k+1})}}{\sum_{i=1}^{k+1} \dfrac{1}{ma'_i \cdot si(t_i)}}$$

    * **Explanation:**
        * $ma'_{k+1}$: the `mass` (number of traces) of the `closest cluster` to $t_{k+1}$.
        * $si(t_{k+1})$: the `Jaccard similarity` between $t_{k+1}$ and its `closest cluster`.
        * The term $\frac{1}{ma'_{k+1} \cdot si(t_{k+1})}$ represents the `uncommonness score` of the `coming trace`. A smaller `mass` or `similarity` leads to a larger `uncommonness score`.
        * The denominator is a `normalization factor` that sums the `uncommonness scores` of all `traces` up to $t_{k+1}$ (the `traces` in the `look-back window` plus the `current trace`). This ensures that $p_d(t_{k+1})$ is a `probability` (between 0 and 1) relative to other `traces`.
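A minimal sketch of this computation, under the simplifying assumption that each trace (and each cluster representative) is a set of pattern features so that Jaccard similarity applies directly; the clustering step itself is assumed to be done elsewhere:

```python
# Minimal sketch of the Diversity-Biased Sampler (Section 4.2.2.2).
# clusters: list of (representative feature set, mass) pairs.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def uncommonness(trace: set, clusters: list[tuple[set, int]]) -> float:
    """1 / (mass of the closest cluster * similarity to it)."""
    rep, mass = max(clusters, key=lambda c: jaccard(trace, c[0]))
    sim = max(jaccard(trace, rep), 1e-9)  # avoid division by zero
    return 1.0 / (mass * sim)

def diversity_biased_probability(coming: set,
                                 history: list[set],
                                 clusters: list[tuple[set, int]]) -> float:
    """p_d = uncommonness(coming) normalized over all traces seen so far."""
    scores = [uncommonness(t, clusters) for t in history + [coming]]
    return scores[-1] / sum(scores)
```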
#### 4.2.2.3. Composite Sampler
The `Composite Sampler` is responsible for making the final `sampling decision` by combining the `system-biased sampling probability` $p_s$ and the `diversity-biased sampling probability` $p_d$, while also considering the `sampling budget`.
* **Decision Generation:**
    For a `coming trace` $t_{k+1}$, after the `System-Biased Sampler` and `Diversity-Biased Sampler` provide their respective probabilities $p_s$ and $p_d$, `TraStrainer` generates a `random number` $r$ in `[0, 1]`.
    * If $p_s$ is greater than $r$, the `system-biased sample result` is "True"; otherwise, "False".
    * Similarly, if $p_d$ is greater than $r$, the `diversity-biased sample result` is "True"; otherwise, "False".
* **Dynamic Voting Mechanism:**
    To ensure the overall `sampling rate` aligns with the `expected budget`, a `dynamic voting mechanism` is employed. It tracks the `recent sampling frequency` $f$ within a `look-back window` and compares it to the `budget sampling rate` $B$.
The following figure (Figure 7 from the original paper) shows the process of the dynamic voting mechanism that takes the budget into account to combine the sampling results from the two previous samplers:

* **If $f > B$ (sampling too much):** `Stricter sampling rules` are enforced. An `AND gate` is used as the `voting mechanism`: `TraStrainer` will only `sample the trace` if **both** the `System-Biased Sampler` and the `Diversity-Biased Sampler` yield a "True" result.
* **If $f \le B$ (sampling too little):** `Sampling rules` are relaxed. An `OR gate` is used as the `voting mechanism`: `TraStrainer` will `sample the trace` if **at least one** of the `samplers` produces a "True" result.
* The final `sampling decision` determines whether the `coming trace` is `stored` or `discarded`. This dynamic adjustment helps `TraStrainer` stay within the `sampling budget` while maximizing the capture of `valuable traces` according to the current `system state` and `diversity`. A sketch of this voting logic follows.
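The dynamic voting logic reduces to a few lines; the following sketch assumes the recent sampling frequency is tracked externally:

```python
# Minimal sketch of the Composite Sampler's dynamic voting (Section 4.2.2.3).
import random

def composite_decision(p_system: float, p_diversity: float,
                       recent_rate: float, budget_rate: float) -> bool:
    r = random.random()                   # one shared random draw in [0, 1)
    vote_system = p_system > r            # system-biased sample result
    vote_diversity = p_diversity > r      # diversity-biased sample result
    if recent_rate > budget_rate:         # over budget: stricter AND gate
        return vote_system and vote_diversity
    return vote_system or vote_diversity  # under budget: relaxed OR gate
```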
# 5. Experimental Setup
## 5.1. Datasets
The authors constructed two datasets, $\mathcal{A}$ and $\mathcal{B}$, to evaluate `TraStrainer`.
* **Dataset $\mathcal{A}$ Setup:**
* **Source:** Real-world data from 13 production `microservice systems` of Huawei.
* **Scale:** Involves 284 services and 1327 nodes.
* **Characteristics:** Collected `trace data` and `metric data` from 62 incidents between April 2023 and August 2023, resulting in 62 batches. The incidents covered various `fault types` such as `high CPU load`, `network delay`, `slow SQL execution`, `failed third-party package calls`, `code logic anomalies`, and `thread pool exhausted`.
* **Volume:** Comprises a total of 909,797 traces and 121 metrics.
* **Labeling:** `SREs` (Site Reliability Engineers) and technical experts annotated `uncommon` and `problem-related traces`, with an average label rate of 2.5%. They also annotated the actual `root causes` for each batch to evaluate `downstream root cause analysis accuracy`.
* **Effectiveness for Validation:** This dataset provides high realism due to being from production systems and expert annotations, making it very effective for validating the method's ability to identify `problem-related traces` and improve `RCA`.
* **Dataset $\mathcal{B}$ Setup:**
* **Source:** Derived from two widely-used `open-source microservice benchmarks`: `OnlineBoutique` [12] and `TrainTicket` [9].
* **Deployment:** Deployed on a `Kubernetes` [1] platform with 12 virtual machines (each 8-core 2.10GHz CPU, 16GB memory, Ubuntu 18.04 OS).
* **Trace Collection:** `Opentelemetry Collector` [31] was used to collect `traces`, stored in `Grafana Tempo` [13].
* **Fault Injection:** To simulate `latency` or `reliability issues`, 56 `faults` were injected using `Chaosblade` [4]. These `fault types` included `CPU contention`, `CPU consumed`, `network delay`, `code exception`, and `error return`.
* **Volume:** Collected `trace data` and `metric data` for 56 batches, totaling 112,000 traces and 32 metrics.
* **Labeling:** `Uncommon traces` and `problem-related traces` were annotated based on the injected `fault positions` and `types`, with an average label rate of 5.0%.
* **Effectiveness for Validation:** This dataset provides a controlled environment for validation due to injected faults, allowing clear ground truth for `problem-related traces` and `root causes`, complementing the real-world complexity of Dataset $\mathcal{A}$.
The following are the results from Table 1 of the original paper:
| Dataset | Microservice Benchmark | Trace Number | Metric Number | Batch Number | Uncommon Label Rate | Problem-related Label Rate | Fault Types Number |
|---|---|---|---|---|---|---|---|
| $\mathcal{A}$ | 13 Production Microservice systems | 909,797 | 121 | 62 | 2.5% | 2.5% | 6 |
| $\mathcal{B}$ | OnlineBoutique and TrainTicket | 112,000 | 32 | 56 | 5.0% | 5.0% | 5 |
## 5.2. Evaluation Metrics
The evaluation metrics assess both the `quality of sampled traces` and the `effectiveness of downstream root cause analysis`. A small computational sketch of all four metrics follows these definitions.
* **Metrics for Sampling Quality:**
1. **Proportion (PRO):**
* **Conceptual Definition:** This metric reflects the ability of a `sampling approach` to `bias` towards specific types of `valuable traces` (e.g., `uncommon`, `problem-related`). It quantifies what percentage of the total `labeled valuable traces` were successfully captured by the `sampler`. A higher `PRO` indicates better targeting of `valuable traces`.
* **Mathematical Formula:**

    $$\mathrm{PRO} = \frac{N_{sampled}}{N_{total}}$$

* **Symbol Explanation:**
    * $N_{total}$: The total number of `labeled traces` of a specific type (e.g., `uncommon`, `problem-related`) in the original, unsampled dataset.
    * $N_{sampled}$: The number of `labeled traces` of that same specific type that were successfully `sampled` and retained.
* **Variants:** The paper uses three variants: `proportion of uncommon traces`, `proportion of problem-related traces`, and `proportion of traces that are both uncommon and problem-related`.
2. **Diversity (DIV):**
* **Conceptual Definition:** This metric reflects how `representative` the `sampled traces` are in terms of covering different `execution paths` or `patterns`. It indicates the variety of `trace patterns` captured, which is important for understanding the full scope of system behavior. A higher `DIV` suggests a broader representation of `trace types`.
* **Mathematical Formula:**

    $$\mathrm{DIV} = N_{pattern}$$

* **Symbol Explanation:**
    * $N_{pattern}$: The number of distinct `trace patterns` identified by clustering the `sampled traces` based on their `execution path`.
* **Metrics for Downstream Root Cause Analysis (RCA) Effectiveness:**
These metrics are standard for evaluating `RCA` performance.
1. **Top-k Accuracy (`A@k`):**
* **Conceptual Definition:** This metric measures the probability that the `true root cause` for a given issue is present within the top $k$ ranked `root cause candidates` identified by the `RCA method`. It assesses how accurately and highly the `RCA method` ranks the correct `root cause`. Higher values indicate better `accuracy`.
* **Mathematical Formula:**

    $$A@k = \frac{1}{|I|} \sum_{i=1}^{|I|} \mathbb{1}\left( rc_i \in TopK_i \right)$$

* **Symbol Explanation:**
    * $I$: The total set of `issues` or `incidents` being analyzed; $|I|$ is the total number of `issues`.
    * $rc_i$: The `true root cause` of the $i$-th `issue`.
    * $TopK_i$: The list of the top $k$ `root cause candidates` returned by the `RCA method` for the $i$-th `issue`.
    * $\mathbb{1}(\cdot)$: An `indicator function` that equals 1 if the `true root cause` is found within the $TopK_i$ list, and 0 otherwise.
2. **Mean Reciprocal Rank (MRR):**
* **Conceptual Definition:** This metric evaluates the `ranking quality` of the `RCA method` by considering the inverse of the rank of the first `identified true root cause`. It gives higher scores if the `true root cause` is ranked higher (closer to the top). If the `true root cause` is not found in the result list, its `reciprocal rank` is considered zero. Higher values indicate better `accuracy` and `ranking`.
* **Mathematical Formula:**

    $$MRR = \frac{1}{|I|} \sum_{i=1}^{|I|} \frac{1}{rank_i}$$

* **Symbol Explanation:**
    * $I$: The total set of `issues` or `incidents` being analyzed; $|I|$ is the total number of `issues`.
    * $rank_i$: The `rank` of the `true root cause` in the list of `root cause candidates` returned for the $i$-th `issue`. If the `true root cause` is not found, $rank_i$ is considered `infinity`, and $\frac{1}{rank_i}$ becomes 0.
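As referenced above, here is a minimal sketch of how all four metrics can be computed, assuming per-trace pattern labels and per-issue ranked candidate lists are available (data layout is an assumption, not the paper's evaluation harness):

```python
# Minimal sketch of the four evaluation metrics (PRO, DIV, A@k, MRR).

def pro(sampled_labeled: int, total_labeled: int) -> float:
    """Proportion of labeled valuable traces that the sampler retained."""
    return sampled_labeled / total_labeled if total_labeled else 0.0

def div(sampled_patterns: list[str]) -> int:
    """Number of distinct trace patterns among the sampled traces."""
    return len(set(sampled_patterns))

def top_k_accuracy(true_causes: list[str], ranked: list[list[str]], k: int) -> float:
    """A@k: fraction of issues whose true root cause is in the top-k candidates."""
    hits = sum(1 for rc, cands in zip(true_causes, ranked) if rc in cands[:k])
    return hits / len(true_causes)

def mrr(true_causes: list[str], ranked: list[list[str]]) -> float:
    """Mean reciprocal rank; contributes 0 when the true root cause is absent."""
    total = 0.0
    for rc, cands in zip(true_causes, ranked):
        total += 1.0 / (cands.index(rc) + 1) if rc in cands else 0.0
    return total / len(true_causes)
```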
## 5.3. Baselines
The paper compares `TraStrainer` against several `baseline sampling approaches` and evaluates its impact on various `downstream root cause analysis approaches`.
### 5.3.1. Baseline Sampling Approaches
These methods represent different strategies for `trace sampling`, from simple random to `diversity-biased`.
* **Random:** This is a `head-based sampling approach` where each `trace` is captured with an equal, fixed probability at its inception.
* **Representativeness:** It's the simplest and most fundamental `sampling` strategy, serving as a lower bound for `sampling quality`.
* **HC [20]:** A `tail-based sampling approach` that uses `hierarchical clustering` to group `traces` based on `label counting` (e.g., error status, service invocation counts) and then `samples` to maximize `diversity`.
* **Representativeness:** An `offline method` that was an early attempt at `biased sampling` towards `diversity`.
* **Sifter [21]:** An `online tail-based sampling approach`. It learns an approximation of the `distributed system's common-case behavior` and then `samples new traces` based on how `poorly represented` they are by this `common-case model` (i.e., `uncommon traces`).
* **Representativeness:** A `state-of-the-art online diversity-biased sampler` that focuses on `uncommonness`.
* **Sieve [16]:** Another `online tail-based sampling approach`. It employs `Robust Random Cut Forest (RRCF)` to `detect uncommon traces` and `samples` them with a `high probability`. It considers both `trace structure` and `latency`.
* **Representativeness:** A `state-of-the-art online diversity-biased sampler` that is more sophisticated than `Sifter` in detecting `uncommonness`.
### 5.3.2. Baseline Downstream Analysis Approaches
These are `state-of-the-art trace-based RCA methods` used to assess the practical value of the `sampled traces`.
* **TraceAnomaly [26]:** This method leverages `deep learning`, specifically `Variational Autoencoders (VAE)`, to learn `normal trace patterns` from `trace data`. It then `detects abnormal traces` and `identifies root causes` by observing deviations from these learned `normal patterns`.
* **Representativeness:** A representative `RCA method` that relies on `deep learning` for `anomaly detection` and `root cause localization`, sensitive to the availability of `normal traces` for training.
* **TraceRCA [23]:** This approach utilizes `spectrum analysis` to pinpoint `root cause services`. It does so by examining the `proportion of normal and abnormal invocations` within `traces` flowing through different services.
* **Representativeness:** A representative `RCA method` that uses `statistical analysis` (`spectrum analysis`) and benefits from a balanced collection of both `normal` and `abnormal traces`.
* **MicroRank [45]:** This method identifies and ranks `root causes` by combining a `personalized PageRank algorithm` with `spectrum analysis`. It propagates `anomaly scores` through the `service dependency graph`.
* **Representativeness:** A representative `RCA method` combining `graph-based analysis` with `spectrum analysis`, also sensitive to the quality and balance of `sampled traces`.
### 5.3.3. Variants of TraStrainer (Ablation Studies)
To evaluate the individual contributions of its components, `TraStrainer` is also compared against its own variants:
* **TraStrainer w/o M (without System Runtime State):** This variant only considers `trace diversity` to set `sampling preferences`. It effectively functions by using only the `sampling results` from the `Diversity-Biased Sampler` as the final outcome.
* **Purpose:** To isolate the contribution of `system runtime state` to `TraStrainer's` performance.
* **TraStrainer w/o D (without Trace Diversity):** This variant only considers the `system runtime state` to set `sampling preferences`. It achieves this by using only the `sampling results` from the `System-Biased Sampler` as the final outcome.
* **Purpose:** To isolate the contribution of `trace diversity` to `TraStrainer's` performance.
## 5.4. Warm-up Period
The paper mentions a common practice for `online samplers`: they make `essentially random sampling decisions` in the initial phase due to `random initialization` of their `internal models`. To ensure fair comparison and stable results, all `online samplers` (including `TraStrainer`, `Sifter`, `Sieve`, and their variants) undergo a `warm-up period` using 10% of the `traces` before actual testing and evaluation commence. This allows their models to stabilize and learn initial patterns.
# 6. Results & Analysis
## 6.1. Core Results Analysis
The experimental evaluation aimed to answer four research questions (RQs) regarding `TraStrainer's` sampling quality, its effectiveness in `downstream root cause analysis`, the contribution of combining `system state` and `trace diversity`, and its `efficiency`.
### 6.1.1. RQ1: Sampling Quality of TraStrainer
This RQ evaluates `TraStrainer's` ability to `bias sampling` towards `problem-related` and `edge-case traces` (`bias sampling`) and its capacity to prioritize `underrepresented` and `abnormal request types` (`representative sampling`). Experiments were conducted on both Dataset $\mathcal{A}$ and Dataset $\mathcal{B}$ across various `budget sampling rates` (0.1%, 1%, 2.5%, 5%, 10%).
The following figure (Figure 8 from the original paper) shows the proportions of uncommon, related, and uncommon-related traces sampled by different approaches across different budget settings in two datasets:

#### 6.1.1.1. Bias Sampling
* **Bias towards Uncommon Traces:**
* Figure 8 (a1) and (b1) show that `TraStrainer` and `Sieve` exhibit similar, strong performance in sampling `uncommon traces`, significantly outperforming `Random`, `HC`, and `Sifter`.
* `Random sampling` achieves `PRO` roughly equivalent to the `budget`.
* `HC` and `Sifter` perform poorly (`PRO` below 0.7) because they primarily consider `trace structure` and lack the ability to identify `uncommon traces` that differ mainly in `execution time` or `latency`.
* At low budgets, `TraStrainer` slightly outperforms `Sieve` due to its `metric-based encoding` effectively distinguishing anomalies.
* When the budget equals the `label rate` (2.5% in $\mathcal{A}$, 5% in $\mathcal{B}$), `TraStrainer's PRO` might be slightly lower than `Sieve's`. This is attributed to `TraStrainer's` dual bias (towards both `uncommon` and `problem-related traces`), where `problem-related traces` can sometimes include `common-case` ones.
* When the budget exceeds twice the `label rate`, both `TraStrainer` and `Sieve` achieve a `PRO` of 1, meaning all `uncommon traces` are captured.
* **Bias towards Problem-related Traces:**
* Figure 8 (a2) and (b2) demonstrate `TraStrainer's` significant superiority in sampling `problem-related traces` across both datasets.
* `Sieve`, `Sifter`, and `HC` show `PRO` growth similar to `Random` when the budget is high, with `PRO` remaining below 0.5. This highlights their inability to prioritize `problem-related traces` effectively, as they do not consider `system runtime state`.
* `TraStrainer`, by incorporating `system runtime state`, achieves a `PRO` of around 0.6 when the budget equals the `label rate` and approaches 1.0 when the budget is twice the `label rate`, indicating it captures almost all `problem-related traces`.
* **Bias towards Uncommon Problem-related Traces:**
* Figure 8 (a3) and (b3) show that `TraStrainer` significantly outperforms baselines, even at lower budgets.
* When the `budget rate` equals the `label rate` (1% in $\mathcal{A}$, 2.5% in $\mathcal{B}$), `TraStrainer` achieves a `PRO` above 0.9, whereas baselines remain below 0.55. This indicates `TraStrainer's` strong preference for the most critical traces: those that are both `uncommon` and `problem-related`.
#### 6.1.1.2. Representative Sampling
* **Diversity of the Sampling Result:**
The paper mentions a Table 2 showing `diversity` but does not include the table in the provided text. It states that `TraStrainer w/o M` (which focuses purely on diversity) achieves the best performance in `DIV` for both datasets, validating `TraStrainer's metric-based trace encoding` for distinguishing `trace patterns` and ensuring `fair sampling`. `Sieve` follows closely. `TraStrainer's` `DIV` is slightly lower (less than 5% difference from `TraStrainer w/o M`) because it also prioritizes `problem-related traces`, which might reduce pure `diversity`. `HC` and `Sifter` have noticeably lower `DIV` due to neglecting `latency` in encoding. `TraStrainer w/o D` (which does not consider diversity) shows no clear pattern in `DIV`.
* **Representative Ability (API Distribution):**
The paper analyzed the distribution of different `APIs` in the `sampling results` for one batch from each dataset.
* In Dataset $\mathcal{A}$, 8 `high-level API calls` were present, with API-4 and API-8 identified as `exceptional interfaces`.
* In Dataset $\mathcal{B}$, 5 `high-level API calls` were present, with API-4 considered an `exceptional interface`.
The following figure (Figure 9 from the original paper) shows the distribution of different APls in the sampling results:

* Figure 9 shows that `TraStrainer` enhances the `representation` of `less frequently used APIs`, leading to a more `balanced sampling`. Critically, it also improves the `representation` of `APIs` that are `related to system issues` and are `deteriorating`, thanks to its consideration of `system state`.
* `HC` and `Sifter` only slightly increase `low-frequency API representation`.
* `Sieve` achieves `balanced sampling results` but fails to enhance the `representation` of `deteriorating APIs`. This implies `TraStrainer's` ability to capture `contextually important traces` beyond mere `statistical uncommonness`.
### 6.1.2. RQ2: Effectiveness of TraStrainer for Downstream Root Cause Analysis
This RQ evaluates how `sampling strategies` impact `RCA effectiveness`. Sampled traces from various methods were fed as input to three `state-of-the-art RCA methods` (`TraceAnomaly`, `TraceRCA`, `MicroRank`). Experiments were conducted at `budget sampling rates` of 0.1%, 1.0%, and 2.5%.
The following are the results from Table 3 of the original paper:
| RCA Approach | Sampling Approach | A@1 (0.1%) | A@1 (1.0%) | A@1 (2.5%) | A@3 (0.1%) | A@3 (1.0%) | A@3 (2.5%) | MRR (0.1%) | MRR (1.0%) | MRR (2.5%) |
|---|---|---|---|---|---|---|---|---|---|---|
| TraceAnomaly | Random | 10.71 | 9.26 | 16.67 | 44.73 | 57.41 | 57.41 | 0.2820 | 0.3503 | 0.3997
| | HC | 9.26 | 12.96 | 18.52 | 37.04 | 42.59 | 55.56 | 0.2590 | 0.3664 | 0.3747
| | Sifter | 11.67 | 24.07 | 16.67 | 37.04 | 57.41 | 62.96 | 0.2753 | 0.4025 | 0.4145
| | Sieve | 8.81 | 18.52 | 29.63 | 44.44 | 53.70 | 57.41 | 0.2620 | 0.3762 | 0.4383
| | TraStrainer w/o M | 11.67 | 20.37 | 22.22 | 45.15 | 51.85 | 55.56 | 0.2903 | 0.3722 | 0.4068
| | TraStrainer w/o D | 12.96 | 38.89 | 44.44 | 49.81 | 77.78 | 75.93 | 0.3485 | 0.5948 | 0.6247
| | **TraStrainer** | **46.30** | **51.61** | **54.84** | **66.67** | **79.19** | **87.10** | **0.5707** | **0.6438** | **0.7151**
| TraceRCA | Random | 7.41 | 20.37 | 29.63 | 40.74 | 61.11 | 68.52 | 0.2525 | 0.4123 | 0.4991
| | HC | 9.26 | 24.07 | 24.07 | 46.30 | 62.96 | 62.96 | 0.2546 | 0.4324 | 0.4627
| | Sifter | 8.67 | 19.63 | 25.19 | 37.04 | 55.56 | 61.11 | 0.2449 | 0.4272 | 0.4836
| | Sieve | 18.52 | 31.48 | 38.89 | 42.59 | 51.85 | 57.41 | 0.3008 | 0.4157 | 0.4873
| | TraStrainer w/o M | 18.52 | 33.33 | 35.19 | 44.44 | 55.56 | 55.56 | 0.3191 | 0.4432 | 0.4642
| | TraStrainer w/o D | 24.07 | 55.56 | 55.56 | 38.89 | 81.48 | 77.78 | 0.3650 | 0.6880 | 0.6843
| | **TraStrainer** | **55.56** | **55.56** | **58.06** | **70.37** | **85.19** | **89.63** | **0.6265** | **0.7019** | **0.7510**
| MicroRank | Random | 5.56 | 16.67 | 27.78 | 20.37 | 50.00 | 61.11 | 0.1571 | 0.3423 | 0.4352
| | HC | 7.41 | 18.52 | 22.22 | 27.78 | 46.30 | 51.85 | 0.1954 | 0.3398 | 0.3731
| | Sifter | 5.56 | 18.52 | 27.78 | 23.42 | 46.30 | 61.11 | 0.1605 | 0.3414 | 0.4358
| | Sieve | 9.26 | 25.83 | 35.19 | 20.37 | 58.15 | 62.96 | 0.1657 | 0.4246 | 0.4963
| | TraStrainer w/o M | 12.96 | 16.67 | 24.07 | 42.59 | 42.59 | 55.56 | 0.2994 | 0.3241 | 0.4012
| | TraStrainer w/o D | 29.63 | 42.59 | 46.30 | 74.04 | 68.52 | 72.22 | 0.5228 | 0.5463 | 0.5509
| | **TraStrainer** | **42.59** | **45.16** | **50.00** | **77.74** | **78.52** | **82.26** | **0.5509** | **0.5889** | **0.6556**
* **Overall Performance:** `TraStrainer` consistently achieves the best results across all `RCA methods` and `metrics` (`A@1, A@3`, and `MRR`), particularly at lower `sampling budgets`. This highlights its ability to provide higher `quality sampled data` for `downstream analysis`.
* **Impact on `TraceAnomaly`:**
* `TraStrainer` significantly improved `A@1`, `A@3`, and `MRR` by 33.78%, 26.88%, and 31.45% respectively (over the best baseline).
* `TraceAnomaly`, being `VAE-based`, needs `normal traces` to learn patterns. Baselines (`HC`, `Sifter`, `Sieve`) performed poorly at low budgets, sometimes worse than `Random`, because their `diversity-only bias` led to predominantly `abnormal traces`, making it hard for `VAE` to learn `normal patterns`.
* `TraStrainer` performs well even at low budgets because it samples both `abnormal` and `problem-related common traces`, providing a more balanced view that aids `VAE` in learning. At 2.5% budget, baselines' `A@1` was below 30%, while `TraStrainer` reached 54.84% `A@1` and 0.7151 `MRR`.
* **Impact on `TraceRCA` and `MicroRank`:**
* These methods use `spectrum analysis` (comparing `normal` vs. `abnormal invocations`). `TraStrainer` improved `A@1`, `A@3`, and `MRR` by 31.48%, 38.34%, and 28.96% respectively (over the best baseline).
* `HC` and `Sifter` performed similarly to `Random` because they only consider `trace structure` during `encoding`, providing no assistance to `latency-based RCA`.
* `Sieve` performed poorly at lower budgets because `spectrum analysis` needs both `normal` and `anomalous traces`, and `Sieve's` strong `uncommonness bias` might not provide enough `normal traces`.
* `TraStrainer` consistently achieved good results across budgets, maintaining `A@3` above 70%. This is because its `comprehensive sampling preferences` ensure that even at low budgets, it discards `traces unrelated to the problem` rather than `valuable traces` for `spectrum analysis`, enabling better `RCA performance`.
### 6.1.3. RQ3: Contribution of the Combination of System Status and Trace Diversity
This RQ uses `ablation experiments` with `TraStrainer w/o M` (diversity only) and `TraStrainer w/o D` (system state only) to analyze the impact of combining `system state` and `trace diversity`.
* **API Distribution (Representative Ability):**
Figure 10 of the original paper shows the distribution of different `APIs` in the sampling results of the different variants on Dataset $\mathcal{B}$.
* `TraStrainer w/o D` shows a significant imbalance: the deteriorating `API-4` accounts for over 70% of the sampled traces, while the underrepresented `API-5` is absent entirely. Focusing only on `system state` thus oversamples problem-related `APIs` to the detriment of diversity.
* TraStrainer w/o M produces relatively balanced sampling results on average but fails to bias towards the more crucial deteriorating API-4. This confirms that diversity alone might miss contextually important traces.
* TraStrainer (full version) effectively enhances diversity and shows a stronger inclination towards the problematic API.
* **Effectiveness in Downstream Analysis (Table 3):**
* `TraStrainer w/o D` (system state only) generally performs better than `TraStrainer w/o M` (diversity only). This suggests that prioritizing `problem-related traces` (via `system state`) is more beneficial for `RCA tasks` than solely focusing on `uncommon traces`.
* At a low budget (0.1%), `TraStrainer w/o D` has a significantly lower `A@1` than `TraStrainer`: its sampling preference (problem-relatedness) does not explicitly bias towards `uncommon abnormal traces`, so it can miss some `root cause traces` at very low budgets.
* At a high budget (2.5%), `TraStrainer w/o D` approaches the performance of `TraStrainer`, because with more budget it eventually captures most `problem-related traces`, including the culprits.
* `TraStrainer` (full version), by considering both `system state` and `trace diversity`, achieves superior `analysis performance` across all budgets. This empirically validates the paper's core motivation that a `comprehensive sampling preference` combining both factors is most effective.
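To make the interplay of the two biases more tangible, here is a simplified, hypothetical Python sketch of how a system-state score and a diversity score could be combined into one keep/drop decision under a target budget. It is not the paper's exact dynamic voting algorithm: the fixed weight `w_system` and the adaptive threshold are stand-ins for the paper's per-trace vote weighting.

```python
# Simplified sketch (not the paper's exact dynamic voting mechanism) of
# combining a system-state bias score and a diversity bias score into a
# single sampling decision while tracking a target budget.

class CombinedSampler:
    def __init__(self, budget=0.01, w_system=0.5):
        self.budget = budget      # target fraction of traces to keep
        self.w = w_system         # weight of the system-state vote (hypothetical)
        self.threshold = 0.5      # adaptive acceptance threshold
        self.seen = 0             # traces observed so far
        self.kept = 0             # traces sampled so far

    def decide(self, system_bias, diversity_bias):
        """Both bias scores lie in [0, 1]; returns True to keep the trace."""
        self.seen += 1
        score = self.w * system_bias + (1 - self.w) * diversity_bias
        keep = score >= self.threshold
        self.kept += keep
        # Feedback loop: tighten the threshold when over budget, relax it
        # when under budget, so the realized rate tracks the target budget.
        if self.kept / self.seen > self.budget:
            self.threshold = min(1.0, self.threshold + 0.01)
        else:
            self.threshold = max(0.0, self.threshold - 0.01)
        return keep
```

The feedback loop makes the realized sampling rate converge toward the budget, which mirrors the budget-adherence behavior that the ablation results above depend on.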
### 6.1.4. RQ4: Efficiency of TraStrainer
This RQ evaluates `TraStrainer`'s efficiency for `online sampling`. The sampler component was deployed as a single process on an Intel Xeon Gold 5318Y 2.10GHz CPU.
Figure 11 of the original paper shows the efficiency validation of `TraStrainer`:
* **Sampling Latency Distribution (Figure 11a):** `TraStrainer` generally exhibits smaller `sampling latencies` (0.28 ms to 14.29 ms) than the baselines. Its maximum latency is 40.82% to 50.16% lower than the baselines', demonstrating its suitability for `online deployment` with low latency requirements.
* **Impact of Trace Size (Figure 11b):** With the `metric number` fixed at 75, `sampling latency` increases linearly with `trace size` (number of spans). However, the maximum latency remains below 15 ms, indicating good scalability with `trace complexity`.
* **Impact of Metric Number (Figure 11c):** With the `trace size` fixed at 50, `sampling latency` increases linearly while the `metric number` is low. Crucially, once the `metric number` exceeds 100, the growth rate of the sampling latency becomes much slower. This is attributed to `TraStrainer`'s `dimension reduction` strategies, which effectively prevent `dimension explosion` and maintain efficiency even with a large number of `system metrics`.
* **Conclusion on Efficiency:** `TraStrainer` is efficient enough for `online sampling`, and its `dimension reduction` strategy effectively handles large numbers of `metrics`.
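The flattening latency curve in Figure 11c can be illustrated with a toy sketch: if trace encoding only ever touches the top-k metrics ranked by an anomaly score, per-trace cost stops growing once the metric count exceeds k. The top-k selection below is an assumption for illustration; the paper's actual dimension reduction strategy may differ.

```python
# Toy illustration of bounding the encoding dimension via top-k metric
# selection (an assumed strategy; the paper's actual reduction may differ).
# Encoding cost then stops growing once the raw metric count exceeds k.

def reduce_metrics(metric_scores, k=100):
    """metric_scores: dict of metric name -> anomaly score; returns top-k names."""
    return sorted(metric_scores, key=metric_scores.get, reverse=True)[:k]

def encode_trace(span_values, kept_metrics):
    """Encode a trace as a fixed-length vector over the kept metric dimensions."""
    return [span_values.get(m, 0.0) for m in kept_metrics]

# 500 raw metrics, but the trace vector stays at 100 dimensions, matching
# the flattening latency curve once the metric number exceeds ~100.
scores = {f"metric-{i}": (i * 37 % 101) / 100 for i in range(500)}
kept = reduce_metrics(scores)
vector = encode_trace({"metric-3": 0.9, "metric-42": 0.4}, kept)
print(len(kept), len(vector))  # 100 100
```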
6.2. Data Presentation (Tables)
The tables from the paper have been transcribed and presented in the relevant sections above (Table 1 in Datasets, Table 3 in RCA Effectiveness).
6.3. Ablation Studies / Parameter Analysis
The ablation studies are presented as part of RQ3 (Contribution of the Combination of System Status and Trace Diversity). They highlight the individual and combined impact of system state and trace diversity components. The key findings are:
* Both components contribute to `TraStrainer`'s overall superiority.
* The `system state` component (`TraStrainer w/o D`) is particularly strong at identifying `problem-related traces` and aiding `RCA`.
* The `diversity` component (`TraStrainer w/o M`) ensures balanced representation of `APIs` and helps capture `uncommon patterns`.
* The combination in the full `TraStrainer` provides the best of both worlds, leading to robust performance across different `budgets` and `RCA tasks`. This confirms that the synergy between `system-state bias` and `diversity bias` is critical to `TraStrainer`'s effectiveness.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study introduces TraStrainer, a novel online biased sampler for distributed traces in microservice systems. Its core innovation lies in its comprehensive approach to sampling preferences, which integrates both system runtime state and trace diversity, allowing for dynamic adjustment based on real-time conditions. TraStrainer employs an interpretable and automated encoding method for traces and utilizes a dynamic voting mechanism to combine system-state bias and diversity bias into a final sampling decision.
Through extensive evaluation on two diverse datasets (real-world production and benchmark), TraStrainer demonstrated significant improvements. It effectively identified more valuable traces within the same budget, leading to an average increase of 32.63% in Top-1 RCA accuracy compared to four baseline methods. Furthermore, TraStrainer exhibited higher efficiency suitable for online deployment, even with varying trace sizes and large numbers of metrics, thanks to its dimension reduction strategies. The ablation studies confirmed that the combination of system state and trace diversity is crucial for its superior performance in downstream analysis tasks.
7.2. Limitations & Future Work
The authors acknowledge the following limitations and suggest future work:
* **Tail-based Sampling Only:** The current `TraStrainer` implementation is purely `tail-based`. The authors suggest that their method for setting `sampling preferences` could also benefit other types of `biased sampling`, specifically `retroactive sampling` [52], which retrieves `trace data` lazily so that `biased sampling decisions` can be made earlier in the `trace lifecycle`, potentially improving efficiency. Integrating `TraStrainer`'s preference setting with `retroactive sampling` could be a promising future direction.
* **Limited Fault Types:** Although tested on data from 13 production systems and 2 benchmarks, the number of distinct `fault types` covered was limited (7 in total). The authors note that `microservice systems` can exhibit more complex `fault types`, so further testing is needed to assess `TraStrainer`'s effectiveness in more intricate `failure scenarios`.
* **Implementation of Baselines:** Since `HC` and `Sifter` lack open-source implementations, the authors developed their own versions based on the papers, which introduces a potential threat to `internal validity`. However, they took care to use the same libraries.
7.3. Personal Insights & Critique
This paper presents a highly valuable and well-motivated contribution to the field of microservice monitoring and observability.
* **Innovation and Practical Relevance:** The core innovation of integrating `system runtime state` with `trace diversity` is particularly astute. It directly addresses a critical oversight in previous `sampling methods` and aligns closely with how `SREs` intuitively triage problems (e.g., "CPU is high on Node X, so look at traces touching Node X"). This makes `TraStrainer` not just theoretically sound but also immensely practically relevant for `production environments`. The significant improvement in `RCA accuracy` is a strong testament to its real-world utility.
* **Interpretability:** The emphasis on `interpretable trace encoding`, where dimensions correspond directly to `system metrics`, is a major strength. In `debugging` and `root cause analysis`, understanding why a particular `trace` was sampled is almost as important as the `sampling quality` itself. This interpretability fosters trust and makes the system more actionable.
* **Robustness against Data Volume:** The `dimension scalability` and `reduction strategies` in the `Trace Encoder`, combined with the demonstrated efficiency under large numbers of `metrics`, reflect a thoughtful design for `real-world microservice systems` that constantly evolve and generate vast amounts of `telemetry data`.
* **Dynamic Adaptability:** The `dynamic voting mechanism` is a clever way to ensure `budget adherence` while flexibly balancing `sampling biases`. This adaptability is crucial in dynamic `microservice environments` where `anomaly rates` and `system loads` can fluctuate wildly.
* **Potential Issues/Areas for Improvement:**
    * **Cold Start for System Bias Extractor:** The `DLinear` model for `anomaly detection` requires historical data to learn `normal patterns`. In a newly deployed system, or for entirely new `metrics`, a `cold start` period might yield a less accurate `system bias` until sufficient `time-series data` is collected. The paper mentions a warm-up period for the `sampler`, but the `System Bias Extractor` itself might also benefit from explicit cold-start handling (a minimal DLinear-style sketch follows this list).
    * **Thresholds and Sensitivity:** The `budget sampling rate` is a critical parameter. While the `dynamic voting mechanism` adapts the actual sampling rate relative to the budget, the initial budget setting and its interaction with the `anomaly detection thresholds` in the `System Bias Extractor` or the `diversity clustering` could be sensitive. Further analysis on tuning these parameters would be beneficial.
    * **Generalizability of Fault Types:** While the authors acknowledge the limited `fault types`, deeper investigation into complex, interacting `faults` (e.g., cascading failures involving multiple services and resource contention) would be valuable. The current `metric-based encoding` should, in principle, handle these, but empirical validation is key.
    * **Integration with Retroactive Sampling:** The suggestion to integrate with `retroactive sampling` is compelling. It would allow `TraStrainer`'s intelligent `preference setting` to influence `sampling decisions` earlier, potentially saving even more `computational overhead` in `trace collection` pipelines.
    * **Computational Cost of Clustering:** The `Diversity-Biased Sampler` involves `clustering` trace vectors in the `look-back window`. While `DLinear` is lightweight, `clustering` can be computationally intensive, especially for large `look-back windows` or high-dimensional vectors. The paper does not detail the specific `clustering algorithm` used or its overhead.
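Regarding the cold-start concern above, the following minimal sketch shows what a DLinear-style forecaster for a single metric series looks like: decompose each window into a moving-average trend and a remainder, fit one linear map per component, and score anomalies by forecast error. The window size, kernel size, and least-squares fitting are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Minimal DLinear-style forecaster sketch for one metric series (assumed
# configuration, not the paper's). Expects `series` as a 1-D numpy array.

def decompose(window, kernel=5):
    """Split a window into a moving-average trend and the remainder."""
    pad = kernel // 2
    padded = np.pad(window, pad, mode="edge")
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    return trend, window - trend

def fit_dlinear(series, window=24):
    """Jointly fit per-component linear maps (trend, remainder) -> next value."""
    X, y = [], []
    for i in range(len(series) - window):
        t, r = decompose(series[i:i + window])
        X.append(np.concatenate([t, r]))
        y.append(series[i + window])
    w = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)[0]
    return w[:window], w[window:]

def anomaly_score(series, w_t, w_r, window=24):
    """Absolute error between the last point and its one-step forecast."""
    t, r = decompose(series[-window - 1:-1])
    return abs(series[-1] - (t @ w_t + r @ w_r))

rng = np.random.default_rng(0)
cpu = np.sin(np.linspace(0, 20, 300)) + rng.normal(0, 0.05, 300)  # synthetic CPU metric
w_t, w_r = fit_dlinear(cpu)
print(anomaly_score(cpu, w_t, w_r))  # small error on a normal pattern
```

The point of the sketch is that both linear maps are learned from history, so a freshly deployed system (or a brand-new metric) has nothing to fit against until enough data accumulates.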
In conclusion, `TraStrainer` represents a significant step forward in `trace sampling`, offering a more intelligent and context-aware approach that promises to deliver highly actionable data for `microservice operations`, moving beyond simple `diversity` to `system-aware intelligence`.