TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State

Published: July 12, 2024

TL;DR Summary

TraStrainer adaptively samples distributed traces by integrating system runtime state and trace diversity, improving sampling quality and root cause analysis accuracy by over 30% while reducing overhead.

Abstract

Distributed tracing has been widely adopted in many microservice systems and plays an important role in monitoring and analyzing the system. However, trace data often come in large volumes, incurring substantial computational and storage costs. To reduce the quantity of traces, trace sampling has become a prominent topic of discussion, and several methods have been proposed in prior work. To attain higher-quality sampling outcomes, biased sampling has gained more attention compared to random sampling. Previous biased sampling methods primarily considered the importance of traces based on diversity, aiming to sample more edge-case traces and fewer common-case traces. However, we contend that relying solely on trace diversity for sampling is insufficient; system runtime state is another crucial factor that needs to be considered, especially in cases of system failures. In this study, we introduce TraStrainer […]


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State

1.2. Authors

Haiyu Huang (Sun Yat-sen University, China); Xiaoyu Zhang (Huawei, China); Pengfei Chen, Zilong He, Zhiming Chen, Guangba Yu, and Hongyang Chen (Sun Yat-sen University, China); Chen Sun (Huawei, China)

1.3. Journal/Conference

Proc. ACM Softw. Eng. 1, FSE, Article 22 (July 2024), 21 pages. The publication venue, FSE (ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering), is a highly reputable and influential conference in the field of software engineering. Its selective nature ensures that accepted papers represent significant contributions and high-quality research.

1.4. Publication Year

2024

1.5. Abstract

Distributed tracing is essential for monitoring microservice systems but generates large volumes of data, causing high computational and storage costs. Existing sampling methods often rely on trace diversity to prioritize sampling, which may overlook important system runtime conditions, especially during failures. We propose TraStrainer, an online sampling framework integrating both system runtime state and trace diversity. TraStrainer encodes traces into vectors and adaptively adjusts sampling preferences by analyzing real-time system metrics. It combines system-state bias and diversity bias through a dynamic voting mechanism to select high-value traces. Experimental evaluation shows TraStrainer significantly improves sampling quality and downstream root cause analysis, achieving a 32.63% average increase in Top-1 accuracy over four baselines on two datasets, demonstrating its effectiveness in producing actionable trace data while reducing overhead.

PDF: /files/papers/6901a5fc84ecf5fffe471766/paper.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the high computational and storage costs associated with distributed tracing in microservice systems. Distributed tracing, using frameworks like OpenTelemetry or SkyWalking, provides invaluable insights into system behavior, aids in anomaly detection, and assists in root cause analysis (RCA). However, the sheer volume of trace data generated by large-scale microservice systems (millions to billions daily) makes storing and analyzing all of it impractical and expensive.

This problem is important because effective monitoring and troubleshooting are critical for maintaining the reliability and performance of complex, dynamic microservice environments. While trace sampling techniques exist to reduce data volume by selectively capturing traces, existing methods face significant challenges:

  1. Limited Scope of Existing Biased Sampling: Previous biased sampling approaches primarily focus on trace diversity, aiming to capture edge-case (uncommon) traces while discarding common-case ones. This is based on the intuition that edge-case traces are more informative.

  2. Neglect of System Runtime State: A critical gap is that these methods often disregard the system runtime state. The paper argues that the definition of a "valuable trace" can change dramatically depending on whether the system is operating normally or experiencing a failure. For instance, common-case traces accessing a database might become highly valuable if the database server experiences an exception or high CPU utilization.

  3. Inadequacy for Downstream Analysis: Solely favoring edge-case traces is insufficient because common-case traces can also be related to root causes or are necessary for RCA algorithms to learn normal patterns or perform spectrum analysis. When anomaly rates are high, existing methods might still miss the most significant abnormal traces due to budget constraints, lacking a mechanism to prioritize among numerous anomalies.

    The paper's entry point and innovative idea is to propose TraStrainer, an online biased trace sampler that explicitly considers both system runtime state and trace diversity to adaptively determine sampling preferences. This allows for a more comprehensive and context-aware selection of valuable traces.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • A Novel Online Biased Trace Sampler: Proposes TraStrainer, an online biased trace sampler that innovatively integrates system runtime state with trace diversity to adaptively determine sampling preferences. This provides a more comprehensive approach than existing methods.

  • Interpretable and Automated Trace Representation: Introduces an interpretable and automated method for trace encoding. The generated trace vector captures both structural and status information, with each system metric corresponding to a specific dimension, allowing for dynamic expansion and reduction.

  • Impact on Downstream Root Cause Analysis: Systematically investigates the impact of different sampling methods on downstream root cause analysis (RCA) tasks by combining TraStrainer with several state-of-the-art trace-based RCA approaches.

  • Empirical Validation and Superior Performance: Implements TraStrainer and constructs two diverse datasets (one from 13 real-world production microservice systems, another from OnlineBoutique and TrainTicket benchmarks) to validate its sampling quality, effectiveness in improving downstream analysis tasks, and efficiency.

    The key conclusions and findings are:

  • TraStrainer significantly improves sampling quality, identifying more valuable traces within the same budget compared to four baseline methods.

  • It substantially enhances the performance of downstream root cause analysis (RCA) tasks, achieving an average increase of 32.63% in Top-1 RCA accuracy compared to four baselines across two datasets.

  • The combination of system runtime state and trace diversity is crucial, with TraStrainer outperforming variants that consider only one factor. Prioritizing problem-related traces (via system state) is particularly impactful for analysis tasks.

  • TraStrainer demonstrates higher efficiency than other online biased sampling methods, with low sampling latencies suitable for online deployment and effective dimension reduction strategies. These findings directly address the problem of high data volume by providing a smarter sampling mechanism that retains more actionable data for analysis.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the TraStrainer paper, a novice reader should understand the following fundamental concepts:

  • Microservice Systems: A software architecture where complex applications are composed of small, independent processes communicating with each other. Each service is self-contained, can be developed, deployed, and scaled independently. This distributed nature makes monitoring challenging.

  • Distributed Tracing: A method used to monitor requests as they flow through various services in a microservice system.

    • Trace: Represents the end-to-end path of a single request through multiple services.
    • Span: A logical unit of work within a trace, representing an operation or a call to a service. Each span has a unique ID, parent span ID (for hierarchical relationships), start/end timestamps, service name, operation name, and can include event annotations (key-value pairs with additional information like errors or business-specific data).
    • Trace Context: Information (like TraceID and SpanID) that is propagated across service boundaries to link spans together into a complete trace.
    • OpenTelemetry/SkyWalking: Open-source frameworks that provide APIs, libraries, and agents to instrument, generate, collect, and export telemetry data (traces, metrics, logs) from cloud-native applications. They are implementations of distributed tracing.
  • System Metrics: Time-series data reflecting the runtime state and performance of system components (e.g., CPU utilization, memory usage, disk I/O, network latency, request rates, error rates). These metrics are crucial for identifying anomalies and understanding system health.

    • Time-series data: A sequence of data points indexed in time order.
    • Anomaly Detection: The process of identifying data points, events, or observations that deviate significantly from the majority of the data, indicating unusual behavior.
  • Trace Sampling: The process of selectively capturing a subset of traces to reduce the volume of data stored and processed, while still aiming to retain enough information for monitoring and analysis.

    • Head-based Sampling: The decision to sample a trace is made at the very beginning of the request (at the "head" of the trace). This is simple but cannot use information from later spans (e.g., whether an error occurred).
    • Tail-based Sampling (Biased Sampling): The decision to sample a trace is made after the entire trace has completed (at the "tail" of the trace). This allows the sampling logic to use all available information within the trace (e.g., latency, error status, specific operations) to determine its "value" or "importance," enabling biased sampling towards valuable traces.
    • Budget Sampling Rate ($\beta$): The target percentage or proportion of traces that should be sampled and retained.
  • Root Cause Analysis (RCA): The process of identifying the underlying reasons for a problem or an anomaly in a system. In microservice systems, this often involves pinpointing which service, host, or component is responsible for a detected issue based on trace data and metrics.

  • Machine Learning/Statistical Concepts:

    • Vector Representation/Embedding: Transforming complex data (like a trace) into a numerical vector in a high-dimensional space. This allows computational models to process and compare traces mathematically.
    • Clustering: An unsupervised machine learning task that groups a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
    • Jaccard Similarity: A statistic used for comparing the similarity and diversity of sample sets. It measures the size of the intersection divided by the size of the union of the sample sets.
    • Time Series Forecasting (TSF): The use of models to predict future values based on previously observed values of a time series.
    • Neural Network Models: A type of machine learning model inspired by the structure of the human brain, capable of learning complex patterns from data. LSTM (Long Short-Term Memory) and Transformer are advanced neural network architectures used for sequence data, while linear models are simpler forms.

3.2. Previous Works

The paper primarily focuses on trace sampling and its impact on root cause analysis (RCA).

Trace Sampling Approaches:

  • Uniform Random Sampling: Used by early tracing systems like Dapper [37], Jaeger [17], and Zipkin [2]. Decision is made at the trace head with equal probability.
    • Limitation: Fails to guarantee the quality of sampled traces, often missing valuable or anomalous traces.
  • Biased Sampling (Tail-based): Aims to select more "important" traces after their completion. Most prior works define importance based on trace diversity or uncommonness.
    • HC (Hierarchical Clustering) [20]: An offline tail-based sampling approach that uses hierarchical clustering to group traces by label counting (e.g., error status, latency bucket) to maximize diversity.
      • Differentiation from TraStrainer: Offline (not suitable for real-time), relies on label counting (potentially limited features), focuses only on diversity.
    • Sifter [21]: An online tail-based sampling method that approximates the common-case behavior of a distributed system and biases sampling towards new traces that are poorly represented by this model (i.e., uncommon).
      • Differentiation from TraStrainer: Online, but focuses solely on diversity/uncommonness and doesn't consider system runtime state.
    • Sieve [16]: An online tail-based sampling approach that uses Robust Random Cut Forest (RRCF) to detect uncommon traces and samples them with a high probability. It considers both structure and latency.
      • Differentiation from TraStrainer: Online, considers more features than Sifter, but still primarily diversity-biased and does not integrate system runtime state.
    • TraceCRL [51]: Utilizes contrastive learning and Graph Neural Network (GNN) techniques to encode trace data into vectors, enabling sampling based on the diversity of these vector representations.
    • SampleHST [11]: Biases sampling towards edge-cases based on the distribution of mass scores obtained from a forest of Half Space Trees (HST).
    • Hindsight [52]: Introduces retroactive sampling, a technique that lazily retrieves trace data to perform biased sampling at earlier stages of the trace lifecycle. This aims to improve the efficiency of biased sampling by not waiting for the full trace completion before making preliminary decisions, though the final decision might still be tail-based.
      • Differentiation from TraStrainer: While Hindsight improves efficiency, it still primarily focuses on edge-cases (diversity) as its sampling preference, similar to other biased sampling methods.

Trace-based Analysis Approaches (RCA):

  • TraceAnomaly [26]: Leverages deep learning (specifically Variational Autoencoders - VAE) to learn normal trace patterns offline, then detects abnormal traces and identifies root causes online.
    • Differentiation from TraStrainer (as an RCA method): TraceAnomaly is an RCA method, while TraStrainer is a sampling method that feeds RCA methods. TraceAnomaly needs normal traces to learn patterns, which diversity-biased samplers might undersample.
  • TraceRCA [23]: Utilizes spectrum analysis to identify root cause services by analyzing the proportion of normal and abnormal invocations.
    • Differentiation from TraStrainer: An RCA method. Its reliance on spectrum analysis (comparing normal vs. abnormal) means it benefits from a balanced view of both common and uncommon traces.
  • MicroRank [45]: Identifies and ranks root causes by combining a personalized PageRank method and spectrum analysis.
    • Differentiation from TraStrainer: An RCA method. Similar to TraceRCA, it relies on a nuanced view of trace patterns.
  • Sage [10]: Utilizes causal Bayesian networks and graphical variational autoencoders to pinpoint root cause microservices.
  • FSF [35]: Utilizes knowledge of failure propagation and the client-server communication model to deduce root causes.

3.3. Technological Evolution

The evolution of trace sampling has progressed from simple uniform random sampling to more sophisticated biased sampling techniques. Initially, the primary concern was just reducing data volume. As microservice systems grew in complexity and the analytical value of traces became clearer, researchers sought ways to intelligently select traces that were most "interesting" or "informative." This led to the development of tail-based sampling, which could leverage the full context of a trace.

Early biased sampling methods (like HC, Sifter, Sieve) focused heavily on trace diversity and uncommonness. The rationale was that edge cases and anomalies are often the most valuable for debugging and understanding system behavior. Techniques like clustering or random cut forests were employed to identify these rare patterns.

However, a limitation in this evolution was the implicit assumption that "valuable" always meant "diverse" or "uncommon." The TraStrainer paper marks a significant step by introducing the concept that the "value" of a trace is not static but dynamically influenced by the broader system runtime state. It integrates real-time system metrics as a crucial context, suggesting that common-case traces can become highly relevant during specific system failures if they interact with an affected component. This represents a shift from purely trace-centric importance to a holistic system-aware importance definition for sampling. The idea of retroactive sampling (Hindsight) also shows an ongoing effort to improve the efficiency of biased sampling without sacrificing quality.

3.4. Differentiation Analysis

Compared to the main methods in related work, TraStrainer's core differences and innovations are:

  1. Multi-Modal Sampling Preference (System State + Trace Diversity):

    • Previous methods (HC, Sifter, Sieve, TraceCRL, SampleHST): Primarily rely on trace diversity or uncommonness as the sole criteria for sampling preference. They analyze the structure or latency within the trace itself to identify edge-cases.
    • TraStrainer: Uniquely combines system runtime state (derived from real-time system metrics) with trace diversity. This means it can dynamically adjust its sampling preferences. For example, it can prioritize common-case traces interacting with a database if database metrics show an anomaly.
  2. Adaptive and Dynamic Preference Adjustment:

    • Previous methods: Sampling preferences for uncommon traces are relatively static or based on historical trace patterns. They lack the ability to react to immediate system-wide issues that might make otherwise common traces suddenly critical.
    • TraStrainer: Uses a System Bias Extractor to adaptively determine sampling preferences by analyzing real-time fluctuations in system metrics. This allows it to dynamically shift its bias based on what the system is currently struggling with.
  3. Interpretable and Automated Trace Encoding:

    • Previous methods: Trace encoding often involves manual feature engineering or less interpretable vector representations.
    • TraStrainer: Proposes an automated trace encoding method guided by system metrics, ensuring interpretability of vector dimensions. Each dimension corresponds to a specific system metric, making it easier to understand why a trace is considered valuable. It also includes dimension scalability and reduction strategies.
  4. Dynamic Voting Mechanism for Combined Decision:

    • Previous methods: Do not have a mechanism to combine different types of sampling biases under a sampling budget.
    • TraStrainer: Introduces a Composite Sampler with a dynamic voting mechanism (AND/OR gate) that adjusts based on the current sampling frequency relative to the budget. This allows for a flexible balance between different biases while adhering to budget constraints.
  5. Improved Downstream RCA Performance:

    • Previous methods: While aiming for quality traces, their diversity-only bias can sometimes hinder RCA methods that require normal traces (e.g., for spectrum analysis) or a specific type of problem-related trace that might not be uncommon in its structure.

    • TraStrainer: By incorporating system state, it ensures that traces related to ongoing problems are captured, leading to significantly better performance in downstream RCA tasks by providing more actionable data.

      In essence, TraStrainer shifts the paradigm from merely identifying diverse or uncommon traces to identifying contextually valuable traces by actively incorporating system health information, thus providing a more intelligent and effective sampling strategy for complex microservice environments.

4. Methodology

4.1. Principles

The core idea behind TraStrainer is to implement biased trace sampling in a more comprehensive way than previous approaches. It addresses the limitation of existing methods that primarily focus on trace diversity by explicitly incorporating the system runtime state. The theoretical basis is that the "value" or "importance" of a trace is not static; it dynamically changes based on real-time system conditions and its structural and behavioral characteristics. Therefore, to capture higher-quality traces that are most useful for monitoring and root cause analysis, a sampling preference must consider both system anomalies and trace uniqueness.

TraStrainer is an adaptive online biased trace sampler. This means it operates continuously, making sampling decisions in real-time as traces arrive, and it can adjust its behavior based on observed system dynamics and trace patterns.

4.2. Core Methodology In-depth (Layer by Layer)

The overall architecture of TraStrainer is illustrated in Figure 4 of the original paper, comprising two main phases: runtime data preprocessing and comprehensive sampling.


4.2.1. Runtime Data Preprocessing Phase

This phase prepares the incoming trace and system metrics for sampling decisions.

4.2.1.1. Trace Encoder

The Trace Encoder module is responsible for converting raw distributed traces into a machine-readable vector representation. This is crucial because raw traces are complex, tree-like structures that cannot be directly processed by numerical sampling algorithms. A key design goal is interpretability and automation, meaning the encoding should be understandable and adapt to new metrics or trace structures without manual intervention.

The encoding process is divided into two parts: status-related and structure-related encoding.

Figure 5 of the original paper illustrates this encoding process in two parts: status-related and structure-related.

  • Encoding the Status-Related Part of the Trace: When the trace structure is disregarded, a trace can be viewed as a "bag of spans," denoted $S = \{s_1, ..., s_n\}$, where each span $s$ represents a unit of work. Each span contains information such as duration, status code, associated node or pod, and event annotations. Instead of relying on manual feature selection from event annotations, TraStrainer leverages system metrics to automatically determine relevant features. (A code sketch of this encoding appears after this list.)

    Definition 1 (Related Sub Span Bag $S_m$): For a metric $m$ (with attributes $m.node$ and $m.type$) and a trace $t$, let $S = \{s_1, ..., s_n\}$ be the span bag of $t$. We use $S_m = \{s_{m1}, ..., s_{mn}\}$ to denote the related sub span bag of $m$, where each $s_{mi}$ must satisfy $s_{mi}.node = m.node$ and $m.type$ is related to $s_{mi}.annotation$.

    • Explanation: This definition identifies all spans within a trace $t$ that are relevant to a specific system metric $m$. For example, if $m$ is "CPU usage on Node A," then $S_m$ would contain all spans from trace $t$ that executed on Node A and whose annotations (e.g., a SQL query) are related to the CPU usage metric type.

    Definition 2 (Feature Value $f_m$): For the related sub span bag $S_m = \{s_{m1}, ..., s_{mn}\}$ of the metric $m$, we use $S_a = \{s_{a1}, ..., s_{an}\}$ to denote the abnormal spans in $S_m$, where each $s_{ai}$ must satisfy that $s_{ai}.status$ is abnormal. The feature value $f_m$ of $m$ is calculated as follows: $$f_m = (|S_a| + 1) \cdot \sum_{i=1}^{n} s_{mi}.duration$$

    • Explanation: $f_m$ quantifies the "impact" of a trace on a specific metric $m$.
      • $|S_a|$: the number of abnormal spans within the related sub span bag $S_m$, capturing the severity of errors. Adding $+1$ ensures this factor is at least 1 even when there are no abnormal spans, preventing the total duration from being nullified.
      • $\sum_{i=1}^{n} s_{mi}.duration$: the sum of durations of all spans in $S_m$, representing the total time the trace spent interacting with the component relevant to metric $m$.
    • The product of these two terms gives a combined measure of the trace's significance with respect to metric $m$. This value $f_m$ becomes a dimension in the trace's vector representation.
  • Encoding the Structure-Related Part of the Trace: The trace structure, an invocation tree, shows the order and hierarchy of service operations. Since asynchronous calls can change the order of spans at the same level, the focus is on invocation hierarchy.

    • Each layer of the invocation tree is encoded as a feature.
    • Each span in a layer is represented by its parent span, method name, and argument (pma).
    • If a trace doesn't have a span at a certain depth, that position is filled with null.
  • Dimension Scalability and Reduction:

    • Scalability: Trace Encoder automatically expands vector dimensions when new metrics are added or new trace structures occur, by identifying related sub span bags and performing statistics.
    • Reduction: To prevent dimension explosion:
      • For status-related dimensions: If two metrics consistently have the same values for their dimensions over a period, those dimensions are merged.
      • For structure-related dimensions: If dimensions corresponding to deeper call hierarchies no longer occur in recent traces, those dimensions are removed.
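
To make the encoding concrete, here is a minimal Python sketch of the status-related part (the structure-related part would add one pma feature per invocation-tree layer). The Span/Metric types and the relatedness rule are simplified stand-ins; the paper derives metric-to-span relatedness automatically rather than from a hand-written table.

```python
from dataclasses import dataclass

@dataclass
class Span:
    node: str        # node/pod the span ran on
    annotation: str  # event annotation, e.g., "sql.query"
    duration: float  # milliseconds
    status: str      # "ok" or "abnormal"

@dataclass
class Metric:
    node: str
    mtype: str       # metric type, e.g., "cpu", "disk_io"

def is_related(metric_type: str, annotation: str) -> bool:
    # Hypothetical relatedness table; TraStrainer infers this automatically.
    related = {"cpu": {"sql.query", "http.request"}, "disk_io": {"sql.query"}}
    return annotation in related.get(metric_type, set())

def feature_value(metric: Metric, span_bag: list[Span]) -> float:
    """Definition 2: f_m = (|S_a| + 1) * sum of durations over S_m."""
    # Definition 1: the related sub span bag S_m.
    s_m = [s for s in span_bag
           if s.node == metric.node and is_related(metric.mtype, s.annotation)]
    abnormal = sum(1 for s in s_m if s.status == "abnormal")  # |S_a|
    return (abnormal + 1) * sum(s.duration for s in s_m)

# One such feature per monitored metric forms the status-related trace vector.
trace = [Span("node-a", "sql.query", 12.5, "ok"),
         Span("node-a", "sql.query", 40.0, "abnormal")]
print(feature_value(Metric("node-a", "cpu"), trace))  # (1 + 1) * 52.5 = 105.0
```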

4.2.1.2. System Bias Extractor

The System Bias Extractor module continuously monitors system metrics to identify which ones are currently exhibiting anomalous behavior and thus deserve more attention. It adaptively determines the preference weight for each metric based on its anomaly degree.

  • Assess the Anomaly Degree of Metrics: Each system metric $m$ is time-series data $m = \{(t_1, v_1), ..., (t_n, v_n)\}$. The goal is to calculate the anomaly degree $\alpha$ of each metric online at the current time $t_k$. This is done by comparing the current actual value $v_k$ with an expected value $v_k'$ derived from historical data within a look-back window $[t_1, ..., t_{k-1}]$.

    • The paper opts for a neural network model approach over statistical methods (like boxplot or 3-sigma detection) because neural networks can better learn historical waveform patterns in time-series data, which is important for business operations with periodic patterns.
    • TraStrainer uses a modified DLinear algorithm [49], a linear forecasting model inspired by recent advancements in time series forecasting. This model takes a historical time-series window as input and outputs the expected value $v_k'$ for the current time point.
    • The anomaly degree $\alpha$ is then calculated as the normalized difference between the actual value $v_k$ and the expected value $v_k'$: $$\alpha = \frac{|v_k' - v_k|}{\max(v_k', v_k)}$$
      • Explanation: This formula calculates the relative deviation of the actual metric value from its expected value.
        • $v_k'$: the expected value of the metric at time $t_k$, predicted by the DLinear model.
        • $v_k$: the actual observed value of the metric at time $t_k$.
        • $|v_k' - v_k|$: the absolute difference between the actual and expected values, indicating the magnitude of the anomaly.
        • $\max(v_k', v_k)$: normalizes the difference by the larger of the two values, yielding a relative anomaly degree between 0 and 1. A higher $\alpha$ indicates a greater anomaly.
      Figure 6 of the original paper illustrates this assessment: the trend and remainder of a historical data window are modeled linearly to compute the expected value, which is combined with the current value to obtain the anomaly degree $\alpha$ used to form the preference vector.

  • Form the Preference Vector: After obtaining the current anomaly degree $\alpha_i$ for each metric $m_i$, this value is directly used as the preference score $p_i$ for that metric. The preference scores for all metrics $M$ at the current moment form the preference vector $\mathcal{P} = [p_1, ..., p_n]$.

    • Explanation: This vector quantifies which system metrics are currently anomalous and, by extension, which trace dimensions (which correspond to these metrics) should be given higher sampling preference. (A sketch follows this list.)
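
As a concrete illustration, the sketch below computes the anomaly degree and assembles the preference vector. A simple moving average stands in for the paper's modified DLinear forecaster, and the metric names are hypothetical.

```python
import numpy as np

def anomaly_degree(history: np.ndarray, actual: float) -> float:
    """alpha = |v' - v| / max(v', v), where v' is the forecast value."""
    expected = float(history.mean())  # stand-in for the DLinear prediction
    denom = max(expected, actual)
    return abs(expected - actual) / denom if denom > 0 else 0.0

# Preference vector P: one anomaly degree per monitored metric.
windows = {"cpu@node-a": np.array([40.0, 42.0, 41.0, 43.0]),
           "mem@node-b": np.array([70.0, 71.0, 69.0, 70.0])}
current = {"cpu@node-a": 95.0, "mem@node-b": 70.0}
preference = np.array([anomaly_degree(windows[m], current[m]) for m in windows])
print(preference)  # high score for the spiking CPU metric, ~0 for stable memory
```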

4.2.2. Online Comprehensive Sampling Phase

This phase utilizes the trace vector and preference vector to make actual sampling decisions.

4.2.2.1. System-Biased Sampler

The System-Biased Sampler aims to prioritize traces that are more relevant to the detected system fluctuations (i.e., anomalous metrics). It calculates a sampling probability based on how much a coming trace interacts with the currently anomalous system components.

  • Calculate the Attention Score Vector: The sampler maintains a dynamic look-back window $\mathcal{W} = [t_1, ..., t_k]$ of the most recent $k$ trace vectors. Only the status-related part of each trace vector, denoted $t_i = [f_{1i}, ..., f_{ni}]$, is considered here. For each dimension $i$ (corresponding to a system metric), the mean $\mu_i$ and standard deviation $\sigma_i$ of the feature values $f_{ij}$ (for $j = 1...k$) from the previous $k$ trace vectors are calculated. When a new trace $t_{k+1}$ arrives with feature value $f_{i(k+1)}$ for dimension $i$, its attention score $a_i$ for that dimension is: $$a_i = \frac{|f_{i(k+1)} - \mu_i|}{\sigma_i}$$

    • Explanation: This formula measures how much the coming trace $t_{k+1}$ deviates from the historical average in resource utilization or impact for dimension $i$.
      • $f_{i(k+1)}$: the feature value of the coming trace $t_{k+1}$ for metric $i$.
      • $\mu_i$: the mean of feature values for metric $i$ from traces in the look-back window.
      • $\sigma_i$: the standard deviation of feature values for metric $i$ from traces in the look-back window.
    • A higher $a_i$ means the coming trace has a more unusual impact or resource utilization for metric $i$ compared to recent history. The attention scores for all dimensions form the attention score vector $\mathcal{A} = [a_1, ..., a_n]$.
  • Calculate the System-Biased Sampling Probability: A coming trace is considered more valuable if it has higher attention scores in dimensions that also have higher preference scores (from the System Bias Extractor). The system-biased sampling probability $p_s(t_{k+1})$ for the coming trace $t_{k+1}$ is calculated by taking the dot product of its attention score vector $\mathcal{A}(t_{k+1})$ and the current preference vector $\mathcal{P}$, and then applying a $\tanh$ function: $$p_s(t_{k+1}) = \frac{2}{1 + e^{-2\,\mathcal{P} \cdot \mathcal{A}(t_{k+1})}} - 1$$

    • Explanation:
      • $\mathcal{P} \cdot \mathcal{A}(t_{k+1})$: the dot product between the preference vector and the attention score vector. This sums the products of metric anomaly degrees and the trace's unusual impact on those metrics; a higher dot product means the trace is unusually active or impactful in currently anomalous system areas.
      • The expression $\frac{2}{1+e^{-2x}} - 1$ is exactly $\tanh(x)$, which maps $(-\infty, \infty)$ to $(-1, 1)$. However, the preference scores $p_i$ (anomaly degrees in $[0, 1]$) and the attention scores $a_i$ (absolute z-scores, hence $\geq 0$) are both non-negative, so the dot product is non-negative and the output falls in $[0, 1)$, which is why the paper can use it directly as a sampling probability. A sketch of this computation follows.
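
A minimal sketch of this computation, assuming the status-related vectors of the last $k$ traces are kept as a NumPy matrix (np.tanh is mathematically identical to the paper's $\frac{2}{1+e^{-2x}}-1$ form):

```python
import numpy as np

def system_biased_probability(window: np.ndarray, new_vec: np.ndarray,
                              preference: np.ndarray) -> float:
    """p_s = tanh(P . A) for the coming trace.

    window: (k, n) matrix of the last k status-related trace vectors.
    Both P and A are non-negative, so the result lies in [0, 1).
    """
    mu = window.mean(axis=0)
    sigma = window.std(axis=0) + 1e-9          # guard against zero variance
    attention = np.abs(new_vec - mu) / sigma   # a_i = |f_i(k+1) - mu_i| / sigma_i
    return float(np.tanh(preference @ attention))

window = np.array([[100.0, 5.0], [110.0, 6.0], [95.0, 5.5]])
# A trace with an extreme value on an anomalous dimension gets p_s close to 1.
print(system_biased_probability(window, np.array([400.0, 5.5]),
                                np.array([0.8, 0.0])))
```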

4.2.2.2. Diversity-Biased Sampler

The Diversity-Biased Sampler aims to identify edge-case traces (uncommon or unique) and assign them a higher sampling probability. This is achieved by comparing a coming trace to clusters of previously observed traces.

  • Cluster Traces within the Look-Back Window: To estimate the uncommon degree of a coming trace, patterns are first established from previous traces by clustering the trace vectors within the look-back window $\mathcal{W} = [t_1, ..., t_k]$.

    • Assuming $n$ clusters $C = \{c_1, ..., c_n\}$ are obtained, the mass of each cluster (the number of traces it contains) is calculated, denoted $\mathcal{M}a = \{ma_1, ..., ma_n\}$.
  • Calculate the Diversity-Biased Sampling Probability: For the coming trace $t_{k+1}$, its Jaccard similarity [29] with each cluster is calculated. The cluster with the highest similarity is identified as the closest cluster $c_{k+1}'$, with mass $ma_{k+1}'$.

    • A smaller $ma_{k+1}'$ (the closest cluster is less common) and a smaller Jaccard similarity $si(t_{k+1})$ (the trace is dissimilar even to its closest cluster) both indicate higher uncommonness for $t_{k+1}$.
    • The diversity-biased sampling probability $p_d(t_{k+1})$ is calculated as: $$p_d(t_{k+1}) = \frac{\frac{1}{ma_{k+1}' \cdot si(t_{k+1})}}{\sum_{i=1}^{k+1} \frac{1}{ma_i' \cdot si(t_i)}}$$
    • Explanation:
      • $ma_{k+1}'$: the mass (number of traces) of the closest cluster to $t_{k+1}$.
      • $si(t_{k+1})$: the Jaccard similarity between $t_{k+1}$ and its closest cluster $c_{k+1}'$.
      • The term $\frac{1}{ma_{k+1}' \cdot si(t_{k+1})}$ is the uncommonness score of the coming trace; a smaller mass or similarity yields a larger score.
      • The denominator $\sum_{i=1}^{k+1} \frac{1}{ma_i' \cdot si(t_i)}$ normalizes over the uncommonness scores of all traces in the look-back window plus the current trace, ensuring $p_d(t_{k+1})$ is a probability between 0 and 1 relative to other traces. (A short sketch follows.)
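
A minimal sketch of the diversity-biased probability, assuming traces are represented as feature sets and each cluster is summarized as a (representative set, mass) pair; the paper's actual clustering over trace vectors is abstracted away here.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def uncommonness(trace: set, clusters: list[tuple[set, int]]) -> float:
    """1 / (mass of closest cluster * similarity to it); higher = more uncommon."""
    sim, mass = max(((jaccard(trace, rep), m) for rep, m in clusters),
                    key=lambda pair: pair[0])
    return 1.0 / (mass * max(sim, 1e-9))  # epsilon avoids division by zero

def diversity_biased_probability(trace: set, window: list[set],
                                 clusters: list[tuple[set, int]]) -> float:
    # Normalize the coming trace's uncommonness score against those of the
    # look-back window traces plus itself, as in the p_d formula above.
    scores = [uncommonness(t, clusters) for t in window]
    scores.append(uncommonness(trace, clusters))
    return scores[-1] / sum(scores)

clusters = [({"a", "b", "c"}, 40), ({"x", "y"}, 3)]       # (representative, mass)
window = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}]
print(diversity_biased_probability({"x"}, window, clusters))  # rare trace, high p_d
```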

4.2.2.3. Composite Sampler

The Composite Sampler makes the final sampling decision by combining the system-biased sampling probability $p_s(t)$ and the diversity-biased sampling probability $p_d(t)$, while also respecting the sampling budget.

  • Decision Generation: For a coming trace $t$, after the System-Biased Sampler and the Diversity-Biased Sampler provide their respective probabilities $p_s(t)$ and $p_d(t)$, TraStrainer generates a random number in $[0, 1]$.

    • If $p_s(t)$ is greater than the random number, the system-biased sample result is "True"; otherwise, "False".
    • Similarly, if $p_d(t)$ is greater than the random number, the diversity-biased sample result is "True"; otherwise, "False".
  • Dynamic Voting Mechanism: To keep the overall sampling rate aligned with the expected budget, a dynamic voting mechanism tracks the recent sampling frequency $\theta$ within a look-back window and compares it to the budget sampling rate $\beta$.

Figure 7 of the original paper shows the dynamic voting mechanism, which compares the current sampling rate $\theta$ with the budget sampling rate $\beta$ and uses an AND/OR gate to combine the sampling results from the two previous samplers.

  • If $\theta > \beta$ (sampling too much): stricter sampling rules are enforced. An AND gate is used as the voting mechanism: TraStrainer samples the trace only if both the System-Biased Sampler and the Diversity-Biased Sampler yield a "True" result.
  • If $\theta < \beta$ (sampling too little): sampling rules are relaxed. An OR gate is used: TraStrainer samples the trace if at least one of the samplers produces a "True" result.
  • The final sampling decision determines whether the coming trace $t$ is stored or discarded. This dynamic adjustment keeps TraStrainer within the sampling budget while maximizing the capture of valuable traces according to the current system state and diversity. (A sketch of the gate follows.)
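
The budget-aware gate can be sketched in a few lines. The shared random draw and the AND/OR switch follow the description above; maintaining the recent sampling frequency $\theta$ over a look-back window is left to the caller.

```python
import random

def composite_decision(p_s: float, p_d: float,
                       recent_rate: float, budget: float) -> bool:
    """Combine both samplers' votes under a budget-aware AND/OR gate.

    recent_rate: sampling frequency theta over the look-back window.
    budget: the budget sampling rate beta.
    """
    r = random.random()            # one random draw shared by both votes
    system_vote = p_s > r
    diversity_vote = p_d > r
    if recent_rate > budget:       # sampling too much: stricter AND gate
        return system_vote and diversity_vote
    return system_vote or diversity_vote  # otherwise: relaxed OR gate
```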

5. Experimental Setup

5.1. Datasets

The authors constructed two datasets, $\mathcal{A}$ and $\mathcal{B}$, to evaluate TraStrainer.

  • Dataset $\mathcal{A}$ Setup:

    • Source: Real-world data from 13 production microservice systems of Huawei.
    • Scale: Involves 284 services and 1,327 nodes.
    • Characteristics: Trace and metric data were collected from 62 incidents between April 2023 and August 2023, yielding 62 batches. The incidents covered fault types such as high CPU load, network delay, slow SQL execution, failed third-party package calls, code logic anomalies, and thread pool exhaustion.
    • Volume: 909,797 traces and 121 metrics in total.
    • Labeling: SREs (Site Reliability Engineers) and technical experts annotated uncommon and problem-related traces, with an average label rate of 2.5%. They also annotated the actual root cause of each batch to evaluate downstream root cause analysis accuracy.
    • Effectiveness for Validation: Coming from production systems with expert annotations, this dataset offers high realism, making it effective for validating the method's ability to identify problem-related traces and improve RCA.

  • Dataset $\mathcal{B}$ Setup:

    • Source: Derived from two widely used open-source microservice benchmarks: OnlineBoutique [12] and TrainTicket [9].
    • Deployment: Deployed on a Kubernetes [1] platform with 12 virtual machines (each with an 8-core 2.10 GHz CPU, 16 GB memory, and Ubuntu 18.04).
    • Trace Collection: The OpenTelemetry Collector [31] was used to collect traces, which were stored in Grafana Tempo [13].
    • Fault Injection: To simulate latency or reliability issues, 56 faults were injected using Chaosblade [4], covering CPU contention, CPU consumed, network delay, code exception, and error return.
    • Volume: Trace and metric data were collected for 56 batches, totaling 112,000 traces and 32 metrics.
    • Labeling: Uncommon and problem-related traces were annotated based on the injected fault positions and types, with an average label rate of 5.0%.
    • Effectiveness for Validation: The injected faults provide a controlled environment with clear ground truth for problem-related traces and root causes, complementing the real-world complexity of Dataset $\mathcal{A}$.

The following are the results from Table 1 of the original paper:

| Dataset | Microservice Benchmark | Trace Number | Metric Number | Batch Number | Uncommon Label Rate | Problem-related Label Rate | Fault Types Number |
|---|---|---|---|---|---|---|---|
| $\mathcal{A}$ | 13 production microservice systems | 909,797 | 121 | 62 | 2.5% | 2.5% | 6 |
| $\mathcal{B}$ | OnlineBoutique and TrainTicket | 112,000 | 32 | 56 | 5.0% | 5.0% | 5 |
5.2. Evaluation Metrics

The evaluation metrics assess both the quality of the sampled traces and the effectiveness of downstream root cause analysis.

  • Metrics for Sampling Quality:

    1. Proportion (PRO):
      • Conceptual Definition: Reflects the ability of a sampling approach to bias towards specific types of valuable traces (e.g., uncommon, problem-related). It quantifies what percentage of the total labeled valuable traces were successfully captured by the sampler; a higher PRO indicates better targeting of valuable traces.
      • Mathematical Formula: $$PRO = \frac{t}{T}$$
      • Symbol Explanation:
        • $T$: the total number of labeled traces of a specific type (e.g., uncommon, problem-related) in the original, unsampled dataset.
        • $t$: the number of labeled traces of that type that were successfully sampled and retained.
      • Variants: The paper uses three variants: the proportion of uncommon traces, of problem-related traces, and of traces that are both uncommon and problem-related.
    2. Diversity (DIV):
      • Conceptual Definition: Reflects how representative the sampled traces are in covering different execution paths or patterns. A higher DIV suggests a broader representation of trace types.
      • Mathematical Formula: $$DIV = m$$
      • Symbol Explanation:
        • $m$: the number of distinct trace patterns identified by clustering the sampled traces based on their execution path.
  • Metrics for Downstream Root Cause Analysis (RCA) Effectiveness: These metrics are standard for evaluating RCA performance.

    1. Top-k Accuracy (A@k):
      • Conceptual Definition: Measures the probability that the true root cause of an issue appears within the top $k$ ranked root cause candidates returned by the RCA method. Higher values indicate better accuracy and ranking.
      • Mathematical Formula: $$A@k = \frac{1}{|I|} \sum_{i=1}^{|I|} (rc_i \in Rank_i^k)$$
      • Symbol Explanation:
        • $I$: the set of issues (incidents) being analyzed; $|I|$ is the number of issues.
        • $rc_i$: the true root cause of the $i$-th issue.
        • $Rank_i^k$: the list of the top $k$ root cause candidates returned by the RCA method for the $i$-th issue.
        • $(rc_i \in Rank_i^k)$: an indicator function that equals 1 if the true root cause $rc_i$ is found within $Rank_i^k$, and 0 otherwise.
    2. Mean Reciprocal Rank (MRR):
      • Conceptual Definition: Evaluates ranking quality via the inverse of the rank at which the true root cause appears; the higher the true root cause is ranked, the higher the score. If the true root cause is not found in the result list, its reciprocal rank is zero.
      • Mathematical Formula: $$MRR = \frac{1}{|I|} \sum_{i=1}^{|I|} \frac{1}{r_i}$$
      • Symbol Explanation:
        • $r_i$: the rank of the true root cause in the candidate list for the $i$-th issue. If the true root cause is not found, $r_i$ is taken as infinity and $\frac{1}{r_i}$ becomes 0.
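
A minimal sketch of the two RCA metrics, assuming each issue comes with a ground-truth root cause and a ranked candidate list (the service names are hypothetical):

```python
def top_k_accuracy(true_causes: list[str], rankings: list[list[str]], k: int) -> float:
    """A@k: fraction of issues whose true root cause is in the top-k candidates."""
    hits = sum(rc in ranks[:k] for rc, ranks in zip(true_causes, rankings))
    return hits / len(true_causes)

def mean_reciprocal_rank(true_causes: list[str], rankings: list[list[str]]) -> float:
    """MRR: mean of 1/rank of the true root cause (0 when it is absent)."""
    total = 0.0
    for rc, ranks in zip(true_causes, rankings):
        if rc in ranks:
            total += 1.0 / (ranks.index(rc) + 1)  # ranks are 1-based
    return total / len(true_causes)

causes = ["svc-db", "svc-api"]
ranked = [["svc-db", "svc-cache"], ["svc-web", "svc-api", "svc-db"]]
print(top_k_accuracy(causes, ranked, 1))     # 0.5
print(mean_reciprocal_rank(causes, ranked))  # (1/1 + 1/2) / 2 = 0.75
```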
5.3. Baselines

The paper compares TraStrainer against several baseline sampling approaches and evaluates its impact on various downstream root cause analysis approaches.

5.3.1. Baseline Sampling Approaches

These methods represent different trace sampling strategies, from simple random sampling to diversity-biased sampling.

  • Random: A head-based sampling approach where each trace is captured with an equal, fixed probability at its inception.
    • Representativeness: The simplest sampling strategy, serving as a lower bound for sampling quality.
  • HC [20]: A tail-based sampling approach that uses hierarchical clustering to group traces based on label counting (e.g., error status, service invocation counts) and then samples to maximize diversity.
    • Representativeness: An offline method that was an early attempt at diversity-biased sampling.
  • Sifter [21]: An online tail-based sampling approach. It learns an approximation of the distributed system's common-case behavior and samples new traces based on how poorly this common-case model represents them (i.e., how uncommon they are).
    • Representativeness: A state-of-the-art online diversity-biased sampler focused on uncommonness.
  • Sieve [16]: Another online tail-based sampling approach. It employs Robust Random Cut Forest (RRCF) to detect uncommon traces and samples them with high probability, considering both trace structure and latency.
    • Representativeness: A state-of-the-art online diversity-biased sampler, more sophisticated than Sifter in detecting uncommonness.

5.3.2. Baseline Downstream Analysis Approaches

These are state-of-the-art trace-based RCA methods used to assess the practical value of the sampled traces.

  • TraceAnomaly [26]: Leverages deep learning, specifically Variational Autoencoders (VAE), to learn normal trace patterns, then detects abnormal traces and identifies root causes by observing deviations from the learned patterns.
    • Representativeness: A deep-learning-based RCA method sensitive to the availability of normal traces for training.
  • TraceRCA [23]: Uses spectrum analysis to pinpoint root cause services by examining the proportion of normal and abnormal invocations flowing through different services.
    • Representativeness: A statistical RCA method that benefits from a balanced collection of both normal and abnormal traces.
  • MicroRank [45]: Identifies and ranks root causes by combining a personalized PageRank algorithm with spectrum analysis, propagating anomaly scores through the service dependency graph.
    • Representativeness: A graph-based RCA method combined with spectrum analysis, also sensitive to the quality and balance of the sampled traces.
5.3.3. Variants of TraStrainer (Ablation Studies)

To evaluate the individual contributions of its components, TraStrainer is also compared against its own variants:

  • TraStrainer w/o M (without System Runtime State): Considers only trace diversity when setting sampling preferences, effectively using only the Diversity-Biased Sampler's results as the final outcome.
    • Purpose: To isolate the contribution of system runtime state to TraStrainer's performance.
  • TraStrainer w/o D (without Trace Diversity): Considers only the system runtime state when setting sampling preferences, using only the System-Biased Sampler's results as the final outcome.
    • Purpose: To isolate the contribution of trace diversity to TraStrainer's performance.

5.4. Warm-up Period

Online samplers make essentially random sampling decisions in their initial phase because their internal models are randomly initialized. To ensure fair comparison and stable results, all online samplers (including TraStrainer, Sifter, Sieve, and the TraStrainer variants) undergo a warm-up period using 10% of the traces before actual testing and evaluation commence, allowing their models to stabilize and learn initial patterns.

6. Results & Analysis

6.1. Core Results Analysis

The experimental evaluation answers four research questions (RQs) regarding TraStrainer's sampling quality, its effectiveness for downstream root cause analysis, the contribution of combining system state and trace diversity, and its efficiency.

6.1.1. RQ1: Sampling Quality of TraStrainer

This RQ evaluates TraStrainer's ability to bias sampling towards problem-related and edge-case traces (bias sampling) and its capacity to prioritize underrepresented and abnormal request types (representative sampling). Experiments were conducted on both Dataset $\mathcal{A}$ and Dataset $\mathcal{B}$ across various budget sampling rates (0.1%, 1%, 2.5%, 5%, 10%).

Figure 8 of the original paper shows the proportions of uncommon, related, and uncommon-related traces sampled by different approaches across different budget settings in the two datasets.

6.1.1.1. Bias Sampling

  • Bias towards Uncommon Traces:
    • Figure 8 (a1) and (b1) show that TraStrainer and Sieve exhibit similarly strong performance in sampling uncommon traces, significantly outperforming Random, HC, and Sifter.
    • Random sampling achieves a PRO roughly equal to the budget.
    • HC and Sifter perform poorly (PRO below 0.7) because they primarily consider trace structure and cannot identify uncommon traces that differ mainly in execution time or latency.
    • At low budgets, TraStrainer slightly outperforms Sieve because its metric-based encoding effectively distinguishes anomalies.
    • When the budget equals the label rate (2.5% in $\mathcal{A}$, 5% in $\mathcal{B}$), TraStrainer's PRO can be slightly lower than Sieve's. This is attributed to TraStrainer's dual bias (towards both uncommon and problem-related traces), where problem-related traces can sometimes include common-case ones.
    • When the budget exceeds twice the label rate, both TraStrainer and Sieve achieve a PRO of 1, meaning all uncommon traces are captured.

  • Bias towards Problem-related Traces:
    • Figure 8 (a2) and (b2) demonstrate TraStrainer's significant superiority in sampling problem-related traces across both datasets.
    • Sieve, Sifter, and HC show PRO growth similar to Random as the budget increases, with PRO remaining below 0.5. This highlights their inability to prioritize problem-related traces, as they do not consider system runtime state.
    • By incorporating system runtime state, TraStrainer achieves a PRO of around 0.6 when the budget equals the label rate and approaches 1.0 when the budget is twice the label rate, capturing almost all problem-related traces.

  • Bias towards Uncommon Problem-related Traces:
    • Figure 8 (a3) and (b3) show that TraStrainer significantly outperforms the baselines, even at lower budgets.
    • When the budget rate equals the label rate (1% in $\mathcal{A}$, 2.5% in $\mathcal{B}$), TraStrainer achieves a PRO above 0.9, whereas the baselines remain below 0.55. This indicates TraStrainer's strong preference for the most critical traces: those that are both uncommon and problem-related.

6.1.1.2. Representative Sampling

  • Diversity of the Sampling Result: The paper reports diversity results in its Table 2 (not reproduced in the provided text). TraStrainer w/o M (pure diversity) achieves the best DIV on both datasets, validating TraStrainer's metric-based trace encoding for distinguishing trace patterns and ensuring fair sampling; Sieve follows closely. TraStrainer's own DIV is slightly lower (less than a 5% difference from TraStrainer w/o M) because it also prioritizes problem-related traces, which can reduce pure diversity. HC and Sifter have noticeably lower DIV because their encodings neglect latency. TraStrainer w/o D (no diversity component) shows no clear pattern in DIV.

  • Representative Ability (API Distribution): The paper analyzed the distribution of different APIs in the sampling results for one batch from each dataset.
    • In Dataset $\mathcal{A}$, 8 high-level API calls were present, with API-4 and API-8 identified as exceptional interfaces.
    • In Dataset $\mathcal{B}$, 5 high-level API calls were present, with API-4 considered an exceptional interface.

Figure 9 of the original paper shows the distribution of different APIs in the sampling results.

    • Figure 9 shows that TraStrainer enhances the representation of less frequently used APIs, leading to more balanced sampling. Critically, it also improves the representation of APIs that are related to system issues and are deteriorating, thanks to its consideration of system state.
    • HC and Sifter only slightly increase low-frequency API representation.
    • Sieve achieves balanced sampling results but fails to enhance the representation of deteriorating APIs. This demonstrates TraStrainer's ability to capture contextually important traces beyond mere statistical uncommonness.

### 6.1.2. RQ2: Effectiveness of TraStrainer for Downstream Root Cause Analysis

This RQ evaluates how `sampling strategies` impact `RCA effectiveness`. Sampled traces from various methods were fed as input to three `state-of-the-art RCA methods` (`TraceAnomaly`, `TraceRCA`, `MicroRank`). Experiments were conducted at `budget sampling rates` of 0.1%, 1.0%, and 2.5%.
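
For reference, `A@k` and `MRR` are taken here in their standard forms over the ranked root-cause lists each `RCA` method produces; the sketch below uses those standard definitions with illustrative variable names.

```python
def a_at_k(rankings, culprits, k):
    """Top-k accuracy: fraction of cases whose true root cause appears in the top k."""
    hits = sum(1 for ranked, truth in zip(rankings, culprits) if truth in ranked[:k])
    return hits / len(rankings)

def mrr(rankings, culprits):
    """Mean Reciprocal Rank: mean of 1/rank of the true root cause (0 if absent)."""
    total = 0.0
    for ranked, truth in zip(rankings, culprits):
        total += 1.0 / (ranked.index(truth) + 1) if truth in ranked else 0.0
    return total / len(rankings)

rankings = [["svc-B", "svc-A"], ["svc-A", "svc-C"]]  # ranked suspects per failure case
culprits = ["svc-A", "svc-C"]                        # ground-truth root cause per case
print(a_at_k(rankings, culprits, 1))  # 0.0: no case ranks the culprit first
print(a_at_k(rankings, culprits, 3))  # 1.0: both culprits appear in the top 3
print(mrr(rankings, culprits))        # 0.5: culprit ranked second in both cases
```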

The following are the results from Table 3 of the original paper (`A@1` and `A@3` are percentages; `MRR` lies in [0, 1]):

| RCA Approach | Sampling Approach | A@1 (0.1%) | A@1 (1.0%) | A@1 (2.5%) | A@3 (0.1%) | A@3 (1.0%) | A@3 (2.5%) | MRR (0.1%) | MRR (1.0%) | MRR (2.5%) |
|---|---|---|---|---|---|---|---|---|---|---|
| TraceAnomaly | Random | 10.71 | 9.26 | 16.67 | 44.73 | 57.41 | 57.41 | 0.2820 | 0.3503 | 0.3997
| | HC | 9.26 | 12.96 | 18.52 | 37.04 | 42.59 | 55.56 | 0.2590 | 0.3664 | 0.3747
| | Sifter | 11.67 | 24.07 | 16.67 | 37.04 | 57.41 | 62.96 | 0.2753 | 0.4025 | 0.4145
| | Sieve | 8.81 | 18.52 | 29.63 | 44.44 | 53.70 | 57.41 | 0.2620 | 0.3762 | 0.4383
| | TraStrainer w/o M | 11.67 | 20.37 | 22.22 | 45.15 | 51.85 | 55.56 | 0.2903 | 0.3722 | 0.4068
| | TraStrainer w/o D | 12.96 | 38.89 | 44.44 | 49.81 | 77.78 | 75.93 | 0.3485 | 0.5948 | 0.6247
| | **TraStrainer** | **46.30** | **51.61** | **54.84** | **66.67** | **79.19** | **87.10** | **0.5707** | **0.6438** | **0.7151**
| TraceRCA | Random | 7.41 | 20.37 | 29.63 | 40.74 | 61.11 | 68.52 | 0.2525 | 0.4123 | 0.4991
| | HC | 9.26 | 24.07 | 24.07 | 46.30 | 62.96 | 62.96 | 0.2546 | 0.4324 | 0.4627
| | Sifter | 8.67 | 19.63 | 25.19 | 37.04 | 55.56 | 61.11 | 0.2449 | 0.4272 | 0.4836
| | Sieve | 18.52 | 31.48 | 38.89 | 42.59 | 51.85 | 57.41 | 0.3008 | 0.4157 | 0.4873
| | TraStrainer w/o M | 18.52 | 33.33 | 35.19 | 44.44 | 55.56 | 55.56 | 0.3191 | 0.4432 | 0.4642
| | TraStrainer w/o D | 24.07 | 55.56 | 55.56 | 38.89 | 81.48 | 77.78 | 0.3650 | 0.6880 | 0.6843
| | **TraStrainer** | **55.56** | **55.56** | **58.06** | **70.37** | **85.19** | **89.63** | **0.6265** | **0.7019** | **0.7510**
| MicroRank | Random | 5.56 | 16.67 | 27.78 | 20.37 | 50.00 | 61.11 | 0.1571 | 0.3423 | 0.4352
| | HC | 7.41 | 18.52 | 22.22 | 27.78 | 46.30 | 51.85 | 0.1954 | 0.3398 | 0.3731
| | Sifter | 5.56 | 18.52 | 27.78 | 23.42 | 46.30 | 61.11 | 0.1605 | 0.3414 | 0.4358
| | Sieve | 9.26 | 25.83 | 35.19 | 20.37 | 58.15 | 62.96 | 0.1657 | 0.4246 | 0.4963
| | TraStrainer w/o M | 12.96 | 16.67 | 24.07 | 42.59 | 42.59 | 55.56 | 0.2994 | 0.3241 | 0.4012
| | TraStrainer w/o D | 29.63 | 42.59 | 46.30 | 74.04 | 68.52 | 72.22 | 0.5228 | 0.5463 | 0.5509
| | **TraStrainer** | **42.59** | **45.16** | **50.00** | **77.74** | **78.52** | **82.26** | **0.5509** | **0.5889** | **0.6556**

*   **Overall Performance:** `TraStrainer` consistently achieves the best results across all `RCA methods` and `metrics` (`A@1, A@3`, and `MRR`), particularly at lower `sampling budgets`. This highlights its ability to provide higher `quality sampled data` for `downstream analysis`.

*   **Impact on `TraceAnomaly`:**
    *   `TraStrainer` significantly improved `A@1`, `A@3`, and `MRR` by 33.78%, 26.88%, and 31.45% respectively (over the best baseline).
    *   `TraceAnomaly`, being `VAE-based`, needs `normal traces` to learn patterns. Baselines (`HC`, `Sifter`, `Sieve`) performed poorly at low budgets, sometimes worse than `Random`, because their `diversity-only bias` led to predominantly `abnormal traces`, making it hard for `VAE` to learn `normal patterns`.
    *   `TraStrainer` performs well even at low budgets because it samples both `abnormal` and `problem-related common traces`, providing a more balanced view that aids `VAE` in learning. At 2.5% budget, baselines' `A@1` was below 30%, while `TraStrainer` reached 54.84% `A@1` and 0.7151 `MRR`.

*   **Impact on `TraceRCA` and `MicroRank`:**
    *   These methods use `spectrum analysis` (comparing `normal` vs. `abnormal invocations`). `TraStrainer` improved `A@1`, `A@3`, and `MRR` by 31.48%, 38.34%, and 28.96% respectively (over the best baseline).
    *   `HC` and `Sifter` performed similarly to `Random` because they only consider `trace structure` during `encoding`, providing no assistance to `latency-based RCA`.
    *   `Sieve` performed poorly at lower budgets because `spectrum analysis` needs both `normal` and `anomalous traces`, and `Sieve's` strong `uncommonness bias` might not supply enough `normal traces` (the scoring sketch after this list makes this dependency concrete).
    *   `TraStrainer` consistently achieved good results across budgets, maintaining `A@3` above 70%. This is because its `comprehensive sampling preferences` ensure that even at low budgets, it discards `traces unrelated to the problem` rather than `valuable traces` for `spectrum analysis`, enabling better `RCA performance`.
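
To make this dependency concrete, below is a minimal spectrum-style scoring sketch using the classic Ochiai formula as a stand-in (`TraceRCA` and `MicroRank` use their own, more elaborate scoring). If the sample contains no `normal traces`, the passing-coverage count `n_cs` is zero for every service, the pass-based signal vanishes, and services covered by every failure become indistinguishable from the true culprit.

```python
import math

def ochiai(n_cf, n_uf, n_cs):
    """Ochiai suspiciousness; needs failing (cf/uf) and passing (cs) coverage counts."""
    denom = math.sqrt((n_cf + n_uf) * (n_cf + n_cs))
    return n_cf / denom if denom else 0.0

def rank_services(traces):
    """traces: list of (covered_services, is_abnormal) pairs -> services by suspiciousness."""
    services = {s for covered, _ in traces for s in covered}
    n_fail = sum(1 for _, abnormal in traces if abnormal)
    scores = {}
    for s in services:
        n_cf = sum(1 for covered, abnormal in traces if abnormal and s in covered)
        n_cs = sum(1 for covered, abnormal in traces if not abnormal and s in covered)
        scores[s] = ochiai(n_cf, n_fail - n_cf, n_cs)
    return sorted(services, key=scores.get, reverse=True)

traces = [({"A", "B"}, True), ({"A", "C"}, True), ({"B", "C"}, False)]
print(rank_services(traces))  # 'A' first: it covers every failure and no normal trace
```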

### 6.1.3. RQ3: Contribution of the Combination of System Status and Trace Diversity

This RQ uses `ablation experiments` with `TraStrainer w/o M` (diversity only) and `TraStrainer w/o D` (system state only) to analyze the impact of combining `system state` and `trace diversity`.

*   **API Distribution (Representative Ability):**
    The following figure (Figure 10 from the original paper) shows the distribution of different APls in the sampling results of different variants:

    *(Figure 10: distribution of different APIs in the sampling results of the TraStrainer variants.)*

    *   Figure 10 illustrates the distribution of different `APIs` in the `sampled results` for Dataset $\mathcal{B}$.
    *   `TraStrainer w/o D` shows a significant imbalance: the deteriorating interface API-4 exceeds a 70% distribution ratio, while the underrepresented API-5 is completely absent. Focusing only on `system state` can thus oversample `problem-related APIs` to the detriment of `diversity`.
    *   `TraStrainer w/o M` produces relatively balanced sampling results on average but fails to bias towards the more crucial, deteriorating API-4, confirming that `diversity` alone might miss contextually important traces.
    *   `TraStrainer` (full version) effectively enhances `diversity` while showing a stronger inclination towards the problematic API.

*   **Effectiveness in Downstream Analysis (Table 3):**
    *   `TraStrainer w/o D` (system state only) generally performs better than `TraStrainer w/o M` (diversity only). This suggests that prioritizing `problem-related traces` (via `system state`) is more beneficial for `RCA` tasks than solely focusing on `uncommon traces`.
    *   At a low budget (0.1%), `TraStrainer w/o D` has a significantly lower `A@1` than `TraStrainer`, because its sampling preference (problem-relatedness) does not explicitly bias towards `uncommon abnormal traces`, potentially missing some `root cause traces` at very low budgets.
    *   At a high budget (2.5%), `TraStrainer w/o D` approaches the performance of `TraStrainer`: with more budget, it eventually captures most `problem-related traces`, including the culprits.
    *   `TraStrainer` (full version), by considering both `system state` and `trace diversity`, achieves superior analysis performance across all budgets. This empirically validates the paper's core motivation that a comprehensive sampling preference combining both factors is most effective.

### 6.1.4. RQ4: Efficiency of TraStrainer

This RQ evaluates `TraStrainer's` efficiency for `online sampling`. The sampler component was deployed on an Intel Xeon Gold 5318Y 2.10GHz CPU in single-process form.

The following figure (Figure 11 from the original paper) shows the efficiency validation of TraStrainer:

![Fig. 11. Efficiency Validation of TraStrainer.](/files/papers/6901a5fc84ecf5fffe471766/images/11.jpg)
*Figure 11 validates the efficiency of the TraStrainer framework. Panel (a) compares the cumulative distribution function (CDF) of sampling latency for TraStrainer against Sifter and Sieve; panels (b) and (c) show the sensitivity of sampling latency to the number of spans per trace and to the number of metrics, respectively.*

*   **Sampling Latency Distribution (Figure 11a):**
    *   `TraStrainer` generally exhibits smaller `sampling latencies` (0.28ms to 14.29ms) compared to baselines.
    *   Its maximum latency is 40.82% to 50.16% lower than baselines, demonstrating its suitability for `online deployment` with low latency requirements.
*   **Impact of Trace Size (Figure 11b):**
    *   With the `metric number` fixed at 75, `sampling latency` increases linearly with `trace size` (number of `spans`).
    *   However, the maximum latency remains below 15ms, indicating good scalability with trace complexity.
*   **Impact of Metric Number (Figure 11c):**
    *   With `trace size` fixed at 50, `sampling latency` increases linearly while the `metric number` is low.
    *   Crucially, once the `metric number` exceeds 100, the growth rate of `sampling latency` slows considerably. This is attributed to `TraStrainer's` `dimension reduction` strategies, which prevent `dimension explosion` and maintain efficiency even with a large number of `system metrics`.
*   **Conclusion on Efficiency:** `TraStrainer` is efficient enough for `online sampling`, and its `dimension reduction` strategy effectively handles large numbers of `metrics` (a small timing-harness sketch follows below).
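
A per-trace timing harness for reproducing this kind of latency measurement could look like the sketch below; the `toy_sampler` and trace schema are stand-ins, not `TraStrainer's` actual decision function.

```python
import random
import statistics
import time

def measure_sampling_latency(sampler, traces):
    """Time each per-trace sampling decision in milliseconds (hypothetical harness)."""
    latencies_ms = []
    for trace in traces:
        start = time.perf_counter()
        sampler(trace)  # the decision under test
        latencies_ms.append((time.perf_counter() - start) * 1e3)
    return latencies_ms

# Toy stand-in sampler whose cost grows with the span count,
# mirroring the linear trend reported for Figure 11(b).
toy_sampler = lambda trace: sum(span["duration"] for span in trace) > 100

traces = [[{"duration": random.random()} for _ in range(n)] for n in (10, 50, 200)]
lat = measure_sampling_latency(toy_sampler, traces)
print(f"median={statistics.median(lat):.4f}ms max={max(lat):.4f}ms")
```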

## 6.2. Data Presentation (Tables)

The tables from the paper have been transcribed and presented in the relevant sections above (Table 1 in Datasets, Table 3 in RCA Effectiveness).

## 6.3. Ablation Studies / Parameter Analysis

The ablation studies are presented as part of RQ3 (Contribution of the Combination of System Status and Trace Diversity). They highlight the individual and combined impact of the `system state` and `trace diversity` components. The key findings are:

*   Both components contribute to `TraStrainer's` overall superiority.
*   The `system state` component (`TraStrainer w/o D`) is particularly strong at identifying `problem-related traces` and aiding `RCA`.
*   The `diversity` component (`TraStrainer w/o M`) ensures balanced representation of `APIs` and helps capture `uncommon patterns`.
*   The combination in the full `TraStrainer` provides the best of both worlds, leading to robust performance across different budgets and `RCA` tasks. This confirms that the synergy between `system-state bias` and `diversity bias` is critical for `TraStrainer's` effectiveness.

# 7. Conclusion & Reflections

## 7.1. Conclusion Summary

This study introduces `TraStrainer`, a novel online biased sampler for distributed traces in microservice systems. Its core innovation lies in its comprehensive approach to sampling preferences, which integrates both `system runtime state` and `trace diversity`, allowing for dynamic adjustment based on real-time conditions. `TraStrainer` employs an interpretable, automated encoding method for traces and uses a `dynamic voting mechanism` to combine `system-state bias` and `diversity bias` into a final sampling decision.

Through extensive evaluation on two diverse datasets (real-world production and benchmark), `TraStrainer` demonstrated significant improvements. It identified more valuable traces within the same budget, yielding an average increase of 32.63% in Top-1 `RCA` accuracy over four baseline methods. It also proved efficient enough for online deployment, even with varying trace sizes and large numbers of metrics, thanks to its `dimension reduction` strategies. The ablation studies confirmed that the combination of `system state` and `trace diversity` is crucial to its superior performance in downstream analysis tasks.

## 7.2. Limitations & Future Work

The authors acknowledge the following limitations and suggest future work:

*   **Tail-based Sampling Only:** The current `TraStrainer` implementation is purely `tail-based`. The authors suggest that their method for setting `sampling preferences` could benefit other types of `biased sampling`, specifically `retroactive sampling` [52], which retrieves trace data lazily so that biased sampling decisions can be made earlier in the trace lifecycle, potentially improving efficiency. Integrating `TraStrainer's` preference setting with `retroactive sampling` could be a promising future direction.
*   **Limited Fault Types:** While tested on data from 13 production systems and 2 benchmarks, the total number of distinct fault types was limited (7 in total). Microservice systems can exhibit more complex fault types, so further testing is needed to assess `TraStrainer's` effectiveness in more intricate failure scenarios.
*   **Implementation of Baselines:** Because `HC` and `Sifter` lack open-source implementations, the authors developed their own versions from the papers' descriptions, which introduces a potential threat to internal validity; they mitigated this by using the same libraries throughout.

## 7.3. Personal Insights & Critique

This paper presents a highly valuable and well-motivated contribution to the field of microservice monitoring and observability.

*   **Innovation and Practical Relevance:** The core innovation of integrating `system runtime state` with `trace diversity` is particularly astute. It directly addresses a critical oversight in previous sampling methods and aligns closely with how SREs intuitively triage problems (e.g., "CPU is high on Node X, so look at traces touching Node X"). This makes `TraStrainer` not just theoretically sound but also immensely practically relevant for production environments. The significant improvement in `RCA` accuracy is a strong testament to its real-world utility.
*   **Interpretability:** The emphasis on `interpretable trace encoding`, where dimensions correspond directly to `system metrics`, is a major strength. In debugging and root cause analysis, transparency and understanding why a particular trace was sampled are almost as important as the sampling quality itself. This interpretability fosters trust and makes the system more actionable.
*   **Robustness against Data Volume:** The `dimension scalability` and reduction strategies in the `Trace Encoder`, combined with the efficiency shown with a large number of `metrics`, demonstrate a thoughtful design for real-world microservice systems that are constantly evolving and generating vast amounts of telemetry data.
*   **Dynamic Adaptability:** The `dynamic voting mechanism` is a clever way to ensure budget adherence while flexibly balancing sampling biases. This adaptability is crucial in dynamic microservice environments where anomaly rates and system loads can fluctuate wildly.
*   **Potential Issues/Areas for Improvement:**
    *   **Cold Start for the System Bias Extractor:** The `DLinear` model for `anomaly detection` requires historical data to learn normal patterns. In a newly deployed system, or for entirely new `metrics`, a cold-start period might yield a less accurate `system bias` until sufficient time-series data is collected. The paper mentions a warm-up period for the `sampler`, but the `System Bias Extractor` itself might also benefit from dedicated cold-start handling.
    *   **Thresholds and Sensitivity:** The `budget sampling rate` $\beta$ is a critical parameter. While the `dynamic voting mechanism` adapts the threshold $\theta$ relative to $\beta$, the initial setting of $\beta$ and its interaction with the `anomaly detection` thresholds in the `System Bias Extractor` or the `diversity clustering` could be sensitive; further analysis on tuning these parameters would be beneficial (a minimal sketch of such a budget-adaptive rule follows this list).
    *   **Generalizability of Fault Types:** While the authors acknowledge the limited fault types, deeper investigation into complex, interacting faults (e.g., cascading failures involving multiple services and resource contention) would be valuable. The current `metric-based encoding` should, in principle, handle these, but empirical validation is key.
    *   **Integration with Retroactive Sampling:** The suggestion to integrate with `retroactive sampling` is compelling. It would allow `TraStrainer's` intelligent preference setting to influence sampling decisions earlier, potentially saving even more computational overhead in trace collection pipelines.
    *   **Computational Cost of Clustering:** The `Diversity-Biased Sampler` involves clustering `trace vectors` in the `look-back window`. While `DLinear` is lightweight, clustering can be computationally intensive, especially for large `look-back windows` or high-dimensional vectors. The paper does not detail the specific clustering algorithm used or its overhead.
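
As promised above, here is a minimal sketch of a budget-adaptive voting rule in the spirit of the `dynamic voting mechanism`; the equal weighting, the threshold update rule, and all names are illustrative assumptions rather than the paper's exact algorithm.

```python
class AdaptiveVoter:
    """Combine a system-state bias and a diversity bias into one sampling decision.

    theta drifts so the observed sampling rate tracks the budget beta; this is
    an illustrative sketch, not TraStrainer's exact update rule.
    """

    def __init__(self, beta, theta=0.5, step=0.01):
        self.beta, self.theta, self.step = beta, theta, step
        self.seen, self.kept = 0, 0

    def decide(self, system_bias, diversity_bias):
        score = 0.5 * system_bias + 0.5 * diversity_bias  # equal-weight vote
        keep = score >= self.theta
        self.seen += 1
        self.kept += int(keep)
        # Tighten theta when over budget, relax it when under budget.
        if self.kept / self.seen > self.beta:
            self.theta = min(1.0, self.theta + self.step)
        else:
            self.theta = max(0.0, self.theta - self.step)
        return keep

voter = AdaptiveVoter(beta=0.01)  # keep roughly 1% of traces over time
print(voter.decide(system_bias=0.9, diversity_bias=0.8))  # True: high-value trace
```

A production version would presumably also let the two weights shift with the severity of the system-state signal, which is exactly where the tuning sensitivity discussed above comes in.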

In conclusion, `TraStrainer` represents a significant step forward in trace sampling, offering a more intelligent and context-aware approach that promises to deliver highly actionable data for microservice operations, moving beyond simple diversity to system-aware intelligence.
