Paper status: completed

Sieve: Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems

Published: 09/01/2021
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Sieve uses attention and random cut forests to sample structurally and temporally uncommon traces, improving information retention and reducing storage costs in large-scale microservice systems.

Abstract

End-to-end tracing plays an important role in understanding and monitoring distributed microservice systems. The trace data are valuable for finding the anomalous or erroneous behavior of the system. However, the volume of trace data is huge, leading to a heavy burden on analyzing and storing it. To reduce the volume of trace data, the sampling technique is widely adopted. However, existing uniform sampling approaches are unable to capture uncommon traces that are more interesting and informative. To tackle this problem, we design and implement Sieve, an online sampler that aims to bias sampling towards uncommon traces by taking advantage of the attention mechanism. The evaluation results on the trace datasets collected from real-world and experimental microservice systems show that Sieve is effective at increasing the proportion of sampled uncommon traces while substantially reducing storage cost.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Sieve: Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems

1.2. Authors

Zicheng Huang, Pengfei Chen*, Guangba Yu, Hongyang Chen, Zibin Zheng. Their affiliation is the School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China. Pengfei Chen is indicated as the corresponding author. The authors are primarily affiliated with a university, suggesting an academic research background focused on computer science, distributed systems, and potentially AI/machine learning given the use of attention mechanisms and anomaly detection.

1.3. Journal/Conference

The paper does not explicitly state the journal or conference where it was published within the provided text. However, the presence of an Index Terms section and a REFERENCES section formatted with IEEE style suggests it is intended for publication in a peer-reviewed conference or journal in the field of computer science, likely related to distributed systems, software engineering, or machine learning for systems.

1.4. Publication Year

The publication year is not explicitly stated in the provided text. However, the references section includes citations up to 2020, with several papers from 2018, 2019, and 2020 (e.g., [7], [16]). This suggests the paper was likely published in 2020 or later.

1.5. Abstract

End-to-end tracing is crucial for understanding and monitoring distributed microservice systems, but the vast volume of trace data poses significant challenges for analysis and storage. Current uniform sampling techniques often fail to capture rare but informative traces. To address this, the paper introduces Sieve, an online sampling mechanism that utilizes an attention mechanism to prioritize sampling of traces that are structurally and temporally uncommon. Sieve encodes traces as path vectors to capture these differences, employs Robust Random Cut Forest (RRCF) for uncommon pattern detection, and calculates attention scores to determine sampling probabilities. Experimental evaluations on four diverse microservice trace datasets demonstrate that Sieve effectively increases the selection of valuable uncommon traces while substantially reducing storage costs at low sampling rates. Furthermore, Sieve exhibits robustness across varying degrees of uncommonness and parameter configurations, making it suitable for large-scale microservice environments.


2. Executive Summary

2.1. Background & Motivation

The core problem addressed by the paper is the immense volume of end-to-end trace data generated in distributed microservice systems, which creates a heavy burden for storage and analysis. While traces are vital for understanding system behavior, diagnosing issues, and detecting anomalies, most of this data is redundant, representing common, normal system operations. Sampling is a widely adopted technique to reduce data volume.

Existing sampling methods face significant challenges:

  • Head-based sampling (e.g., in Zipkin, Jaeger) makes decisions at the start of a request, lacking awareness of future events, leading to random sampling that often misses informative traces.

  • Tail-based sampling addresses some head-based limitations by deciding at the end of a request, considering metrics like latency or HTTP status codes. However, it still focuses on individual traces and overlooks the broader context of differences between the current trace and previous traces.

  • Previous methods focusing on differences (e.g., [3], [4]) primarily consider only structural differences, neglecting temporal differences, which are equally crucial for identifying unusual behavior. For example, a trace with a normal structure but abnormally high latency indicates an issue.

    The paper's entry point and innovative idea revolve around the concept of "attention." It proposes that informative traces are those that are uncommon – differing either structurally or temporally (or both) from the majority of other traces. By explicitly incorporating both structural and temporal attention, Sieve aims to bias sampling towards these rare, valuable traces, which are often indicative of anomalies or interesting corner cases.

2.2. Main Contributions / Findings

The paper makes four primary contributions:

  1. Introduction of the Attention Mechanism for Uncommon Trace Capture: Sieve introduces an attention mechanism that focuses on both temporal and structural differences to identify and capture uncommon traces. This is a significant improvement over prior work that typically focused on only one aspect.

  2. Path Vector Encoding with Temporal and Structural Information: The paper proposes a novel path vector encoding method to represent traces. This method effectively incorporates both temporal (e.g., latency of tail spans) and structural (e.g., sequence of services, number of spans) information into a single vector, making traces processable by algorithms and allowing for comprehensive comparison.

  3. Biased Sampling Scheme based on Attention Score: Sieve develops a biased sampling scheme that leverages attention scores. This scheme assigns high sampling probabilities to uncommon traces (those with high attention scores) and significantly lower probabilities to common traces, leading to substantial storage space reduction while preserving critical data.

  4. Online Sampling Implementation and Evaluation: The paper details the design and implementation of Sieve as an online sampler with low computational cost. Its effectiveness is rigorously evaluated across several real-world and experimental microservice systems, demonstrating its suitability for large-scale production environments.

    Key conclusions and findings reached by the paper include:

  • Sieve effectively detects and increases sampling probabilities for both structurally and temporally uncommon traces.
  • It substantially reduces storage costs at low sampling rates by discarding a large volume of redundant common traces.
  • The method demonstrates robustness across varying degrees of uncommonness and different parameter settings (e.g., RRCT size, number of RRCTs, threshold).
  • Sieve's sampling latency is low (typically between 3ms and 33ms), making it efficient enough for real-time online sampling in large-scale microservice systems.
  • Compared to baseline methods like hierarchical clustering (HC) and random sampling, Sieve consistently achieves a higher proportion of uncommon traces sampled for a given budget.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following foundational concepts:

  • Microservice Architecture: A software development approach where a large application is built as a suite of small, independently deployable services, each running in its own process and communicating with lightweight mechanisms (e.g., HTTP APIs). This contrasts with monolithic architectures where all components are tightly coupled.
  • Distributed Systems: Systems where components located on different networked computers communicate and coordinate their actions by passing messages. Microservice systems are a type of distributed system.
  • End-to-End Tracing (Distributed Tracing): A technique used to monitor and profile requests as they propagate through a distributed system. It records the path of a request across multiple services, capturing details like latency, service calls, and errors. A trace typically consists of multiple spans.
    • Trace: Represents a single end-to-end request or workflow. It's often visualized as a Directed Acyclic Graph (DAG) of spans.
    • Span: Represents a logical unit of work within a trace, such as a single service call or an operation within a service. Each span has a name, start time, duration (latency), and potentially tags or logs. Spans are nested to show parent-child relationships, forming the causal chain of a request.
    • Root Span: The first span in a trace, representing the entry point of the request into the system.
  • Sampling: The process of selecting a subset of data from a larger dataset. In distributed tracing, sampling is used to reduce the volume of traces collected, stored, and analyzed, as collecting all traces is often prohibitively expensive.
    • Uniform Sampling: A method where each trace (or data point) has an equal probability of being selected.
    • Head-based Sampling: A sampling decision is made at the very beginning of a trace's execution (when the root span is created). This decision is then propagated to all downstream services, meaning either the entire trace is collected or entirely dropped. Its disadvantage is that the decision is made without knowing the trace's future characteristics (e.g., if it will become anomalous).
    • Tail-based Sampling: The sampling decision is deferred until the entire trace has completed. This allows the sampler to inspect the full trace (e.g., its total latency, presence of errors, or specific service calls) before deciding whether to keep it. This method is generally more effective at capturing informative traces but requires buffering traces until completion.
  • Attention Mechanism (General Concept): In machine learning, an attention mechanism allows a model to focus on the most relevant parts of its input when processing data. Instead of treating all parts of the input equally, it assigns "attention weights" or "scores" to different parts, highlighting their relative importance. In Sieve, this means giving higher scores (and thus higher sampling probability) to traces that are deemed uncommon or anomalous.
  • Outlier Detection (Anomaly Detection): The process of identifying data points that deviate significantly from the majority of the data. These outliers are often anomalies that could indicate problems or interesting events.
  • Isolation Forest (iForest): An anomaly detection algorithm that works by isolating outliers rather than profiling normal data points. It builds multiple isolation trees (similar to decision trees) by recursively partitioning data. Anomalies, being few and different, are typically isolated with fewer partitions (i.e., closer to the root of the tree), resulting in shorter average path lengths.
  • Robust Random Cut Forest (RRCF): A variant of Isolation Forest specifically designed for streaming data. It builds multiple random cut trees (RCTs) and maintains them over time, adapting to changes in data distribution. RRCF is robust to noisy data and works well for high-dimensional streams. Sieve leverages RRCF to detect uncommon traces by treating them as outliers in a stream of path vectors. A minimal usage sketch follows after this list.
  • Path Vector: A numerical representation of a trace. Sieve encodes traces into path vectors by capturing structural (which paths exist) and temporal (latency values along those paths) information. This vector serves as the input to the RRCF for anomaly detection.
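
To make the streaming usage of RRCF concrete, the following is a minimal Python sketch built on the open-source rrcf package (a community implementation of Robust Random Cut Forests). The point distributions, window size, and the use of the package's CoDisp score (rather than Sieve's depth-based attention score, described later in Section 4.2.2.3) are illustrative assumptions, not the paper's setup:

```python
import numpy as np
import rrcf  # pip install rrcf

rng = np.random.default_rng(0)
num_trees, tree_size, dim = 10, 64, 4

# Warm up each tree with an initial window of "common" points.
warmup = rng.normal(100, 5, (tree_size, dim))
forest = []
for _ in range(num_trees):
    tree = rrcf.RCTree()
    for i, p in enumerate(warmup):
        tree.insert_point(p, index=i)
    forest.append(tree)

# Stream new points; every 100th one is a large outlier.
for i in range(tree_size, tree_size + 300):
    point = rng.normal(1000, 5, dim) if i % 100 == 0 else rng.normal(100, 5, dim)
    scores = []
    for tree in forest:
        tree.forget_point(i - tree_size)   # evict the oldest point (sliding window)
        tree.insert_point(point, index=i)  # insert the newcomer
        scores.append(tree.codisp(i))      # CoDisp: higher means more isolated
    if np.mean(scores) > 30:
        print(f"point {i} looks uncommon (mean CoDisp {np.mean(scores):.1f})")
```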

3.2. Previous Works

The paper discusses several related works, mainly focusing on distributed tracing systems and sampling techniques:

  • Distributed Tracing Systems (Dapper, Zipkin, Jaeger, X-Trace, Magpie):

    • Dapper [8]: Google's production distributed systems tracing platform. It's a foundational work that established many concepts for distributed tracing. It uses random sampling (e.g., < 0.1% sampling rate) to reduce storage costs.
    • Zipkin and Jaeger: Open-source distributed tracing systems that follow Dapper's model. They primarily use head-based sampling, making sampling decisions at the start of a request.
    • Magpie [17] and X-Trace [19]: Earlier tracing frameworks that aim to capture control flow, resource consumption, and provide comprehensive views of system behavior.
    • Limitation: While these systems are crucial for observability, their default uniform or head-based sampling mechanisms are limited in their ability to capture uncommon or informative traces.
  • Performance Diagnosis and Anomaly Detection (Draco, Seer, Sipta, MEPFL, AutoMap):

    • Works like Draco [9], Seer [10], Sage [11], Sipta [12], MEPFL [15], AutoMap [23], and others [13], [14], [20]-[22], [24] focus on using trace data to detect performance issues, anomalies, or pinpoint root causes.
    • MEPFL [15]: Trains prediction models using trace-level and microservice-level features to predict latent errors and locate faults. Sieve could potentially incorporate some of these features to improve its uncommonness detection.
    • Limitation: These works focus on analysis after traces are collected, rather than optimizing the collection process through intelligent sampling.
  • Biased Sampling Techniques:

    • JCallGraph [25]: Samples only successful traces and preserves all failure invocation traces. While reducing network bandwidth, it loses traces with high latency that might not be failures but still indicate problems.
    • Martin Bauer et al. [26]: Introduces trace sampling into conformance checking, continuously sampling until no new information for conformance is found. This is a domain-specific approach.
    • Hierarchical Clustering (HC) Sampling [3]: A method that biases sampling to maximize the diversity of traces based on the number of spans (a structural characteristic).
      • Limitation: HC sampling primarily focuses on structural diversity and is not designed for online sampling. It also fails to account for temporal uncommonness (e.g., a trace with a normal structure but unusually high latency).
    • Sifter [4]: Builds a low-dimension model to approximate common-case behavior and biases sampling towards traces poorly captured by this model.
      • Limitation: Similar to HC, Sifter mainly focuses on the structure of traces and overlooks temporal characteristics. This means traces with uncommon temporal behavior would be ignored.
    • Error-bounded stratified sampling [27]: From database literature, this scheme reduces sample size with knowledge of data distribution. It's a different domain but shows the general need for intelligent sampling.

3.3. Technological Evolution

The evolution of distributed tracing and sampling has progressed from basic random sampling (e.g., Dapper) to more sophisticated methods that attempt to capture informative traces. Initially, the focus was on simply reducing data volume. As systems grew more complex, the need to retain anomalous or uncommon traces became critical for debugging and monitoring. Early biased sampling methods started by considering explicit error codes or high latency (tail-based) or structural diversity (HC, Sifter). However, a holistic view combining both structural and temporal deviations, especially in an online and adaptive manner, remained a challenge.

This paper's work, Sieve, fits into this evolutionary timeline by addressing the limitations of previous biased sampling approaches. It advances the state-of-the-art by proposing a method that explicitly considers both temporal and structural attention for uncommon trace detection and applies it in an online sampling context using an adapted RRCF algorithm.

3.4. Differentiation Analysis

Compared to the main methods in related work, Sieve offers several core differences and innovations:

  • Comprehensive Attention (Structural AND Temporal): Unlike HC [3] and Sifter [4] which primarily focus on structural differences (e.g., number of spans, call graphs), Sieve explicitly incorporates both structural and temporal information into its path vector encoding. This allows it to identify uncommon traces that might have a normal structure but abnormal latencies, or vice-versa.

  • Online and Adaptive Sampling: Sieve is designed as an online sampler that processes trace data in real-time. It adapts to changes in trace distribution by continuously updating its Robust Random Cut Forest (RRCF) model, removing older traces and incorporating newer ones. This is a significant advantage over offline or batch-processing methods.

  • Adaptive RRCF for Variable-Length Path Vectors: Sieve innovates on the RRCF model by enabling it to handle path vectors of varying lengths (due to new execution paths appearing). It introduces mechanisms like path vector expansion and dimension reduction to manage this complexity, which is crucial for handling the dynamic nature of microservice traces.

  • Attention Score-based Biased Sampling: Sieve uses an attention score derived from RRCF to directly inform sampling probabilities. This score quantifies how "uncommon" a trace is, allowing for a more granular and intelligent biased sampling decision compared to simpler heuristics or fixed rules.

  • Robustness to Uncommonness Degree and Parameters: The evaluation demonstrates Sieve's robustness, meaning its performance is stable even with varying degrees of uncommonness and across a range of parameter settings, which is important for practical deployment.

    In essence, Sieve provides a more holistic, adaptive, and robust approach to biased sampling for distributed trace data by combining structural and temporal insights with an online anomaly detection algorithm.

4. Methodology

4.1. Principles

The core principle of Sieve is to intelligently reduce the volume of end-to-end trace data by prioritizing the sampling of uncommon traces over common or redundant ones. It operates on the intuition that most microservice systems run steadily, producing a large volume of similar traces. Uncommon executions, however, are rare but highly informative for debugging, diagnosis, and anomaly detection. Sieve aims to identify these rare patterns by focusing on temporal and structural differences between traces. It achieves this by using an isolation-based approach (specifically, an adapted Robust Random Cut Forest) to detect outliers in a stream of trace representations, assigning higher attention scores to these uncommon traces, and then using these scores to bias the sampling probability. The overall goal is to preserve useful information while drastically reducing storage overhead.

4.2. Core Methodology In-depth (Layer by Layer)

Sieve's workflow consists of several key components working in concert, as illustrated in Figure 2. The following figure (Figure 2 from the original paper) shows the overview of Sieve's sampling workflow:

[Figure 2: Overview of the sampling workflow of Sieve.]

The Trace Data (raw traces) are first processed by the Path Vector Encoder to convert them into a numerical format called path vectors. These path vectors are then fed into the Adaptive Scorer, which uses an enhanced Robust Random Cut Forest (RRCF) model to calculate an attention score for each vector. Finally, the Biased Sampler uses this attention score (along with previous scores) to decide whether to sample (keep) or discard the trace. The sampled traces are stored persistently.

4.2.1. Path Vector Encoder

The Path Vector Encoder is responsible for transforming a raw trace (a DAG of spans) into a numerical path vector $\mathbf{x} = (x_1, \ldots, x_n)$. This encoding is crucial because algorithms cannot directly process human-readable trace text. The encoder incorporates both structural and temporal information into the vector representation.

  • Trace Traversal and Path Set Generation:

    • A trace represents the execution path of a request. It starts with a root span.
    • From the root span, Sieve recursively traverses all spans to identify unique execution paths. A path is a sequence of spans from the root span to any subsequent span (e.g., $A \to B \to C$).
    • Each identified path $p$ is associated with the latency $l_t$ of the tail span (the last span) of that path.
    • This process generates a path set $P$ and a corresponding latency set $L$ for a given trace.
  • Path Vector Construction:

    • The path vector $\mathbf{x}$ is built such that each index $i$ is associated with a unique path $p_i$ that Sieve has encountered across all traces seen so far.

    • If a trace contains path $p_i$, the value $x_i$ at index $i$ in the path vector is assigned the latency $l_i$ of the tail span of that path.

    • If a trace does not contain the path $p_i$ associated with index $i$, then $x_i$ is assigned the special value -1. This marks the index as invalid, indicating the absence of that particular path in the current trace.

    • Handling Multiple Identical Paths: If a trace contains multiple instances of the same path, Sieve encodes the maximum latency among them into the path vector. This ensures that potentially problematic high latencies are captured.

      The following figure (Figure 3 from the original paper) shows an example of path vector encoding of two traces:

      [Figure 3: An example of path vector encoding of two traces.]

Figure 3 illustrates this:

  • Trace 1 (left): Contains paths $A$, $A \to B$, $A \to B \to C$, $A \to B \to D$. The path vector encodes the latencies of the tail spans of these paths.

  • Trace 2 (right): Contains paths $A$, $A \to B$, $A \to B \to E$. If path $A \to B \to C$ corresponds to index 2 in the global path set and Trace 2 does not contain this path, then $x_2$ for Trace 2 is -1.

  • Distinguishability: This encoding scheme allows Sieve to distinguish traces based on:

    • Structural Differences: Traces with different structures have different sets of valid indices (values other than -1), making them distinguishable.
    • Temporal Differences: Traces with the same structure but different latencies on common paths have different values at their valid indices, making them distinguishable.
  • Extensibility: Sieve can extend this path vector to include other meaningful features by concatenating them. Examples include:

    • Number of spans: Provides additional structural information.
    • Request status code: Important for discovering anomalous traces (e.g., HTTP 404/500 errors). This creates a more informative trace encoding for the subsequent isolation process.
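
To make this encoding concrete, the following is a minimal Python sketch, assuming a simple tree-of-spans representation (the Span class, service names, and latency values are hypothetical; a real encoder would operate on collected trace records):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    service: str
    latency: float                      # latency of this span, in ms
    children: list = field(default_factory=list)

# Global mapping: path (tuple of services from the root) -> vector index.
path_index = {}

def encode(root):
    """Encode one trace (a tree of spans) into a path vector."""
    latencies = {}

    def walk(span, prefix):
        path = prefix + (span.service,)
        # Multiple instances of the same path: keep the maximum tail latency.
        latencies[path] = max(latencies.get(path, 0.0), span.latency)
        for child in span.children:
            walk(child, path)

    walk(root, ())
    for path in latencies:              # register newly seen paths globally
        path_index.setdefault(path, len(path_index))
    vec = [-1.0] * len(path_index)      # -1 marks an absent path
    for path, lat in latencies.items():
        vec[path_index[path]] = lat
    return vec

# Trace 1: A -> B -> {C, D}; Trace 2: A -> B -> E (as in Figure 3).
t1 = Span("A", 40.0, [Span("B", 30.0, [Span("C", 10.0), Span("D", 12.0)])])
t2 = Span("A", 35.0, [Span("B", 28.0, [Span("E", 9.0)])])
print(encode(t1))  # [40.0, 30.0, 10.0, 12.0]
print(encode(t2))  # [35.0, 28.0, -1.0, -1.0, 9.0]  (path A->B->C is absent)
```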

4.2.2. Adaptive Scorer

The Adaptive Scorer takes the path vectors as input and determines an attention score for each, reflecting its uncommonness. It achieves this using an enhanced Robust Random Cut Forest (RRCF).

4.2.2.1. Isolation

  • Core Idea: Since traces are encoded as path vectors, finding uncommon traces becomes equivalent to finding uncommon path vectors. Uncommonness implies deviation from the majority distribution.
  • Partitioning Procedure: Sieve uses an isolation-based approach (like Isolation Forest) to repeatedly partition the set of path vectors until each vector is isolated. Uncommon path vectors are expected to be isolated earlier and more easily because they are fewer in number and differ significantly from the common majority.
  • Tree Structure: A tree is used to represent this partitioning process. The length of the path from the root to a leaf node (where a single path vector resides) indicates the number of partitions required to isolate that vector. Uncommon path vectors will have shorter paths (shallower depths) than common path vectors.
  • RRCF for Streaming Data: Sieve adopts Robust Random Cut Forest (RRCF) because it is well-suited for outlier detection in streaming data and can adapt to changing data distributions. Sieve builds an enhanced RRCF model to calculate attention scores.

4.2.2.2. RRCF Adaptation for Variable Dimensions

The original RRCF model typically processes data with fixed dimensions. Sieve introduces innovations to handle path vectors whose dimensions can vary (e.g., when a trace contains paths never seen before).

  • RRCF Structure: An RRCF is composed of many Robust Random Cut Trees (RRCTs). Each RRCT is built independently and has a fixed size $m$ (the number of path vectors it holds).

  • Two Stages of an RRCT:

    • Construction Stage:
      • When initially building an RRCT, it takes a set $X$ of $m$ path vectors of dimension $n$: $X = \{ \mathbf{x}_i = (x_{i1}, \ldots, x_{in}) \mid 1 \leq i \leq m \}$.

      • Each RRCT is built by recursively partitioning $X$. Starting from the root, Sieve selects a cutting dimension $j$ ($1 \leq j \leq n$) and a cutting value $q_j$.

      • Cutting Dimension Selection (Sieve's Innovation): Sieve prioritizes structural distinction.

        • First, it checks for invalid dimensions: a dimension $j$ is invalid if, for some path vector in the current subset, index $j$ holds -1 (meaning that path is absent). Sieve prefers to select one of these invalid dimensions as the cutting dimension so as to quickly separate traces with different structures.
        • If no invalid dimensions exist, Sieve selects the cutting dimension based on its weight. The weight $w_j$ of dimension $j$ is calculated as: $ w_j = \frac{ \max_{1 \leq i \leq m} x_{ij} - \min_{1 \leq i \leq m} x_{ij} }{ \sum_{j'=1}^{n} \left( \max_{1 \leq i \leq m} x_{ij'} - \min_{1 \leq i \leq m} x_{ij'} \right) } $
          • Symbol Explanation:
            • $w_j$: The weight of dimension $j$.
            • $\max_{1 \leq i \leq m} x_{ij}$: The maximum value in dimension $j$ across all $m$ path vectors in the current subset.
            • $\min_{1 \leq i \leq m} x_{ij}$: The minimum value in dimension $j$ across all $m$ path vectors in the current subset.
            • $\sum_{j'=1}^{n} (\cdot)$: Summation over all $n$ dimensions.
          • Purpose: This formula computes the relative range (difference between maximum and minimum values) of a dimension compared to the sum of ranges across all dimensions. A dimension with a larger range exhibits greater variability, making it more effective for partitioning. The probability of selecting dimension $j$ is proportional to $w_j$, prioritizing dimensions with the largest differences (often indicative of temporal uncommonness).
      • Cutting Value Selection:

        • If the selected cutting dimension $j$ is an invalid index for some path vectors, the cutting value $q_j$ is fixed at -0.5. This ensures that path vectors with -1 (invalid) are separated from those with actual latency values.
        • Otherwise (if it is a valid dimension), $q_j$ is selected uniformly at random between the minimum and maximum values of dimension $j$ in the current subset.
      • Partitioning: Based on $q_j$, the subset $X$ is split into $X_L$ (where $x_{ij} \leq q_j$) and $X_R$ (where $x_{ij} > q_j$), forming the left and right children. This process continues recursively until each leaf contains only one path vector. A code sketch of this selection step appears at the end of this subsection.

        The following figure (Figure 4 from the original paper) shows the workflow of the construction stage (left) and the maintenance stage (right):

        [Figure 4: The workflow of the construction stage (left) and the maintenance stage (right).]

    The following figure (Figure 5 from the original paper) shows the illustration of the building procedure of one RRCT:

    [Figure 5: Illustration of the building procedure of one RRCT.]

    • Maintenance Stage: After construction, RRCTs are dynamically updated for streaming data.
      • Leaf Removal: When a new path vector arrives, the leaf containing the oldest path vector in the RRCT is removed to maintain a constant size $m$. The RRCT is adjusted by replacing the parent of the removed leaf with its sibling and updating the bounding box (which records the upper and lower bounds of values on each dimension for a node's children) of the internal node above the sibling.

        The following figure (Figure 6 from the original paper) shows the illustration of the removal of the oldest leaf:

        [Figure 6: Illustration of the removal of the oldest leaf.]

      • Path Vector Insertion: The incoming path vector is then inserted into each RRCT.

        • Handling New Dimensions (Sieve's Innovation): If the incoming path vector introduces new dimensions (paths never seen before):

          1. All existing path vectors currently in the RRCT are extended by appending -1 for these new dimensions.

          2. The bounding boxes of internal nodes are also extended similarly.

          3. A new root is generated. Its cutting dimension is set to one of the new dimensions, and its cutting value to -0.5. The old root becomes its left child, and the incoming path vector (representing the new dimension) becomes its right child. The new root's bounding box is updated.

            The following figure (Figure 7 from the original paper) shows the illustration of the insertion of path vector with a new dimension:

            [Figure 7: Illustration of the insertion of a path vector with a new dimension.]

        • Handling Existing Dimensions: If the incoming path vector has no new dimensions, Sieve uses the original RRCF insertion method [5].

      • Adaptability: By maintaining a constant size $m$ of the latest path vectors and dynamically updating, Sieve evaluates new traces in the context of the recent data distribution, enabling it to adapt to evolving system behavior.
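
As referenced above, the following is a minimal Python sketch of a single partition step, assuming NumPy arrays of path vectors with -1 marking absent paths (the real implementation applies this recursively while maintaining bounding boxes):

```python
import numpy as np

def choose_cut(X):
    """Pick a cutting dimension and cutting value for one partition step.

    X: (m, n) array of path vectors in the current subset; -1 marks an
    absent path (invalid index).
    """
    # 1) Prefer an invalid dimension where only some vectors have the path:
    #    cutting at -0.5 then separates the two structures immediately.
    mixed = [j for j in range(X.shape[1])
             if (X[:, j] == -1).any() and (X[:, j] != -1).any()]
    if mixed:
        return int(np.random.choice(mixed)), -0.5

    # 2) Otherwise sample a dimension with probability proportional to its
    #    value range (the weight w_j), then cut uniformly inside that range.
    ranges = X.max(axis=0) - X.min(axis=0)
    weights = ranges / ranges.sum()
    j = int(np.random.choice(X.shape[1], p=weights))
    q = float(np.random.uniform(X[:, j].min(), X[:, j].max()))
    return j, q

X = np.array([[40.0, 30.0, 10.0],
              [42.0, 31.0, -1.0],
              [39.0, 29.0, 11.0]])
j, q = choose_cut(X)                      # picks dimension 2, value -0.5
left, right = X[X[:, j] <= q], X[X[:, j] > q]
```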

4.2.2.3. Attention Score

After RRCTs are built and maintained, an attention score is assigned to each path vector.

  • Intuition: Uncommon path vectors are isolated at shallower depths in the tree (closer to the root) because fewer splits are needed to separate them.

  • Individual RRCT Score: For a path vector $\mathbf{x}_i$ ending up in leaf $l'$ of RRCT $t_j$, its attention score $s_i$ from that specific tree is calculated as: $ s_i = \frac{ \max_{l} \mathrm{depth}(l, t_j) }{ \mathrm{depth}(l', t_j) } $

    • Symbol Explanation:
      • $s_i$: The attention score for path vector $\mathbf{x}_i$ from a single RRCT $t_j$.
      • $\max_{l} \mathrm{depth}(l, t_j)$: The maximum leaf depth in RRCT $t_j$, acting as a normalization factor.
      • $\mathrm{depth}(l', t_j)$: The actual depth of the leaf $l'$ containing path vector $\mathbf{x}_i$ in RRCT $t_j$.
    • Purpose: A smaller $\mathrm{depth}(l', t_j)$ (a shallower leaf) yields a higher score, indicating greater uncommonness.
  • Final Attention Score: Since an RRCF consists of $k$ RRCTs, the final attention score $s$ for a path vector is the average of the scores from all RRCTs: $ s = \frac{ s_1 + \ldots + s_k }{ k } $

    • Symbol Explanation:
      • $s$: The final attention score for the path vector.
      • $s_1, \ldots, s_k$: The individual attention scores obtained from each of the $k$ RRCTs.
      • $k$: The total number of RRCTs in the RRCF.
    • Purpose: Averaging scores from multiple trees reduces bias and provides a more robust measure of uncommonness. A high final score across most trees strongly suggests an uncommon trace.
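
Numerically, the scoring reduces to a small computation over leaf depths. The following sketch uses made-up depths for illustration:

```python
def attention_score(leaf_depths, max_depths):
    """Average the per-tree scores max_depth / leaf_depth over k RRCTs.

    leaf_depths[i]: depth of the leaf holding the path vector in tree i.
    max_depths[i]:  maximum leaf depth of tree i (the normalization factor).
    """
    scores = [m / d for m, d in zip(max_depths, leaf_depths)]
    return sum(scores) / len(scores)

# A vector isolated near the root (depth 2) in trees of maximum depth 10
# scores far higher than one buried at depth 9 or 10.
print(attention_score([2, 3, 2], [10, 10, 10]))   # ~4.44 -> uncommon
print(attention_score([9, 10, 9], [10, 10, 10]))  # ~1.07 -> common
```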

4.2.2.4. Dimension Reduction

The path vector expansion mechanism, while necessary for adaptability, can lead to a curse of dimensionality as the number of unique paths grows indefinitely. Sieve implements dimension reduction to mitigate this:

  • Problem: Sieve records every execution path it has ever seen, even though the RRCTs only hold the latest $m$ path vectors. As a result, many dimensions become invalid (value -1) for all path vectors currently in an RRCT.
  • Safe Removal of Invalid Dimensions: Sieve safely removes dimensions that are invalid for all path vectors within the RRCT because they do not influence the partitioning decision.
  • Aggressive Removal of Weak Dimensions: Sieve takes a more aggressive approach by temporarily removing weak dimensions that meet two criteria:
    1. The dimension is not a cutting dimension.
    2. The variance of values in that dimension is lower than 0.1.
    • Rationale: Weak dimensions have very little variability and contribute almost nothing to the partitioning process, hence they are unlikely to be chosen as cutting dimensions.

    • Temporary Removal and Re-introduction: Sieve monitors the variance of these weak dimensions. If a weak dimension's variance later increases above 0.1, it is re-classified as a normal dimension, and all path vectors in the RRCT are extended to include this dimension again.

      The following figure (Figure 8 from the original paper) shows the number of invalid dimensions and weak dimensions:

      [Figure 8: The number of invalid dimensions and weak dimensions under different RRCT sizes.]

Figure 8 illustrates the growth of invalid and weak dimensions over time and how their numbers change with different RRCT sizes, highlighting the necessity of this reduction.
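
The following is a minimal sketch of this dimension classification, assuming a NumPy matrix of the path vectors currently held by one RRCT (the function name, the bookkeeping, and the example values are illustrative):

```python
import numpy as np

VAR_THRESHOLD = 0.1  # variance below 0.1 marks a weak dimension (per the paper)

def classify_dimensions(X, cutting_dims):
    """Split dimensions into invalid, weak, and active ones.

    X: (m, n) matrix of path vectors in one RRCT; -1 marks an absent path.
    cutting_dims: set of dimensions currently used as cutting dimensions.
    """
    invalid, weak, active = [], [], []
    for j in range(X.shape[1]):
        col = X[:, j]
        if (col == -1).all():                # path absent in every vector:
            invalid.append(j)                # safe to remove outright
        elif j not in cutting_dims and col.var() < VAR_THRESHOLD:
            weak.append(j)                   # temporarily removable
        else:
            active.append(j)                 # keep (a weak dimension returns
                                             # here if its variance grows again)
    return invalid, weak, active

X = np.array([[-1.0, 100.0, 5.0],
              [-1.0, 103.0, 5.1],
              [-1.0, 250.0, 5.0]])
print(classify_dimensions(X, cutting_dims={1}))
# -> ([0], [2], [1]): dim 0 invalid, dim 2 weak (tiny variance), dim 1 active
```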

4.2.3. Biased Sampler

The Biased Sampler makes the final decision to sample or drop a trace based on its attention score.

  • Sliding Window: Sieve maintains a sliding window containing the $k$ most recent attention scores, together with the current score $s_{k+1}$.

  • Variance Difference: It calculates two variances:

    • $var_k$: The variance of the past $k$ scores in the window.
    • $var_{k+1}$: The variance of all $k+1$ scores (the past $k$ plus the current one).
  • Deviation Detection: The difference $(var_{k+1} - var_k)$ indicates how much the current score $s_{k+1}$ deviates from the distribution of the previous $k$ scores. A significant increase suggests the current trace is uncommon.

  • Sampling Probability Calculation: The sampling probability $f(s_{k+1})$ for the current trace is determined as follows: $ f(s_{k+1}) = \begin{cases} \dfrac{1}{1 + e^{2\overline{W} - s_{k+1}}} & \text{if } var_{k+1} - var_k > h, \\[1ex] \dfrac{s_{k+1}}{\sum_{i=1}^{k+1} s_i} & \text{if } var_{k+1} - var_k \leq h. \end{cases} $

    • Symbol Explanation:

      • $f(s_{k+1})$: The sampling probability for the current trace with attention score $s_{k+1}$.
      • $\overline{W}$: The mean value of the scores in the sliding window $W = [s_1, \ldots, s_{k+1}]$.
      • $var_{k+1}$: Variance of the scores in the sliding window including the current score.
      • $var_k$: Variance of the scores in the sliding window excluding the current score (i.e., only the previous $k$ scores).
      • $h$: A predefined threshold.
      • $e$: Euler's number (the base of the natural logarithm).
      • $\sum_{i=1}^{k+1} s_i$: The sum of all scores in the sliding window.
    • Interpretation of the formula:

      • If $var_{k+1} - var_k > h$: The current score $s_{k+1}$ deviates significantly from the distribution of previous scores, indicating strong uncommonness. In this case, Sieve uses the sigmoid function $1 / (1 + e^{2\overline{W} - s_{k+1}})$, which rapidly pushes the sampling probability close to 1 for high attention scores, ensuring that uncommon traces are almost certainly sampled.
      • If $var_{k+1} - var_k \leq h$: The current score does not cause a significant deviation, so the sampling probability is proportional to its attention score relative to the sum of all scores in the sliding window. Common traces with low attention scores therefore receive very low sampling probabilities.
  • Sampling Decision: After calculating $f(s_{k+1})$, Sieve generates a random number in $[0, 1]$. If $f(s_{k+1})$ is greater than or equal to this number, the trace is sampled; otherwise, it is dropped.
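
Putting the two cases together, the following sketch computes the sampling probability and the final coin flip (standard-library Python; the window values and threshold are made up for illustration):

```python
import math
import random
import statistics

def sampling_probability(window, h):
    """window = [s_1, ..., s_k, s_{k+1}], where the last entry is the
    current trace's attention score; h is the variance threshold."""
    var_k1 = statistics.pvariance(window)       # variance with current score
    var_k = statistics.pvariance(window[:-1])   # variance without it
    s_cur = window[-1]
    if var_k1 - var_k > h:
        # Strong deviation: the sigmoid centered at twice the window mean
        # pushes the probability toward 1 for high scores.
        w_bar = statistics.fmean(window)
        return 1.0 / (1.0 + math.exp(2 * w_bar - s_cur))
    # Otherwise: sample proportionally to the score's share of the window.
    return s_cur / sum(window)

window = [1.0, 1.1, 0.9, 1.0, 4.5]  # the last trace scored far above the rest
p = sampling_probability(window, h=0.5)
print(f"p = {p:.3f}, sampled = {random.random() < p}")  # p ~ 0.75 here
```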

4.2.4. Online Sampling

Sieve is designed for real-time online sampling.

Algorithm 1: Online Sampling Algorithm

Input:

  • The trace t
  • The RRCF model rrcf
  • The sliding window sw of the previous k scores
  • The threshold h

Output:

  • The sampling decision decision (True if sampled, False if dropped)
  • The updated sliding window sw
  • The updated RRCF model rrcf

Steps:

  1. Encode Trace: x = encode(t). The incoming trace t is encoded into a path vector x.
  2. Initialize Scores List: scores = []. An empty list to hold the individual RRCT scores.
  3. Iterate through RRCTs: For each rrct in rrcf: (a) rrct.remove_oldest(): the oldest path vector (leaf) is removed from the rrct to maintain its fixed size; (b) rrct.insert(x): the new path vector x is inserted into the rrct, handling new dimensions as described in Section 4.2.2.2; (c) s = rrct.score(x): the attention score s for x is calculated from this specific rrct; (d) scores.append(s): the individual score is added to the scores list.
  4. Calculate Average Score: avg = average(scores). The final attention score avg for the path vector x is the average of the scores from all RRCTs.
  5. Update Sliding Window: sw.append(avg). The calculated avg score is appended to the sliding window.
  6. Calculate Sampling Probability: (a) if variance(sw) - variance(sw[1:k]) > h (i.e., the current score significantly deviates from the previous k scores, with sw[1:k] denoting those previous scores in 1-indexed notation), then p = 1 / (1 + e^(2 * average(sw) - avg)): the sampling probability p is computed with the sigmoid function, pushing it close to 1; (b) otherwise, p = avg / sum(sw): the sampling probability p is proportional to the current score's share of the sliding window.
  7. Remove Oldest Score: sw.remove(sw[0]). The oldest score is removed from the sliding window to maintain its fixed length.
  8. Make Sampling Decision: (a) if random(0, 1) < p, then decision = True: the trace is sampled; (b) otherwise, decision = False: the trace is dropped.
  9. Return: return decision, sw, rrcf. The sampling decision, updated sliding window, and updated RRCF model are returned.
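
For reference, the following is a compact Python rendering of Algorithm 1, reusing encode() and sampling_probability() from the sketches above and assuming duck-typed RRCT objects that expose remove_oldest / insert / score (all illustrative names, not the paper's code):

```python
import random

def online_sample(trace, forest, sw, h):
    """One pass of Algorithm 1: score an incoming trace and decide keep/drop."""
    x = encode(trace)                     # step 1: trace -> path vector
    scores = []
    for rrct in forest:                   # step 3: update every RRCT
        rrct.remove_oldest()              #   keep the tree size m constant
        rrct.insert(x)                    #   handles new dimensions if any
        scores.append(rrct.score(x))
    avg = sum(scores) / len(scores)       # step 4: final attention score
    sw.append(avg)                        # step 5: extend the sliding window
    p = sampling_probability(sw, h)       # step 6: piecewise probability
    sw.pop(0)                             # step 7: drop the oldest score
    return random.random() < p            # step 8: biased sampling decision
```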

4.2.5. Implementation Details

The authors implemented Sieve in Python based on an open-source RRCF implementation. Key modifications include:

  • Cutting Dimension Selection: Prioritizing invalid dimensions and new dimensions to quickly identify structural uncommonness.
  • Scoring Scheme: Replacing the original RRCF scoring with their attention score calculation (Section 4.2.2.3).
  • Path Vector Expansion: Enabling RRCF to handle variable-length path vectors (Section 4.2.2.2).
  • Dimension Reduction: Implementing mechanisms to remove invalid and weak dimensions to combat the curse of dimensionality (Section 4.2.2.4).

5. Experimental Setup

5.1. Datasets

Sieve was evaluated on four different microservice trace datasets:

  1. VWR (Virtual War Room):

    • Source: Traces generated by Virtual War Room (VWR) [16], a simulation framework.
    • Characteristics: Simulated microservice system with 6 microservices. Two types of faults were injected: network delay and early stop. Traces collected using OpenTracing.
    • Scale: 34,167 total traces; 32,592 normal, 66 with high network delay, 1,509 with early stop.
    • Purpose: To test Sieve's ability to detect both temporal (network delay) and structural (early stop) uncommonness in a controlled environment.
  2. AIOps (Real-world microservice system):

    • Source: Generated by a real microservice application deployed in a private cloud environment of an ISP (Internet Service Provider) and used for an international AIOps challenge.
    • Characteristics: 13 microservices, some with multiple instances. Various faults injected into different instances to create diverse anomalies.
    • Scale: 164,340 normal traces and 4,092 abnormal traces.
    • Purpose: To evaluate Sieve's performance on a larger, more complex real-world dataset with varied anomaly types.
  3. Boutique (Online Boutique microservice benchmark):

    • Source: An open-source microservice demo application (Online Boutique) consisting of a 10-tier microservice. Instrumented with OpenCensus and deployed on a Kubernetes cluster with 8 nodes.
    • Characteristics: Workloads run to emit traces. Two types of faults injected: delay in email microservice response and breakdown in product catalog microservice.
    • Scale: 2,000 traces collected; 99 anomalous traces.
    • Purpose: To test Sieve on a standard benchmark application with known fault injection scenarios.
  4. TC (Real production traces):

    • Source: 6,561 traces from a large telecommunication enterprise.
    • Characteristics: Traces grouped into 10 different API types. The longest trace contained 451 spans.
    • Purpose: To assess Sieve's performance on highly complex production traces, particularly its representative sampling capabilities across different API types and its ability to handle long traces.

Latency Settings for Temporal Attention Experiments (Table I):

The following are the results from Table I of the original paper:

| Dataset | Spans | Common Latency | Uncommon Latency |
| --- | --- | --- | --- |
| VWR | 6 | 200∼400ms | >100000ms |
| AIOps | 58 | 100∼200ms | 500∼1000ms |
| Boutique | 28 | <200ms | >500ms |
| TC | 15 | 22∼30ms | 50∼66ms |

This table provides concrete examples of the latency characteristics of common and uncommon traces used in the temporal attention experiments for each dataset. For example, in VWR, common traces have 6 spans and latency between 200-400ms, while uncommon ones also have 6 spans but latency over 100,000ms.

Trace Structure Settings for Structural Attention Experiments (Table II):

The following are the results from Table II of the original paper:

| Dataset | Latency | Common Spans | Uncommon Spans |
| --- | --- | --- | --- |
| VWR | <200ms | 6 | {4, 5} |
| AIOps | 400∼450ms | 58 | 59 |
| Boutique | <200ms | 28 | 18 |
| TC | 25∼35ms | 15 | 14 |

This table specifies the structural characteristics (number of spans) of common and uncommon traces used in the structural attention experiments, alongside their latency ranges. For instance, in VWR, common traces have 6 spans and latency below 200ms, while uncommon ones have 4 or 5 spans and similar latency.

5.2. Evaluation Metrics

The paper evaluates Sieve using several metrics, implicitly derived from its goals:

  1. Proportion of Uncommon Traces Sampled:

    • Conceptual Definition: This metric measures the effectiveness of a sampling method in retaining informative traces (i.e., uncommon traces) within the final sampled set, especially when compared to the total number of uncommon traces in the original dataset. A higher proportion indicates better sampling quality.
    • Mathematical Formula: Not explicitly provided in the paper, but can be inferred as: $ \text{Proportion} = \frac{\text{Number of Uncommon Traces Sampled}}{\text{Total Number of Uncommon Traces in Dataset}} $
    • Symbol Explanation:
      • Number of Uncommon Traces Sampled: The count of traces identified as uncommon by the ground truth and successfully selected by the sampling algorithm.
      • Total Number of Uncommon Traces in Dataset: The total count of traces identified as uncommon in the entire input dataset (ground truth).
  2. Sampling Rate / Storage Cost Reduction:

    • Conceptual Definition: This metric quantifies the reduction in data volume achieved by the sampling method. A lower sampling rate (percentage of traces kept) implies higher storage cost reduction.
    • Mathematical Formula: $ \text{Sampling Rate} = \frac{\text{Number of Sampled Traces}}{\text{Total Number of Traces in Dataset}} \times 100\% $ and $ \text{Storage Cost Reduction} = 1 - \text{Sampling Rate} $
    • Symbol Explanation:
      • Number of Sampled Traces: The total count of traces selected by the sampling algorithm.
      • Total Number of Traces in Dataset: The total count of all traces in the input dataset.
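    • Worked Example: VWR's reported sampling rate of 2.5% (see Table VI in Section 6.1.5) corresponds to a storage cost reduction of 1 - 0.025 = 97.5%.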
  3. True Positive (TP):

    • Conceptual Definition: In the context of Sieve's sensitivity evaluation, a True Positive occurs when an uncommon trace (as identified by ground truth or predefined criteria) is correctly given an attention score that causes its sampling probability to exceed the threshold, leading to it being considered for sampling with high likelihood.
    • Mathematical Formula: $ \text{TP} = \text{Count of (Ground Truth Uncommon Trace AND Sampler Flags as Uncommon)} $
    • Symbol Explanation:
      • Ground Truth Uncommon Trace: A trace that is genuinely uncommon based on the experimental setup (e.g., specific latency range, structural deviation).
      • Sampler Flags as Uncommon: A trace whose attention score makes the variance difference $(var_{k+1} - var_k)$ exceed the threshold $h$.
  4. False Positive (FP):

    • Conceptual Definition: A False Positive occurs when a common trace (as identified by ground truth) is incorrectly given an attention score that causes its sampling probability to exceed the threshold, making it appear uncommon to the sampler.
    • Mathematical Formula: $ \text{FP} = \text{Count of (Ground Truth Common Trace AND Sampler Flags as Uncommon)} $
    • Symbol Explanation:
      • Ground Truth Common Trace: A trace that is genuinely common based on the experimental setup.
      • Sampler Flags as Uncommon: A trace whose attention score makes the variance difference $(var_{k+1} - var_k)$ exceed the threshold $h$.
  5. Sampling Latency:

    • Conceptual Definition: This metric measures the time taken by Sieve to process a single incoming trace and make a sampling decision. It's crucial for evaluating the online sampler's efficiency and suitability for real-time deployment.
    • Mathematical Formula: Not explicitly provided, but refers to the time duration in milliseconds (ms) from receiving a trace to outputting a sampling decision.
    • Symbol Explanation: Measured in milliseconds (ms).

5.3. Baselines

The paper compares Sieve against two main baseline models:

  1. Hierarchical Clustering (HC) Method [3]:

    • Description: This method biases sampling to maximize the diversity of traces based on structural characteristics, specifically the number of spans. It groups traces into clusters and aims to sample representatives from each cluster.
    • Why Representative: It is a state-of-the-art biased sampling method that attempts to capture a wider variety of trace types than uniform sampling, particularly focusing on structural differences.
    • Limitations (addressed by Sieve): It is generally an offline method (not designed for online sampling) and primarily focuses on structural diversity, failing to account for temporal uncommonness.
  2. Random Sampling:

    • Description: A uniform sampling method where each trace has an equal probability of being selected. This is the simplest and most common baseline.
    • Why Representative: It represents the default or naive approach used by many existing distributed tracing systems (like Dapper's low-rate random sampling).
    • Limitations (addressed by Sieve): It is highly likely to miss uncommon but informative traces because these represent a tiny fraction of the total data.

6. Results & Analysis

6.1. Core Results Analysis

The experimental evaluation demonstrates Sieve's effectiveness in biased sampling, its robustness, and its efficiency compared to baselines.

6.1.1. Biased Sampling

Sieve's primary goal is to bias sampling towards uncommon traces by utilizing both temporal and structural attention.

6.1.1.1. Temporal Attention

Experiments replayed 1000 traces (990 common, 10 uncommon) with identical structures but differing latencies. The following figure (Figure 9 from the original paper) shows the relationship between the score and rarity of traces in four microservice systems (VWR, AIOps, Boutique, TC):

[Figure 9: The relationship between attention scores (top) and binary rarity labels (bottom) of traces in the four microservice systems (VWR, AIOps, Boutique, TC).]

  • VWR (Figure 9a): Uncommon traces had significantly higher attention scores (easier to isolate). Their sampling probability averaged 0.990, while common traces averaged 0.018. The sigmoid function rapidly pushes probabilities towards 1 for high-scoring uncommon traces.
  • AIOps (Figure 9b): Similar trend, with uncommon traces having an average sampling probability of 0.999 versus 0.019 for common ones. A false positive (common trace with sampling probability close to 1) was observed, attributed to a sub-optimal threshold, but its impact on overall sampling was deemed minimal.
  • Boutique (Figure 9c) and TC (Figure 9d): Both datasets showed Sieve effectively detecting all uncommon traces and raising their sampling probabilities enormously, even for subtle temporal differences (e.g., TC's small latency gap between common and uncommon).

6.1.1.2. Structural Attention

Experiments replayed 1000 traces (990 structurally common, 10 structurally uncommon) with latencies restricted to avoid temporal deviation. The following figure (Figure 10 from the original paper) shows the relationship between scores and sampling probabilities in trace data from four microservice systems (VWR, AIOps, Boutique, TC):

[Figure 10: The relationship between attention scores (top) and sampling probabilities (bottom) in the four microservice systems (VWR, AIOps, Boutique, TC).]

  • VWR (Figure 10a): Common traces averaged a sampling probability of 0.020, while uncommon ones averaged 0.996. Two false positives were found, which, upon inspection, had extremely low latencies (<10ms), making them temporally uncommon and thus scored high by Sieve.
  • AIOps (Figure 10b): No false positives were observed. Average sampling probabilities were 0.018 for common and 0.997 for uncommon.
  • Boutique (Figure 10c): All uncommon traces were detected and sampled with high probability. Averages were 0.018 for common and 0.997 for uncommon.
  • TC (Figure 10d): Even for subtle structural differences (e.g., 15 spans vs. 14 spans), Sieve successfully detected all uncommon traces. Conclusion: Sieve is highly effective at distinguishing and biasing sampling towards both temporal and structural uncommonness.

6.1.2. Sensitivity

Sieve's sensitivity was evaluated against the degree of uncommonness and its internal parameters.

6.1.2.1. Sensitivity to Degree of Uncommonness

Experiments used 1000 traces from AIOps, varying the latency range of 10 uncommon traces while keeping 990 common traces below 200ms latency. The following figure (Figure 11 from the original paper) shows Sieve's sensitivity to the degree of uncommonness:

[Figure 11: Sieve's sensitivity to the degree of uncommonness (TP and FP counts across latency intervals).]

  • Sieve detected all 10 uncommon traces (high True Positives) in most scenarios where the uncommon latency range was clearly distinct.
  • Performance was slightly worse when uncommon latency was close to common latency (e.g., 200-250ms range for uncommon, vs. <200ms for common), as the boundary was fuzzy.
  • As the distinction grew, detection rapidly improved, and the False Positive rate remained low. This suggests Sieve is robust to varying degrees of uncommonness, especially when the difference is clear.

6.1.2.2. Sensitivity to Parameters

Experiments used 1000 traces from AIOps with a mixed composition of common and uncommon traces (Table III). The following are the results from Table III of the original paper:

| Type | Spans | Latency | Amount |
| --- | --- | --- | --- |
| common | 58 | 0∼300ms | 990 |
| uncommon | 58 | >400ms | 5 |
| uncommon | 59 | 0∼300ms | 2 |
| uncommon | 7 | 0∼300ms | 3 |

The following figure (Figure 12 from the original paper) shows Sieve's sensitivity to its parameters:

[Figure 12: Sieve's sensitivity to its parameters (RRCT size, number of RRCTs, and threshold).]

  • RRCT Size (Figure 12a) and Number of RRCTs (Figure 12b): Varying RRCT size (32 to 512) and number of RRCTs (10 to 90) had little effect on Sieve's accuracy. Performance remained stable, indicating robustness to these configuration changes.
  • Threshold (Figure 12c): The threshold $h$ had a direct influence.
    • Below $h = 0.55$: False Positives (FP) decreased significantly, while True Positives (TP) remained at the maximum.
    • Between $h = 0.6$ and $h = 0.85$: TP remained stable (with a slight decrease), and FP dropped to 0.
    • Above $h = 1$: TP decreased to 7 and stabilized. Conclusion: Sieve is not overly sensitive to the threshold within a certain range. A low threshold is recommended to achieve high TP coverage at the cost of a small increase in FP.

6.1.3. Performance Comparison

Sieve was compared against random sampling and hierarchical clustering (HC) based on the proportion of uncommon traces sampled for a given sampling budget. 1000 traces with specific compositions for each dataset were used (Table IV). The following are the results from Table IV of the original paper:

| Type | VWR (Spans / Latency) | AIOps (Spans / Latency) | Boutique (Spans / Latency) | TC (Spans / Latency) | Amount |
| --- | --- | --- | --- | --- | --- |
| common | 6 / 0∼200ms | 58 / 0∼300ms | 28 / 0∼200ms | 29 / 0∼40ms | 990 |
| uncommon | 6 / >100000ms | 58 / >400ms | 28 / >500ms | 29 / 60∼90ms | 5 |
| uncommon | 5 / 0∼200ms | 59 / 0∼300ms | 18 / 0∼200ms | 30 / 0∼40ms | 5 |

The following figure (Figure 13 from the original paper) shows the proportion of uncommon traces sampled by three different methods in different trace datasets:

[Figure 13: The proportion of uncommon traces sampled by the three methods (Sieve, HC, Random) under varying budgets in the four trace datasets.]

  • Superior Performance: Sieve consistently achieved the best performance across all four datasets in terms of proportion of uncommon traces sampled.
  • Random Sampling: As expected, random sampling's proportion was roughly equal to the sampling rate (very low for uncommon traces).
  • Hierarchical Clustering (HC): HC's proportion grew slowly and generally did not exceed 60%. This is because HC primarily detects structural uncommonness, missing the 5 temporally uncommon traces in the mixed composition.
  • Sieve's Behavior: Initially, Sieve's proportion might be lower than HC at very small budgets due to its online nature (common traces occurring early might consume some budget). However, as the budget increased, Sieve's proportion converged rapidly and closely to 100%, indicating it successfully captures almost all uncommon traces.
  • TC Dataset: Sieve still showed ideal performance on the TC dataset, even though HC's effectiveness was significantly reduced, since HC does not explicitly account for either the temporal or the structural uncommonness present in this configuration. Conclusion: Sieve is significantly more effective at capturing uncommon traces than random sampling and HC, especially when both temporal and structural differences matter; a sketch of the proportion metric follows this list.
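
For concreteness, the proportion metric used in this comparison reduces to a one-line set computation; the sketch below assumes traces carry integer identifiers and measures coverage against the uncommon population.

```python
def uncommon_proportion(sampled_ids, uncommon_ids):
    """Share of the uncommon population captured by a sample (the quantity
    plotted in Figure 13); 1.0 means every uncommon trace was retained."""
    uncommon = set(uncommon_ids)
    return len(set(sampled_ids) & uncommon) / len(uncommon)

# A budget that catches 9 of the 10 planted uncommon traces scores 0.9:
assert uncommon_proportion({1, 2, 3, 4, 5, 6, 7, 8, 9, 42}, range(10)) == 0.9
```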

6.1.4. Representative Sampling

Sieve's ability to obtain a representative sample (capturing traces of different patterns) was tested on the TC dataset, which has 10 API types. The following are the results from Table V of the original paper:

| | API-1 | API-2 | API-3 | API-4 | API-5 | API-6 | API-7 | API-8 | API-9 | API-10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Population | 11 | 16 | 48 | 119 | 135 | 210 | 325 | 571 | 607 | 4519 |
| Sieve | 11 | 16 | 41.2 | 64 | 51 | 31.2 | 9.4 | 15.2 | 10.8 | 65.4 |
| HC | 5.9 | 16 | 5.44 | 17.22 | 19.44 | 32.64 | 42.24 | 45.92 | 42.84 | 87.56 |
| Random | 0.42 | 0.68 | 2.3 | 5.38 | 6.34 | 10.78 | 16.12 | 27.28 | 29.7 | 217 |

  • Comparison on TC (Table V): Sieve preserved all traces of API-1 (11/11) and API-2 (16/16), the two rarest types in the population, and sampled a reasonable number from the other API types.
  • HC sampled more uniformly across groups, but Sieve's strength lies in preserving uncommon patterns, which likely include minority API types.
  • Random sampling captures very few from minority groups. Conclusion: Sieve is more effective at preserving uncommon traces than HC, especially for minority groups, making it suitable for representative sampling focused on rare patterns.

6.1.5. Storage Saving

The sampling rates of Sieve were recorded for different datasets. The following are the results from Table VI of the original paper:

| Dataset | Population | Sample | Sampling Rate |
|---|---|---|---|
| VWR | 34167 | 852 | 2.5% |
| AIOps | 168432 | 7602 | 4.5% |
| Boutique | 2000 | 114 | 5.9% |

  • The sampling rates observed for Sieve were low: 2.5% for VWR, 4.5% for AIOps, and 5.9% for Boutique.
  • This translates to a reduction of stored traces of up to 97.5% (for VWR). Conclusion: Sieve enables considerable storage savings by effectively discarding the large majority of common traces.

6.1.6. Overhead

The sampling latency of Sieve was measured. The number of RRCTs was set to 20 for this experiment, showing comparable performance to 50 RRCTs. The following figure (Figure 14 from the original paper) shows the cumulative distribution of sampling latency:

Fig. 6. Illustration of the removal of the oldest leaf. The leaf and its parent node are deleted, and the bounding boxes along the affected path are updated.

  • Latency Breakdown: The sampling latency comprises time for path vector encoding, attention score calculation, and sampling probability calculation.
  • Efficiency: The latency varied between 3ms and 33ms.
  • Even for production traces with hundreds of spans (like TC), Sieve proved to be efficient. Conclusion: Sieve exhibits low overhead, making it practical for real-time online sampling in demanding production environments.

6.2. Data Presentation (Tables)

All relevant tables (Table I, II, III, IV, V, VI) have been transcribed completely within the Results & Analysis section under their respective subsections.

6.3. Ablation Studies / Parameter Analysis

The paper conducted a sensitivity analysis (Section V-B) which serves as a form of parameter analysis, though not a strict ablation study (which typically removes components).

  • RRCT Size (Figure 12a) and Number of RRCTs (Figure 12b): These parameters showed that Sieve's accuracy (TP and FP counts) was largely stable across a wide range of values. This suggests that Sieve is robust to the exact choice of RRCF configuration, making it easier to deploy without extensive tuning. For example, using 20 RRCTs instead of 50 yielded comparable performance, indicating flexibility.
  • Threshold (Figure 12c): This parameter directly controls the aggressiveness of biased sampling. A lower threshold tends to increase True Positives (capturing more uncommon traces) but might also lead to a slight increase in False Positives (sampling some common traces incorrectly). Conversely, a higher threshold reduces False Positives but might miss some uncommon traces. The findings suggest a sweet spot where TP is maximized and FP is minimized or acceptably low, encouraging operators to set a low threshold for higher coverage of uncommon traces.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper effectively addresses the critical challenge of managing vast volumes of trace data in distributed microservice systems by introducing Sieve, an innovative attention-based online sampler. Sieve's core contribution lies in its ability to explicitly leverage both temporal and structural differences to identify and prioritize uncommon traces for sampling. By encoding traces into path vectors, adapting Robust Random Cut Forest (RRCF) for variable dimensions and streaming data, and using attention scores to inform a biased sampling scheme, Sieve significantly enhances the selection of informative traces while drastically reducing storage costs. Experimental evaluations across diverse datasets confirm Sieve's high effectiveness in capturing uncommon traces, its robustness to varying conditions and parameter settings, and its low computational overhead, making it a practical solution for real-time monitoring in large-scale microservice environments.

7.2. Limitations & Future Work

The paper doesn't explicitly dedicate a section to limitations and future work, but some can be inferred:

  • False Positives in Temporal Attention: While generally robust, the paper noted false positives in temporal attention experiments (e.g., in VWR and AIOps datasets) due to sub-optimal threshold settings or common traces exhibiting extreme (but not anomalous) latency values. This suggests Sieve might occasionally misclassify common traces with unusual but non-anomalous characteristics.

  • Curse of Dimensionality: The paper explicitly mentions the curse of dimensionality caused by the ever-growing number of execution paths and the need for dimension reduction techniques. While Sieve implements solutions like removing invalid and weak dimensions, the long-term scalability and potential for information loss during aggressive dimension reduction could be a concern for extremely dynamic systems with constantly evolving paths.

  • Feature Enrichment: The paper hints at the possibility of extending path vectors with more meaningful features like number of spans or request status codes. This implies that Sieve's current feature set, while effective, could be further enhanced to capture more nuanced uncommonness.

  • Threshold Tuning: Although Sieve is robust to threshold changes within a range, finding the optimal threshold still involves some tuning, especially considering its impact on the trade-off between True Positives and False Positives. Dynamic, adaptive threshold adjustment could be a future improvement.

    Potential future research directions implied or directly stated:

  • Further exploration of feature engineering for path vectors to include more context-rich information from traces (e.g., specific tags, resource utilization, error messages) to improve uncommonness detection.

  • Developing adaptive threshold adjustment mechanisms to automatically optimize the balance between sampling rate and uncommon trace coverage without manual parameter tuning.

  • Investigating more advanced dimension reduction techniques that are loss-less or minimize information loss while maintaining efficiency, especially for very high-dimensional trace spaces.

  • Exploring the integration of Sieve with other anomaly detection or root cause analysis systems, leveraging its biased sampling to provide higher-quality inputs for downstream analysis.

7.3. Personal Insights & Critique

Sieve presents a compelling and practically relevant solution for a pervasive problem in microservice observability. The explicit combination of temporal and structural attention is a key strength, moving beyond the limitations of previous biased sampling methods. The adaptation of Robust Random Cut Forest for streaming, variable-dimensional trace data is a clever technical innovation that makes the approach feasible for dynamic microservice environments.

The emphasis on online sampling with low overhead is crucial. In real-world microservice systems, traces arrive continuously and in high volume, so any effective sampling solution must be able to process them in real-time without becoming a bottleneck. Sieve demonstrates this capability, which is a significant advantage for practical deployment.

One area for critical reflection could be the choice of -1 as an indicator for an invalid index (missing path). While effective for distinguishing presence/absence, it introduces a distinct value that RRCF would naturally separate. This is leveraged for structural distinction, but it also creates a fixed boundary. Exploring alternative ways to handle missing values that might be more robust to varying semantic interpretations of "missing" could be interesting. For instance, sometimes a missing path might be normal for a specific request type, while other times it's anomalous. The current mechanism relies on the RRCF to learn this distinction from data distribution.

The sigmoid function for sampling probability when variance exceeds the threshold is a good way to aggressively sample perceived anomalies. However, the linear proportionality for scores below the threshold might still result in sampling more common traces than strictly necessary if there's a continuous distribution of "mildly uncommon" traces that don't cross the threshold. This is a common trade-off in biased sampling – balancing the desire for comprehensive anomaly capture with aggressive data reduction.
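
One plausible shape for this two-regime mapping is sketched below; the linear ramp, the `base_rate` parameter, and the exact sigmoid placement are illustrative assumptions, not the paper's actual formula.

```python
import math

def sampling_probability(score, threshold, base_rate=0.05):
    """Two-regime mapping from attention score to sampling probability:
    a linear ramp (capped at base_rate) below the threshold, and a
    sigmoid that rises toward 1 once the score exceeds it."""
    if score <= threshold:
        # Mildly uncommon traces: probability proportional to the score.
        return base_rate * (score / threshold)
    # Perceived anomalies: jump to >= 0.5 and saturate toward 1.
    return 1.0 / (1.0 + math.exp(-(score - threshold)))

# With threshold 0.55: a common trace (score 0.3) gets ~0.027,
# while a clearly uncommon one (score 2.0) gets ~0.81.
```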

Overall, Sieve is a well-designed and evaluated system that offers a robust and practical approach to intelligent trace sampling, directly addressing a bottleneck in distributed system monitoring. Its methodologies could potentially be transferred to other domains dealing with streaming structured data where outlier detection for resource optimization is critical.
