
Beyond Interpretability: Exploring the Comprehensibility of Adaptive Video Streaming through Large Language Models


TL;DR Summary

The paper introduces `ComTree`, a bitrate adaptation framework aimed at enhancing comprehensibility in adaptive video streaming. By generating performance-compliant decision trees and leveraging large language models for evaluation, it significantly improves comprehensibility while maintaining competitive performance.

Abstract

Over the past decade, adaptive video streaming technology has witnessed significant advancements, particularly driven by the rapid evolution of deep learning techniques. However, the black-box nature of deep learning algorithms presents challenges for developers in understanding decision-making processes and optimizing for specific application scenarios. Although existing research has enhanced algorithm interpretability through decision tree conversion, interpretability does not directly equate to developers' subjective comprehensibility. To address this challenge, we introduce ComTree, the first bitrate adaptation algorithm generation framework that considers comprehensibility. The framework initially generates the complete set of decision trees that meet performance requirements, then leverages large language models to evaluate these trees for developer comprehensibility, ultimately selecting solutions that best facilitate human understanding and enhancement. Experimental results demonstrate that ComTree significantly improves comprehensibility while maintaining competitive performance, showing potential for further advancement. The source code is available at https://github.com/thu-media/ComTree.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Beyond Interpretability: Exploring the Comprehensibility of Adaptive Video Streaming through Large Language Models

1.2. Authors

  • Lianchen Jia (Tsinghua University)

  • Chaoyang Li (Tsinghua University)

  • Ziqi Yuan (Tsinghua University)

  • Jiahui Chen (Tsinghua University)

  • Tianchi Huang (Tsinghua University)

  • Jiangchuan Liu (Simon Fraser University)

  • Lifeng Sun (Tsinghua University)

    The authors are primarily affiliated with Tsinghua University, a prominent research institution in China, with one author from Simon Fraser University in Canada. Their research backgrounds appear to be in computer science, specifically in multimedia, networking, and potentially machine learning, given the paper's focus on adaptive video streaming, deep learning, and large language models.

1.3. Journal/Conference

The paper is published in the Proceedings of the 33rd ACM International Conference on Multimedia (MM '25), October 27-31, 2025, Dublin, Ireland. The ACM International Conference on Multimedia (ACM MM) is a highly reputable and influential conference in the field of multimedia research, known for publishing cutting-edge work across various aspects of multimedia computing, including streaming, processing, and understanding. Its selective nature indicates the high quality and significance of papers accepted.

1.4. Publication Year

2025

1.5. Abstract

The abstract introduces the evolution of adaptive video streaming, largely propelled by deep learning. It highlights a key challenge: the black-box nature of deep learning algorithms hinders developers' understanding of decision-making and optimization. While existing research has improved algorithm interpretability (e.g., through decision tree conversion), the paper argues that interpretability does not directly translate to developers' subjective comprehensibility. To address this, the authors propose ComTree, the first framework for bitrate adaptation algorithm generation that explicitly considers comprehensibility. ComTree operates by first generating a complete set of decision trees that meet performance requirements (a Rashomon set), and then uses Large Language Models (LLMs) to evaluate these trees for developer comprehensibility. The framework ultimately selects solutions that are easier for humans to understand and enhance. Experimental results indicate that ComTree significantly improves comprehensibility while maintaining competitive performance, suggesting its potential for further development. Source code is made available.

This is a preprint published on arXiv.
Abstract page: https://arxiv.org/abs/2508.16448
PDF: https://arxiv.org/pdf/2508.16448v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the lack of developer comprehensibility in advanced adaptive video streaming (ABR) algorithms, particularly those leveraging deep learning. While deep learning has significantly improved ABR performance, its black-box nature makes it difficult for developers to understand how decisions are made, why certain strategies are adopted, and critically, how to optimize these algorithms for specific, evolving application scenarios (e.g., new network conditions, different user preferences).

This problem is important because video streaming is a dominant consumer of internet traffic, and ensuring a high Quality of Experience (QoE) is crucial. As ABR algorithms become more complex, especially with deep learning, developers lose the ability to debug, maintain, and adapt them effectively. Previous research focused on interpretability (making decisions transparent, often by converting black-box models into decision trees), but the paper argues that interpretability alone is insufficient. Complex decision trees, even if technically interpretable step-by-step, can still be hard for developers to comprehend and modify.

The paper's entry point or innovative idea is to introduce comprehensibility as a novel optimization objective for ABR algorithm generation. It proposes a framework, ComTree, that moves "beyond interpretability" by specifically seeking algorithms that are not just transparent but also intuitively understandable and amenable to human enhancement. It leverages the Rashomon set concept to find multiple good-performing models and Large Language Models (LLMs) to evaluate their human comprehensibility, a subjective metric traditionally hard to quantify.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

  • Revealing Limitations and Importance of Comprehensibility: It empirically validates the limitations of current interpretable ABR methods in terms of developer comprehensibility and highlights the critical importance of improving this aspect for practical deployment and optimization. This is demonstrated through experiments showing the difficulty of modifying a complex existing decision tree (Pitree) for new network conditions.
  • Introducing ComTree Framework: It proposes ComTree, the first framework that integrates comprehensibility into the generation process of adaptive video streaming algorithms. This framework uses a Rashomon set approach to identify multiple high-performing decision trees and then employs LLMs to assess and select the most developer-friendly ABR algorithms from this set.
  • Demonstrating Competitive Performance and Comprehensibility: Through extensive experiments, ComTree is shown to generate algorithms that achieve competitive performance (often matching or exceeding existing baselines) while significantly improving comprehensibility. Furthermore, the framework demonstrates that comprehensibility correlates with an algorithm's potential for further optimization and improvement by engineers (even LLM-simulated engineers).

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Adaptive Bitrate (ABR) Streaming: This is a technology used for streaming multimedia content (like video) over the internet. Instead of sending video at a fixed quality, ABR algorithms dynamically adjust the video quality (bitrate) based on the user's current network conditions (bandwidth, latency) and device capabilities. The goal is to maximize the user's Quality of Experience (QoE) by delivering the highest possible quality without interruptions (rebuffering) or drastic quality changes. ABR algorithms typically monitor factors like available bandwidth, buffer occupancy, and historical data to make decisions about the next video segment's quality level.

  • Quality of Experience (QoE): QoE is a subjective measure of a user's overall satisfaction with a service. In video streaming, it encompasses several factors beyond just technical performance, including:

    • Video Quality: Higher resolution and bitrate generally lead to better visual quality.
    • Playback Smoothness: Absence of pauses, freezes, or rebuffering events.
    • Quality Stability: Minimizing abrupt changes in video quality, which can be jarring to the viewer.
    • Startup Delay: The time it takes for video playback to begin. The paper uses specific QoE metrics, which will be detailed in the experimental setup.
  • Deep Learning (DL): A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from data. DL models have achieved state-of-the-art results in various domains, including image recognition, natural language processing, and control tasks. However, these models are often referred to as black-box models because their internal workings and decision-making processes are not easily understandable by humans, making it hard to debug or explain their behavior.

  • Interpretability vs. Comprehensibility: These two terms are central to the paper's thesis and are carefully distinguished:

    • Interpretability: Refers to making the decision-making process of an algorithm transparent. It answers the question: "How did the model decide?" For example, converting a complex black-box model into a decision tree makes each decision step traceable and thus interpretable.

    • Comprehensibility: Goes beyond interpretability. It focuses on whether a human (specifically a developer in this context) can understand the overall design logic of an algorithm, not just individual decisions. It answers: "Why was the model designed this way?" and "Can I easily modify and optimize it?" Comprehensibility is a higher-level, subjective goal related to a developer's ability to grasp, maintain, and improve the algorithm. An algorithm can be interpretable (you can trace its path) but still lack comprehensibility if its logic is overly complex, fragmented, or counter-intuitive.

      The following are the results from Table 1 of the original paper:

      Feature | Focus | Hierarchy | Example
      Interpretability | Makes decision processes transparent. Addresses: How did the model decide? | Foundation for comprehensibility | Black-box model transformed into a traceable decision tree
      Comprehensibility | Focuses on understanding overall design logic. Addresses: Why was the model designed this way? | High-level goal related to developers' subjective understanding | Algorithms with high comprehensibility have greater potential for improvement after adjustment
  • Decision Trees: A type of supervised learning algorithm that builds a model in the form of a tree structure. Each internal node represents a "test" on an attribute (e.g., "buffer size > 2 seconds?"), each branch represents the outcome of the test, and each leaf node represents a class label or a decision (e.g., "choose bitrate X"). Decision trees are inherently interpretable because their logic can be easily visualized and followed as a series of if-then-else rules. They are often used to explain black-box models by approximating their behavior.

  • Large Language Models (LLMs): A class of deep learning models trained on vast amounts of text data to understand, generate, and process human language. LLMs can perform a wide range of tasks, including translation, summarization, question answering, and even complex reasoning. The paper leverages their ability to simulate human judgment and understanding, particularly in subjective tasks like evaluating comprehensibility, due to their learned world knowledge and human preference alignment.

  • Rashomon Set: In machine learning, a Rashomon set refers to a collection of multiple models that perform nearly identically well on a given task, despite having potentially different internal structures or decision-making processes. The concept is named after Akira Kurosawa's film "Rashomon," where different characters provide equally plausible but conflicting accounts of the same event. In this paper's context, the Rashomon set consists of many decision trees that achieve similar QoE performance for ABR, allowing the selection of one that is also highly comprehensible.
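
To make these two concepts concrete, below is a toy sketch (not taken from the paper; feature names and thresholds are illustrative) of two structurally different if-then ABR policies that make identical decisions: a miniature Rashomon pair, expressed in the Python-code form in which ComTree's decision trees are ultimately presented:

def policy_a(buffer_s, tput_mbps):
    # Branch on buffer occupancy first, then throughput
    if buffer_s > 2.0:
        return 4300 if tput_mbps > 2.85 else 1850  # kbps
    return 750 if tput_mbps > 0.75 else 300

def policy_b(buffer_s, tput_mbps):
    # Branch on throughput first, then buffer occupancy
    if tput_mbps > 2.85:
        return 4300 if buffer_s > 2.0 else 750
    if tput_mbps > 0.75:
        return 1850 if buffer_s > 2.0 else 750
    return 1850 if buffer_s > 2.0 else 300

Both policies induce the same decision in every input region, yet a developer may find policy_a easier to grasp because it mirrors the intuition "check buffer safety first." Choosing among equally performing alternatives like these is exactly the subjective judgment that ComTree delegates to LLMs.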

3.2. Previous Works

The paper categorizes previous ABR algorithms based on their approach to performance and comprehensibility:

  • Traditional Heuristic Algorithms (e.g., BBA, RobustMPC):

    • Description: These algorithms use predefined rules and mathematical models to make bitrate adaptation decisions. Examples include Buffer-Based Approach (BBA) [19], which primarily uses buffer occupancy, and RobustMPC [59], which uses model predictive control.
    • Characteristics: They have high comprehensibility because their rules are explicit and human-designed. However, their rigid rules make them less adaptable to the highly dynamic and unpredictable nature of real-world network conditions, often resulting in suboptimal QoE performance in complex environments.
    • Formula Example (General QoE Metric - commonly used in ABR papers, though specific coefficients vary): The paper uses the following QoE metric, a common form in ABR research:

      QoE = Σ_n q(R_n) - μ Σ_n T_n - Σ_n |q(R_{n+1}) - q(R_n)|

      Where:

      • n: index of the video segment.
      • q(R_n): the video quality (utility) obtained when the n-th video segment is played at bitrate level R_n. This function maps a bitrate to a quality score.
      • Σ_n q(R_n): the total accumulated quality over all played segments.
      • μ: a penalty coefficient for rebuffering.
      • T_n: the duration of the rebuffering event that occurs before the n-th segment plays.
      • μ Σ_n T_n: the total penalty due to rebuffering events.
      • |q(R_{n+1}) - q(R_n)|: the absolute quality difference between consecutive segments, penalizing frequent or large quality changes.
      • Σ_n |q(R_{n+1}) - q(R_n)|: the total penalty for quality variations, ensuring smoothness. Overall, this formula maximizes perceived video quality while minimizing annoying rebuffering and disruptive quality switches.
  • Black-Box Methods (e.g., Pensieve, MERINA, Genet, NetLLM):

    • Description: These algorithms often employ deep reinforcement learning (DRL) or other advanced machine learning techniques (like LLMs for networking) to learn optimal bitrate adaptation policies directly from data or through interaction with simulated environments. Examples include Pensieve [32] (DRL-based), MERINA [26] (meta-reinforcement learning), Genet [56] (curriculum learning), and NetLLM [55] (leveraging LLMs).
    • Characteristics: They typically achieve state-of-the-art QoE performance by automatically adapting to complex and dynamic network conditions. However, their complex internal structures (e.g., deep neural networks) result in very low comprehensibility. Developers find it extremely challenging to understand their internal decision logic, debug issues in unforeseen scenarios (bad cases), or make targeted improvements.
  • Interpretability-Focused ABR (e.g., Pitree):

    • Description: These methods aim to bridge the gap between high-performing black-box models and human understanding. They often involve post-hoc analysis or model distillation techniques to convert complex black-box models (like DRL agents) into white-box models, such as decision trees. Pitree [36] is a notable example that transforms black-box ABR models into decision trees for deployment.

    • Characteristics: These algorithms improve interpretability by making each decision step traceable. However, as the paper demonstrates with Pitree (Figure 1), the resulting decision trees can still be highly complex (deep, many nodes, many features), making them difficult for developers to comprehend and effectively modify for new scenarios.

      The following figure (Figure 1 from the original paper) shows an example of a complex decision tree generated by Pitree:

      Figure 1: Decision Tree Generated by Pitree (some details omitted)

  • Hybrid Methods (e.g., Oboe):

    • Description: These methods attempt to balance performance and comprehensibility by using parameterized rule tables or simpler, structured models. Oboe [4] is an example that uses auto-tuning for ABR algorithms by adapting parameters to network conditions.
    • Characteristics: They offer a middle ground, often providing better comprehensibility than black-box models and better performance than purely heuristic approaches. However, their structured nature might still limit their ultimate performance potential compared to fully adaptive black-box models, and their comprehensibility might not be as high as purely simple heuristic rules.

3.3. Technological Evolution

The evolution of ABR algorithms has progressed through several stages:

  1. Early Heuristics (e.g., BBA, MPC): Focused on simple, rule-based logic primarily using buffer status or simplified network models. These were highly comprehensible but limited in adaptability.

  2. Model-Based Approaches (e.g., RobustMPC): Incorporated more sophisticated control theory or predictive models to make decisions, improving performance over simple heuristics but still relying on explicit models.

  3. Deep Reinforcement Learning (DRL) (e.g., Pensieve): Leveraged the power of deep learning to learn complex policies directly from interaction with environments, leading to significant QoE improvements. This introduced the black-box problem.

  4. Interpretability-focused Methods (e.g., Pitree): Addressed the black-box issue by converting DRL policies into interpretable forms like decision trees, aiming for transparency.

  5. Comprehensibility-focused Methods (ComTree): This paper marks a new stage by specifically targeting comprehensibility beyond mere interpretability. It recognizes that even interpretable models can be too complex for human understanding and optimization, proposing methods to actively select for human-friendly structures.

    This paper's work (ComTree) fits within the latest stage, aiming to provide ABR algorithms that are not only performant and interpretable but also genuinely comprehensible and amenable to developer-driven enhancements.

3.4. Differentiation Analysis

Compared to the main methods in related work, ComTree offers several core differences and innovations:

  • Beyond Interpretability to Comprehensibility: While Pitree also uses decision trees for interpretability, ComTree explicitly defines and optimizes for comprehensibility, a subjective metric reflecting a developer's ability to understand and modify the algorithm. This is a fundamental shift from merely explaining how a decision is made to understanding why the overall algorithm was designed that way and how to improve it.
  • Leveraging Rashomon Sets: ComTree does not just find an optimal decision tree but explores the Rashomon set – a collection of nearly equally performing decision trees. This allows for a choice based on comprehensibility among high-performing alternatives, a capability not present in methods that yield a single "best" model.
  • Utilizing Large Language Models (LLMs) for Subjective Evaluation: ComTree innovatively employs LLMs to evaluate the subjective comprehensibility of decision trees. This is a novel application of LLMs in ABR, simulating human judgment in a way that traditional objective metrics cannot. Previous methods rely on structural metrics (like tree depth or node count) that may not fully capture human understanding.
  • Enhanced Generative Process: Unlike Pitree which uses greedy algorithms to generate decision trees, ComTree employs a dynamic programming algorithm (TreeFarms) with optimality guarantees, allowing it to generate a more diverse and globally optimized Rashomon set of trees within specified performance bounds. It also incorporates a feature processing step to handle high-dimensional ABR state spaces, making optimal tree generation feasible.
  • Focus on Developer Potential: The ultimate goal of ComTree is to generate algorithms that are easier for developers to modify and optimize in new or unforeseen bad cases. This directly addresses a critical practical challenge in production environments that black-box models and even complex interpretable models struggle with.

4. Methodology

4.1. Principles

The core idea behind ComTree is to overcome the limitations of current ABR algorithms (especially deep learning-based ones) that are highly performant but lack comprehensibility, hindering developer understanding and optimization. ComTree addresses this by embracing comprehensibility as a key design objective. Its principle is to:

  1. Generate a diverse set of high-performing ABR policies: Instead of seeking a single optimal policy, ComTree aims to find a Rashomon set – a collection of many decision trees that achieve nearly the same high QoE performance. This set provides options.
  2. Evaluate comprehensibility using LLMs: Acknowledge that comprehensibility is a subjective human judgment. ComTree employs Large Language Models (LLMs) to simulate developer assessment, choosing the most understandable and modifiable decision tree from the Rashomon set.
  3. Ensure practical feasibility: The framework incorporates techniques like feature processing to make the generation of complex Rashomon sets computationally tractable in the high-dimensional ABR state space.

4.2. Core Methodology In-depth

The ComTree framework comprises two main modules: the Rashomon Set Construction Module and the LLM Assessment Module.

4.2.1. Rashomon Set Construction Module

This module is responsible for generating a collection of high-performing, yet structurally diverse, decision trees that meet specific accuracy requirements.

4.2.1.1. Feature Processing

The first step in constructing the Rashomon set is to address the high dimensionality of ABR state spaces, which makes optimal decision tree generation computationally intractable. ABR states can have many features, and when binarized for decision tree algorithms, the number of potential splits becomes enormous. ComTree tackles this through a feature processing step that reduces the number of relevant features using column elimination based on black-box model knowledge.

The process is as follows:

  1. Initial Ensemble Tree Generation: An XGBoost model (a parallelized gradient boosting decision tree algorithm) is trained on the original dataset (S, A) to serve as a reference black-box algorithm. This provides an initial accuracy (acc_ori) and feature importances (importance).

  2. Iterative Feature Elimination:

    • Features with importance less than a predefined threshold δ are filtered out.
    • The feature with the lowest importance among the remaining features is identified (c_min).
    • This least important feature is temporarily removed from the dataset.
    • A new XGBoost ensemble tree is trained on the reduced dataset, and its accuracy (acc_new) is compared to the acc_ori and a reference accuracy (acc_rec).
    • If acc_new is not lower than the minimum of acc_ori and acc_rec, it means the feature can be eliminated without significantly impacting accuracy. The elimination is kept, and the process continues.
    • If acc_new drops too much, the last eliminated feature (c_min) is reverted, as its removal negatively affected accuracy.
  3. Output: This iterative process significantly reduces the number of encoded features, making dynamic programming for decision tree generation feasible.

    The Feature Processing Algorithm is detailed below:

Algorithm 2: Feature Processing Algorithm

Input: state-action dataset (S, A), importance threshold δ, reference accuracy acc_rec
Output: new dataset (S_guess, A)

1  // Obtain initial accuracy and column importance
2  acc_ori, importance := Xgboost(S, A)
3  S_guess = S.copy, acc_new = acc_ori //init dataset
4  // Column elimination while preserving accuracy
5  while acc_new >= min(acc_ori, acc_rec) and len(S_guess.column) > 1 do
6    // Column elimination based on importance threshold
7    S_guess = filter_importance(S_guess, importance, δ)
8    // Column elimination based on least important columns
9    c_min = min(importance)
10   S_guess = S_guess.delete(c_min)
11   acc_new, importance = Xgboost(S_guess, A)
12 // Revert column elimination that degrades accuracy
13 if c_min != empty then
14   S_guess = S_guess.add(c_min)
15 return (S_guess, A)

Explanation of Algorithm 2:

  • Line 1: Comment indicating the purpose of the following line: to get initial accuracy and feature importances.
  • Line 2: acc_ori, importance := Xgboost(S, A): An XGBoost model is trained on the initial state-action dataset (S, A). This yields the baseline accuracy acc_ori and a measure of importance for each feature (column) in S.
  • Line 3: S_guess = S.copy, acc_new = acc_ori: Initializes S_guess as a copy of the original state features S (this is the dataset from which columns will be eliminated) and sets acc_new to the initial accuracy.
  • Line 4: Comment indicating the start of the column elimination loop.
  • Line 5: while acc_new >= min(acc_ori, acc_rec) and len(S_guess.column) > 1 do: This loop continues as long as the new accuracy (acc_new) remains above a certain threshold (either the original accuracy or a predefined reference accuracy) and there is more than one column left in S_guess.
  • Line 6: Comment for filtering by importance threshold.
  • Line 7: S_guess = filter_importance(S_guess, importance, δ): Filters out features from S_guess whose importance (from the previous XGBoost run) is below the importance threshold δ.
  • Line 8: Comment for eliminating the least important column.
  • Line 9: c_min = min(importance): Identifies the feature with the minimum importance among the currently considered features.
  • Line 10: S_guess = S_guess.delete(c_min): Removes the feature c_min from the S_guess dataset.
  • Line 11: acc_new, importance = Xgboost(S_guess, A): Retrains XGBoost on the modified S_guess (with c_min removed) and A, obtaining the new accuracy acc_new and updated feature importance values.
  • Line 12: Comment for reverting if accuracy degrades.
  • Line 13: if c_min != empty then: Checks whether a column was actually removed in the current iteration.
  • Line 14: S_guess = S_guess.add(c_min): If the acc_new condition in the while loop (Line 5) was not met, meaning removing c_min degraded accuracy too much, then c_min is added back to S_guess to revert the change.
  • Line 15: return (S_guess, A): Returns the final feature-processed dataset (S_guess, A).
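
Below is a minimal runnable Python sketch of this elimination loop, assuming the xgboost and scikit-learn packages; the helper names (xgboost_acc_importance, feature_processing) and the validation-split details are illustrative, not from the paper:

import xgboost as xgb
from sklearn.model_selection import train_test_split

def xgboost_acc_importance(X, y, feature_names):
    """Train an XGBoost classifier; return validation accuracy and per-feature importance."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
    model = xgb.XGBClassifier(n_estimators=100, max_depth=6, verbosity=0)
    model.fit(X_tr, y_tr)
    return model.score(X_va, y_va), dict(zip(feature_names, model.feature_importances_))

def feature_processing(X, y, feature_names, delta=0.01, acc_rec=0.9):
    """Iteratively drop low-importance columns while accuracy stays acceptable (Algorithm 2)."""
    acc_ori, importance = xgboost_acc_importance(X, y, feature_names)
    cols, acc_new, c_min = list(feature_names), acc_ori, None
    while acc_new >= min(acc_ori, acc_rec) and len(cols) > 1:
        # Column elimination based on the importance threshold delta
        cols = [c for c in cols if importance.get(c, 0.0) >= delta]
        if len(cols) <= 1:
            break
        # Column elimination of the single least important remaining column
        c_min = min(cols, key=lambda c: importance.get(c, 0.0))
        cols.remove(c_min)
        idx = [feature_names.index(c) for c in cols]
        acc_new, importance = xgboost_acc_importance(X[:, idx], y, cols)
    # Revert the last elimination if it degraded accuracy below the bound
    if c_min is not None and acc_new < min(acc_ori, acc_rec):
        cols.append(c_min)
    idx = [feature_names.index(c) for c in cols]
    return X[:, idx], cols

Here X is assumed to be a NumPy array of binarized ABR state features and y the corresponding teacher actions.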

4.2.1.2. Generate the Rashomon Set

After feature processing reduces the state space, ComTree generates the Rashomon set. This involves two sub-steps: first, collecting a dataset that reflects realistic ABR playback probabilities using a teacher-student learning framework, and second, using a dynamic programming algorithm (TreeFarms) to find all decision trees within a specified performance error range.

a. Teacher-Student Learning Framework:

  • Purpose: To create a training dataset (S, A) where S represents states encountered during actual video playback and A represents optimal actions (bitrate choices) for those states. This avoids the cascading errors that arise when a student model is trained only on data it generated itself, which leads to unrealistic state distributions.
  • Process:
    1. An initial black-box teacher network (e.g., Pensieve) is rolled out in a simulated environment to generate an initial dataset of state-action pairs.
    2. An initial student decision tree is trained using this dataset.
    3. The student decision tree is then deployed in the simulated video playback environment and interacts with a virtual player under various network conditions. This generates the student network's execution states (s_s) and actions (a_s).
    4. For each student state s_s, the black-box teacher network is queried to generate the teacher's optimal action (a_t) for that state.
    5. The pair (s_s, a_t) is then added to the training dataset. This process "corrects" the student's actions with the teacher's optimal actions, ensuring the dataset reflects optimal behavior in realistically encountered states. This iterative refinement helps the student learn a policy that mimics the teacher's high performance.
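
This loop is essentially a DAgger-style imitation-learning procedure. A minimal sketch, assuming hypothetical helpers simulate_playback(policy) (runs a policy in the virtual player and returns the visited states) and teacher.predict(state):

def build_dataset(teacher, train_tree, simulate_playback, n_iters=5):
    # Seed the dataset with states visited by the teacher itself
    dataset = [(s, teacher.predict(s)) for s in simulate_playback(teacher)]
    for _ in range(n_iters):
        # Fit a student decision tree on the current dataset
        student = train_tree(dataset)
        # Roll out the *student* to collect the states it actually visits
        student_states = simulate_playback(student)
        # Relabel those states with the teacher's (near-optimal) actions
        dataset += [(s, teacher.predict(s)) for s in student_states]
    return dataset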

b. Rashomon Set Generation using TreeFarms:

  • TreeFarms: This is a dynamic programming algorithm (based on GOSDT [30]) designed to find all sparse decision trees that are nearly optimal. Unlike greedy approaches that stop at the first good solution, TreeFarms explores the solution space more thoroughly.

  • GOSDT Objective: GOSDT minimizes a combination of misclassification loss and a sparsity penalty on the number of leaves:

    obj = loss_mis + λ · H_t

    Where:

    • obj: the objective value to be minimized.
    • loss_mis: the misclassification loss (or QoE loss in this context) of decision tree t, measuring how well the tree predicts the correct actions.
    • λ: a regularization parameter controlling the trade-off between minimizing loss and sparsity. A higher λ encourages simpler trees with fewer leaves.
    • H_t: the number of leaves in decision tree t. This term penalizes complexity, promoting sparse (simpler) trees.
  • Rashomon Set Bound (θ_ε): TreeFarms does not just find the single optimal tree (with objective obj_opt). It defines a bound that admits every tree whose objective value is within a small error margin ε of the optimal:

    θ_ε = obj_opt · (1 + ε)

    Where:

    • θ_ε: the maximum allowed objective value for a tree to be included in the Rashomon set.
    • obj_opt: the objective value of the single optimal decision tree found by GOSDT.
    • ε: a small positive value (e.g., 0.05) defining the acceptable performance degradation from the optimal. For example, with obj_opt = 0.10 and ε = 0.05, θ_ε = 0.105, so every tree with an obj value of at most 0.105 belongs to the Rashomon set.
  • Dynamic Programming with Pruning: TreeFarms keeps track of upper and lower bounds for the objective of subproblems. It only removes subproblems whose lower bound is greater than θϵ\theta_\epsilon, effectively pruning branches that cannot lead to a tree within the Rashomon set. This allows it to efficiently store and return all models within the defined error range.

    The Rashomon Set Construction Algorithm is detailed below:

Algorithm 3: Rashomon Set Construction Algorithm

Input: ABR Algorithm π*, maximum iteration number M, regularization parameter λ, Rashomon set bounds parameter ε, max depth d
Output: Rashomon set R_set

1 // Construct a dataset that conforms to real occurrence probabilities using the teacher-student learning framework
2 (S, A) = VirtualPlay(π*) //init dataset
3 for i ∈ [1, ..., M] do
4   // Reduce feature number via column elimination
5   (S_guess, A) = Feature_Processing(S, A)
6   // Train decision tree via the reduced dataset
7   π_i = TrainDT(S_guess, A, d)
8   // Update dataset through the teacher-student learning
9   (S_s, A_s) = VirtualPlay(π_i)
10  A_t = Predict(π*, S_s)
11  (S, A) = (S, A) ∪ (S_s, A_t)
12 // Generate Rashomon set via TreeFarms
13 (S_guess, A) = Feature_Processing(S, A)
14 // Get the best decision tree via GOSDT algorithm
15 obj_opt = GOSDT(S_guess, A, λ, d)
16 // Set the threshold of the Rashomon set
17 θ_ε = obj_opt * (1 + ε)
18 // Generate Rashomon set via TreeFarms
19 R_set = TreeFarms(S_guess, A, λ, θ_ε, d)
20 return R_set

Explanation of Algorithm 3:

  • Line 1: Comment indicating the first major phase: constructing the dataset for teacher-student learning.
  • Line 2: (S, A) = VirtualPlay(π*): Initializes the dataset by performing a virtual playback with the teacher ABR algorithm π* to collect initial state-action pairs (S, A).
  • Line 3: for i ∈ [1, ..., M] do: Loops for M iterations of the teacher-student learning process.
  • Line 4: Comment for feature processing.
  • Line 5: (S_guess, A) = Feature_Processing(S, A): Applies Algorithm 2 (Feature Processing) to the current dataset (S, A) to reduce the feature space.
  • Line 6: Comment for training a decision tree.
  • Line 7: π_i = TrainDT(S_guess, A, d): Trains a decision tree π_i on the feature-processed dataset (S_guess, A) with maximum depth d.
  • Line 8: Comment for updating the dataset via teacher-student learning.
  • Line 9: (S_s, A_s) = VirtualPlay(π_i): The trained student decision tree π_i is used in a virtual playback to generate new student states S_s and its corresponding actions A_s.
  • Line 10: A_t = Predict(π*, S_s): The teacher ABR algorithm π* predicts the optimal actions A_t for the student states S_s.
  • Line 11: (S, A) = (S, A) ∪ (S_s, A_t): The new state-teacher-action pairs are added to the training dataset, correcting the student's actions with the teacher's.
  • Line 12: Comment indicating the second major phase: generating the Rashomon set.
  • Line 13: (S_guess, A) = Feature_Processing(S, A): Applies feature processing one more time to the final, refined dataset.
  • Line 14: Comment for getting the best decision tree via GOSDT.
  • Line 15: obj_opt = GOSDT(S_guess, A, λ, d): Runs the GOSDT algorithm on the feature-processed dataset to find the single optimal decision tree and its objective value obj_opt (given regularization parameter λ and max depth d).
  • Line 16: Comment for setting the Rashomon set threshold.
  • Line 17: θ_ε = obj_opt * (1 + ε): Calculates the Rashomon set upper bound θ_ε from the optimal objective and the error parameter ε.
  • Line 18: Comment for generating the Rashomon set via TreeFarms.
  • Line 19: R_set = TreeFarms(S_guess, A, λ, θ_ε, d): Uses TreeFarms to generate the complete Rashomon set R_set of decision trees that meet the performance criterion (objective ≤ θ_ε).
  • Line 20: return R_set: Returns the generated Rashomon set.
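
For reference, a minimal sketch of the Rashomon-set generation step, assuming the interface of the open-source treefarms package that implements TreeFarms (parameter names such as "regularization", "rashomon_bound_multiplier", and "depth_budget" follow its documentation and may differ across versions; X must be a binarized feature DataFrame and y the action labels):

import pandas as pd
from treefarms import TREEFARMS

def generate_rashomon_set(X: pd.DataFrame, y: pd.Series,
                          lam: float = 0.0005, eps: float = 0.05, depth: int = 6):
    config = {
        "regularization": lam,             # λ: per-leaf sparsity penalty
        "rashomon_bound_multiplier": eps,  # ε: keep trees with obj <= obj_opt * (1 + ε)
        "depth_budget": depth,             # d: maximum tree depth
    }
    model = TREEFARMS(config)
    model.fit(X, y)  # enumerates all near-optimal sparse trees via dynamic programming
    return model     # an indexable collection of the trees in the Rashomon set

The parameter values mirror the paper's settings (λ = 0.0005, ε = 0.05, d = 6, per Section 5.4).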

4.2.2. LLM Assessment Module

This module takes the Rashomon set of decision trees and evaluates their comprehensibility using Large Language Models (LLMs).

4.2.2.1. Pairwise Comparison

  • Rationale: LLMs are trained with human preference alignment (e.g., Reinforcement Learning from Human Feedback - RLHF), meaning they learn to mimic human subjective judgments by comparing alternatives. Therefore, asking LLMs to perform pairwise comparisons (e.g., "Which of these two trees is more comprehensible?") aligns better with their training paradigm and leverages their strengths in relative ranking rather than absolute scoring.

4.2.2.2. Model Ensemble

  • Rationale: To enhance the reliability and reduce bias or randomness from a single LLM, ComTree employs an ensemble strategy.
  • Process: Multiple LLMs (e.g., GPT-4o and Claude-3.7-Sonnet) are used to compare the comprehensibility of two decision trees. A less comprehensible tree is only eliminated from the Rashomon set if all LLMs in the ensemble reach a consensus on which tree is less comprehensible. This consensus-based approach increases the credibility of the evaluation.

4.2.2.3. Two-Phase Comparison

  • Rationale: This mechanism addresses challenges like LLM inconsistencies and handling decision trees with very similar comprehensibility. It aims to find the most comprehensible tree with the theoretical minimum number of comparisons, ensuring both connectivity (all trees are compared) and minimality (no redundant comparisons, similar to finding a minimum spanning tree in a graph).
  • Phase 1 (Grouped Pairwise Comparison):
    1. The current Rashomon set R_set is shuffled, producing R_set'.
    2. Decision trees are grouped into pairs (e.g., (R_set'(0), R_set'(1)), (R_set'(2), R_set'(3))).
    3. LLMs (in ensemble) evaluate the relative comprehensibility for each pair.
    4. If a consensus is reached, the less comprehensible tree is removed. If no consensus or the trees are deemed indistinguishable, both trees might remain.
  • Phase 2 (Alternative Comparison Order):
    1. If Phase 1 fails to establish any relative relationships (i.e., the set of remaining trees R_set' is identical to the input R_set), Phase 2 is initiated.

    2. LLMs perform comparisons with an alternative grouping (e.g., (R_set'(1), R_set'(0)), (R_set'(2), R_set'(1))). This changes the comparison pairs to break potential local optima or resolve ambiguities left by the first grouping.

    3. If after both phases, decision trees remain indistinguishable, they are considered to have sufficiently similar comprehensibility and are grouped into the same equivalence class. The algorithm continues until only one (or an indistinguishable set) remains.

      The Comprehensibility Assessment Algorithm is detailed below:

Algorithm 1: Comprehensibility Assessment Algorithm

Input: Rashomon set R_set, LLMs f_GPT, f_Claude
Output: Most comprehensible decision tree t_opt

1 while |R_set| > 1 do
2   R_set' = shuffle(R_set)
3   // Pairwise comparison phase
4   for i ∈ [0, ⌊(|R_set'| - 1) / 2⌋] do
5     T_i = R_set'[2i], T_j = R_set'[2i + 1]
6     // Ensemble evaluation using both LLMs
7     r_1 = f_GPT(T_i, T_j), r_2 = f_Claude(T_i, T_j)
8     if r_1 == r_2 then
9       Remove less comprehensible tree from R_set'
10  // Alternative comparison order
11  if R_set == R_set' then
12    for i ∈ [1, ⌊|R_set'| / 2⌋] do
13      T_i = R_set'[i * 2], T_j = R_set'[i * 2 - 1]
14      r_1 = f_GPT(T_i, T_j), r_2 = f_Claude(T_i, T_j)
15      if r_1 == r_2 then
16        Remove less comprehensible tree from R_set'
17  // Local optimum detection
18  if R_set == R_set' then
19    return R_set
20  R_set = R_set'
21 return t_opt

Explanation of Algorithm 1:

  • Line 1: while |R_set| > 1 do: The main loop continues as long as more than one decision tree remains in the Rashomon set.
  • Line 2: R_set' = shuffle(R_set): The current Rashomon set is shuffled to create R_set' for randomized pairing.
  • Line 3: Comment for the first pairwise comparison phase.
  • Line 4: for i ∈ [0, ⌊(|R_set'| - 1) / 2⌋] do: Loops through R_set' to form pairs of trees.
  • Line 5: T_i = R_set'[2i], T_j = R_set'[2i + 1]: Selects two trees to compare.
  • Line 6: Comment for ensemble evaluation.
  • Line 7: r_1 = f_GPT(T_i, T_j), r_2 = f_Claude(T_i, T_j): Both LLMs (f_GPT, f_Claude) compare the comprehensibility of T_i and T_j. Their responses (r_1, r_2) indicate their preference (which tree is more comprehensible, or a tie).
  • Line 8: if r_1 == r_2 then: Checks whether both LLMs agree on which tree is less comprehensible.
  • Line 9: Remove less comprehensible tree from R_set': If LLMs agree, the less comprehensible tree is removed from the temporary set R_set'.
  • Line 10: Comment for the alternative comparison order phase.
  • Line 11: if R_set == R_set' then: This condition checks if any trees were removed in the first pairwise comparison phase. If R_set (before shuffling) is still equal to R_set' (after the first phase), it means no consensus was reached for any pair, or all pairs were considered equally comprehensible. In this case, an alternative comparison order is attempted.
  • Line 12: for i ∈ [1, ⌊|R_set'| / 2⌋] do: Loops through R_set' with a shifted pairing.
  • Line 13: T_i = R_set'[i * 2], T_j = R_set'[i * 2 - 1]: Selects two trees with an offset relative to the first phase's pairing.
  • Line 14-16: Similar ensemble evaluation and removal process as in Lines 7-9.
  • Line 17: Comment for local optimum detection.
  • Line 18: if R_set == R_set' then: This checks if, after both phases, R_set' is still identical to R_set (meaning no trees were removed in either phase). If so, it implies the remaining trees are indistinguishable in comprehensibility by the LLMs.
  • Line 19: return R_set: In this local optimum case, the algorithm returns the current (indistinguishable) R_set as the most comprehensible.
  • Line 20: R_set = R_set': If trees were removed, the R_set is updated with the reduced R_set' for the next round of comparisons.
  • Line 21: return t_opt: After the loop terminates (only one tree or an indistinguishable group remains), the most comprehensible tree (or set) is returned as t_opt.

Prompt Engineering (from Appendix D): The LLMs are provided with decision trees represented as Python code. To optimize their performance in comprehensibility assessment, two prompt engineering methods are used:

  • Few-Shot Learning: Before evaluation, LLMs are given prior knowledge on how developers understand decision trees in ABR, specifying aspects like fewer layers, intuitive features (e.g., last_quality, buffer, tput), and well-organized node structure as indicators of comprehensibility. This guides their judgment.
  • Self-Consistency: In each round of comparison, the query is repeated to each LLM multiple times (e.g., 3 times). The preference that appears most frequently (e.g., T_i > T_j, T_j > T_i, or T_i == T_j) is taken as the LLM's final answer for that round, improving stability.
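
A minimal sketch of ensemble comparison with self-consistency voting; ask_llm(model, tree_a, tree_b) is a hypothetical wrapper around an LLM API that returns "A", "B", or "tie" for which decision tree is more comprehensible:

from collections import Counter

def self_consistent_vote(ask_llm, model, tree_a, tree_b, n_samples=3):
    """Query one LLM several times and keep its most frequent preference."""
    votes = Counter(ask_llm(model, tree_a, tree_b) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def ensemble_compare(ask_llm, models, tree_a, tree_b):
    """Eliminate a tree only if every LLM in the ensemble agrees on the loser."""
    prefs = {self_consistent_vote(ask_llm, m, tree_a, tree_b) for m in models}
    if prefs == {"A"}:
        return "B"  # consensus that A is better: remove B
    if prefs == {"B"}:
        return "A"  # consensus that B is better: remove A
    return None     # disagreement or tie: keep both trees

Here models would hold, e.g., handles for GPT-4o and Claude-3.7-Sonnet, matching the ensemble used in the paper.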

5. Experimental Setup

5.1. Datasets

  • Video Sample: The experiments use the "EnvivoDash3" video from the MPEG-DASH reference videos.

    • Length: 193 seconds.
    • Segmentation: Divided into 4-second segments.
    • Bitrates: Six discrete bitrate levels are available: 300, 750, 1200, 1850, 2850, 4300 kbps.
  • Network Trace Datasets: These datasets simulate varying network conditions (bandwidth, latency) for training and testing.

    • FCC [48]: Used for training the ABR algorithms. It represents low-bandwidth scenarios (average 1.31 Mbps, std 1.00 Mbps).
    • Norway [49], Oboe [4], Puffer [58] (two datasets: Oct. 17-21 and Feb. 18-22): Used for testing the performance of the generated algorithms in simulation. These represent diverse network conditions, including varying bandwidths and instability.
    • 5G [38]: Used specifically for evaluating the comprehensibility potential. This dataset differs significantly from the training trace (FCC), representing high-bandwidth environments (average 347.46 Mbps, std 378.16 Mbps), which serves as a novel environment to test the adaptability and modifiability of algorithms.

5.2. Evaluation Metrics

The primary metric used is Quality of Experience (QoE), with the general form:

QoE = Σ_n q(R_n) - μ Σ_n T_n - Σ_n |q(R_{n+1}) - q(R_n)|

Where:

  • n: index of the video segment being played.

  • q(R_n): the video quality (utility) when the n-th video segment selects bitrate level R_n. This function maps bitrates to quality scores.

  • Σ_n q(R_n): the sum of quality values over all played segments, to be maximized.

  • μ: a penalty coefficient for rebuffering.

  • T_n: the rebuffering duration (in seconds) before the n-th segment plays.

  • μ Σ_n T_n: the total penalty incurred by rebuffering events, to be minimized.

  • |q(R_{n+1}) - q(R_n)|: the absolute quality difference between the (n+1)-th and n-th segments, penalizing sudden or frequent quality changes.

  • Σ_n |q(R_{n+1}) - q(R_n)|: the total penalty for quality variations, promoting smoothness.

    The paper uses two specific instantiations of q(R) for different QoE metrics:

  1. QoE_lin (Linear QoE): One of the most commonly used evaluation indicators.

    • Conceptual Definition: QoE_lin assigns quality scores directly proportional to the bitrate, making it a linear utility function. It provides a straightforward measure of how much bitrate is delivered to the user, balanced against the rebuffering and smoothness penalties.
    • Mathematical Formula: q(R) = R, where R is the selected bitrate for a segment.
    • Penalty Parameter: μ = 4.3 (rebuffering penalty).
  2. QoE_hd (High-Definition QoE): Uses a step-wise quality function, reflecting that certain bitrate thresholds correspond to significant jumps in perceived quality (e.g., reaching an HD resolution).

    • Conceptual Definition: QoE_hd assigns discrete quality levels to specific bitrates, which may better reflect human perception, where quality gains are not linear in bitrate; moving from a standard-definition to a high-definition bitrate can produce a large jump in perceived quality.

    • Mathematical Formula: q(R) = {0.3 : 1, 0.75 : 2, 1.2 : 3, 1.85 : 12, 2.85 : 15, 4.3 : 20}, where the keys are the bitrate levels in Mbps (corresponding to the 300, 750, ..., 4300 kbps options) and the values are the assigned quality scores.

    • Penalty Parameter: μ = 8 (rebuffering penalty).

      The following are the results from Table 3 of the original paper:

      Metric | q(R) | μ
      QoE_lin | q(R) = R | 4.3
      QoE_hd | q(R) = {0.3 : 1, 0.75 : 2, 1.2 : 3, 1.85 : 12, 2.85 : 15, 4.3 : 20} | 8

Additionally, the paper uses the QoE Improvement Ratio (QoE_impro) to compare performance differences:

QoE_impro = (QoE - QoE_baseline) / |QoE_baseline|

Where:

  • QoE_impro: the relative improvement in QoE over a baseline.
  • QoE: the QoE achieved by the algorithm under evaluation.
  • QoE_baseline: the QoE achieved by the reference baseline algorithm.
  • |QoE_baseline|: the absolute value of the baseline QoE, used in the denominator to handle cases where the baseline QoE is negative.
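
As a worked example of these metrics, here is a minimal Python sketch (not from the paper; the example bitrate and rebuffering sequences are illustrative), with bitrates in Mbps and rebuffering times in seconds:

Q_HD = {0.3: 1, 0.75: 2, 1.2: 3, 1.85: 12, 2.85: 15, 4.3: 20}

def qoe(bitrates, rebuf_times, q=lambda r: r, mu=4.3):
    """QoE = Σ q(R_n) - μ Σ T_n - Σ |q(R_{n+1}) - q(R_n)|."""
    quality = sum(q(r) for r in bitrates)
    rebuffer_penalty = mu * sum(rebuf_times)
    smoothness_penalty = sum(abs(q(b) - q(a)) for a, b in zip(bitrates, bitrates[1:]))
    return quality - rebuffer_penalty - smoothness_penalty

def qoe_improvement(qoe_value, qoe_baseline):
    """QoE_impro = (QoE - QoE_baseline) / |QoE_baseline|."""
    return (qoe_value - qoe_baseline) / abs(qoe_baseline)

# QoE_lin (q(R) = R, μ = 4.3): 3.0 - 4.3*0.5 - 0.9 = -0.05
print(qoe([0.3, 0.75, 0.75, 1.2], [0.0, 0.5, 0.0, 0.0]))
# QoE_hd (step-wise q, μ = 8): 8 - 8*0.5 - 2 = 2.0
print(qoe([0.3, 0.75, 0.75, 1.2], [0.0, 0.5, 0.0, 0.0], q=Q_HD.get, mu=8))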

5.3. ABR Baselines

ComTree is compared against a comprehensive set of representative ABR algorithms from various design paradigms:

  • Heuristic Approaches:

    • BBA (Buffer-Based Approach) [19]: A classic buffer-based algorithm that primarily uses the current buffer occupancy to decide the next bitrate. Simple and widely used.
    • Bola (Buffer-Occupancy-based Lyapunov-Approach) [53]: Another buffer-based algorithm that uses Lyapunov optimization to achieve near-optimal bitrate adaptation.
    • RobustMPC [59]: A model-based approach using Model Predictive Control, which forecasts future network conditions and optimizes decisions over a time horizon.
  • Learning-Based Methods:

    • Pensieve [32]: A pioneering deep reinforcement learning (DRL) based ABR algorithm that learns policies through neural networks. Often used as a high-performance benchmark.
    • Genet [56]: An ABR algorithm utilizing curriculum learning to improve adaptation.
    • NetLLM [55]: An emerging approach that leverages Large Language Models for networking tasks, specifically ABR.
  • Interpretability-Focused Method:

    • Pitree [36]: An algorithm that transforms black-box models (like Pensieve) into decision trees to improve interpretability. Pitree(P) specifically refers to Pitree using Pensieve as its teacher network.

5.4. Implementation Details of ComTree

  • Parameters:
    • Importance Regularization Parameter (λ): 0.0005 (the sparsity penalty in the GOSDT objective function).
    • Rashomon Set Boundary (ε): 0.05 (trees within 5% of the optimal objective value are included).
    • Maximum Tree Depth (d): 6 (limits the complexity of decision trees).
  • Hardware: Experiments are conducted on Ubuntu 18 with dual AMD EPYC 7742 processors.
  • Naming Conventions:
    • ComTree(P): Refers to the optimal decision tree generated by ComTree using Pensieve as the teacher network.

    • ComTree_C: Refers to the most comprehensible tree selected from the Rashomon set by the LLM Assessment Module.

    • '-L' suffix: Indicates an algorithm that has been adjusted by LLMs (e.g., ComTree_C-L).

      The following are the results from Table 4 of the original paper:

      Notation | Meaning
      ComTree(P) | The optimal tree generated with teacher Pensieve
      ComTree_C | The most comprehensible tree in the Rashomon set
      '-L' suffix | Adjusted by LLMs

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Performance of The Optimal Tree (ComTree(P))

6.1.1.1. Trace-driven Simulation Experiment

The ComTree(P) algorithm, which represents the optimal decision tree generated by the framework (using Pensieve as the teacher), was evaluated across various network traces.

The following figure (Figure 4 from the original paper) illustrates the CDF curves of QoE_lin for different algorithms across network traces:

Figure 4: CDF of QoE_lin in different traces. Four subplots, (a) Norway, (b) Oboe, (c) Puffer-Oct.17-21, and (d) Puffer-Feb.18-22, compare the performance of multiple adaptive streaming algorithms, including ComTree.

The CDF (Cumulative Distribution Function) plots in Figure 4 show the distribution of QoE_lin values over many playback sessions. A curve shifted to the right indicates better performance (higher QoE). ComTree(P) consistently shows high QoE distributions, often overlapping with or surpassing other top-performing algorithms like Pensieve and Pitree(P).

The following are the results from Table 5 of the original paper:

Trace | ComTree(P) | Pitree(P) | RobustMPC | NetLLM | Bola | Pensieve | Genet | BBA
Norway | 0.93 (± 0.58) | 0.94* (± 0.57) | 0.85 (± 0.64) | 0.87 (± 0.52) | 0.83 (± 0.47) | 0.92‡ (± 0.56) | 0.46 (± 0.25) | 0.64 (± 0.65)
Oboe | 2.11 (± 1.06) | 2.08 (± 1.10) | 2.16* (± 1.15) | 1.81 (± 0.90) | 1.80 (± 0.93) | 2.10† (± 1.08) | 0.95 (± 0.62) | 1.84 (± 1.12)
Puffer-2110 | 0.76‡ (± 2.34) | 0.72 (± 2.36) | 0.68 (± 1.85) | 0.83 (± 2.12) | 0.79* (± 1.81) | -0.13 (± 12.67) | 0.63 (± 1.30) | -0.18 (± 2.92)
Puffer-2202 | 0.78* (± 2.98) | 0.73† (± 3.05) | 0.67‡ (± 2.92) | 0.66 (± 3.22) | 0.60 (± 3.01) | 0.29 (± 10.47) | 0.25 (± 2.85) | 0.25 (± 3.18)
Average | 1.15* (1st) | 1.11 (2nd) | 1.10‡ (3rd) | 1.04 (4th) | 1.00 (5th) | 0.71 (6th) | 0.58 (7th) | 0.58 (8th)

*Note: * indicates the best performance, † the second best, and ‡ the third best in each row (dataset).

Table 5 presents the mean and standard deviation of QoElinQoE_{lin} for all algorithms across four testing datasets.

  • Competitive Performance: ComTree(P) consistently places in the top three across the datasets. It achieves the best overall average performance (1.15), ranks first in Puffer-2202, and stays within the top three in Norway and Puffer-2110.

  • Improvements over Baselines: On average, ComTree(P) shows improvements ranging from 4% to 98% compared to classical baseline algorithms (BBA, Bola, RobustMPC).

  • Comparison to Pitree(P): ComTree(P) often performs similarly or slightly better than Pitree(P), which is significant because ComTree achieves this with substantially fewer nodes (as discussed later, typically one-fifth).

  • Teacher Network Performance: ComTree(P) generally performs on par with its teacher network Pensieve in stable traces like Norway and Oboe, but significantly outperforms Pensieve in more challenging Puffer traces, suggesting effective knowledge distillation.

    The following figure (Figure 5 from the original paper) shows the QoE Improvement Ratio of ComTree compared to Pensieve and Pitree(P) under different metrics and network traces:

    Figure 5: Improvement Ratio of ComTree on Different ABR Algorithms, Network Traces and QoE Metrics. Box plots show the QoE improvement ratio under the lin and hd metrics across Norway, Oboe, Puffer-Oct.17-21, and Puffer-Feb.18-22.

  • Figure 5(a) compares ComTree(P) with Pensieve (its teacher), and Figure 5(b) compares ComTree(P) with Pitree(P). The upper box plot regions extending beyond the lower regions indicate that ComTree generally performs better.

  • The analysis using QoE_impro confirms ComTree(P)'s advantages across different network traces and QoE metrics (QoE_lin and QoE_hd). This indicates that by using dynamic programming (TreeFarms) instead of greedy algorithms, ComTree can extract the key decision logic from complex black-box models and achieve smoother decision boundaries, leading to better generalization.

6.1.1.2. Real-world Experiments

Real-world experiments were conducted by integrating the algorithms into dash.js and using Selenium for automated browser testing over a public WiFi network. The following figure (Figure 6 from the original paper) shows QoE_lin and its components in real-world scenarios:

Figure 6: QoE_lin and its Components in Real-World. The chart compares ComTree and other ABR algorithms on QoE, bitrate, rebuffering penalty, and smoothness penalty.

  • ComTree(P) demonstrates strong performance in real-world conditions, achieving the highest QoE_lin and bitrate.
  • It also outperforms its teacher algorithm (Pensieve) in terms of stall (rebuffering) and smoothness metrics, highlighting the benefits of distilling the black-box knowledge into a refined decision tree form.

6.1.2. Characteristics and Performance of the Rashomon Set

The Rashomon set generated by ComTree contains a diverse collection of decision trees that are all nearly optimal in performance.

6.1.2.1. Characteristics of the Rashomon Set

An analysis of 64 distinct instances from a Rashomon set (containing 1.6e5 trees, sorted by objective values) reveals its properties.

The following figure (Figure 7 from the original paper) illustrates the characteristics of ABR algorithms within the Rashomon set:

Figure 7: Characteristics of ABR in the Rashomon Set

  • Figure 7(a) - Distribution of Tree Quantities: Shows that the number of decision trees varies significantly among different instances of the Rashomon set, with some instances containing over 5,000 trees. This highlights the richness and diversity of the set.
  • Figure 7(b) - Feature Utilization Distribution: Indicates that feature utilization rates vary widely. Some features appear in less than 1% of trees, while others are used multiple times within individual trees on average. This diversity in feature usage provides options for selecting comprehensible trees based on feature relevance.
  • Figure 7(c) - Leaf Node Counts Distribution: Over 95% of the decision trees in the Rashomon set contain between 19 and 21 leaves. This suggests that despite structural diversity, the models maintain a relatively consistent level of structural complexity in terms of leaf nodes.
  • Figure 7(d) - Accuracy Rates: Accuracy on both training and testing sets shows minimal variation, ranging from 0.906 to 0.909 across all trees. This confirms that every tree in the Rashomon set meets the performance requirement within the specified ε bound.

6.1.2.2. QoE_lin in the Rashomon Set

While accuracy is maintained, the primary focus for ABR algorithms is QoE. The evaluation of all Rashomon set instances using four network datasets confirms their competitive QoE performance.

The following figure (Figure 8 from the original paper) shows the QoElinQoE_{lin} distribution of ABR algorithms within the Rashomon set:

Figure 8: $QoE_{lin}$ of ABR in the Rashomon Set. Panels: (a) Norway, (b) Oboe, (c) Puffer-Oct.17-21, (d) Puffer-Feb.18-22.

  • Figure 8 illustrates the distribution of $QoE_{lin}$ across the Rashomon set for different network traces, with the positions of Pensieve (teacher) and PiTree(P) (baseline) marked.
  • The variance in $QoE_{lin}$ within the Rashomon set is dataset-dependent, ranging from 0.06 (Norway) to 0.12 (Puffer-Oct. 17-21).
  • Even the lowest-performing instances within the Rashomon set remain competitive, often outperforming several baseline algorithms. For example, in the Puffer-Oct. 17-21 dataset, the worst Rashomon set instance still outperforms all algorithms except RobustMPC and PiTree(P).
  • Over 60% of instances in the Rashomon set surpass PiTree(P)'s performance on the Oboe and Puffer-Feb. 18-22 datasets. This demonstrates that the entire Rashomon set offers strong competitive QoE performance, providing a rich pool from which to select for comprehensibility.

6.1.3. Comprehensibility Analysis of ComTree

6.1.3.1. Process Analysis of Comprehensibility Assessment

The LLM Assessment Module uses pairwise comparisons, ensemble evaluation, and a two-phase comparison strategy to select the most comprehensible tree.
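
A minimal sketch of such a pairwise tournament is shown below. The ask_llms function is a hypothetical stand-in for the paper's two-model ensemble call; returning both candidates models the "two winners" outcome in which the judges disagree or cannot distinguish the pair.

```python
import random

def ask_llms(tree_a: str, tree_b: str) -> set:
    """Hypothetical ensemble judgment: {'A'}, {'B'}, or {'A', 'B'} when the
    two judges disagree or find the trees equally comprehensible."""
    return random.choice([{"A"}, {"B"}, {"A", "B"}])  # stand-in for real LLM calls

def tournament_round(trees: list) -> list:
    """One round of pairwise comparisons; winners (possibly both) advance."""
    random.shuffle(trees)
    survivors = []
    for i in range(0, len(trees) - 1, 2):
        verdict = ask_llms(trees[i], trees[i + 1])
        if "A" in verdict:
            survivors.append(trees[i])
        if "B" in verdict:
            survivors.append(trees[i + 1])
    if len(trees) % 2:
        survivors.append(trees[-1])  # odd tree out gets a bye
    return survivors

candidates = [f"tree_{i}" for i in range(64)]
rounds = 0
while len(candidates) > 1:
    remaining = tournament_round(candidates)
    if len(remaining) == len(candidates):
        break  # the judges can no longer differentiate the remaining trees
    candidates, rounds = remaining, rounds + 1
print(rounds, candidates[:3])
```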

The following figure (Figure 9 from the original paper) shows the results of different rounds of LLM comparison:

Figure 9: Results of Different Rounds

  • Figure 9(a) - Number of Remaining Decision Trees: This plot shows how the number of decision trees in the Rashomon set decreases over multiple rounds of LLM comparisons. The eight experiments required 12 to 18 rounds to reach a final selection. Both Self-Consistency and Few-Shot prompt engineering methods helped reduce the number of required rounds (from 18 to 12-14), indicating improved LLM efficiency in reaching consensus.

  • Figure 9(b) - Proportion of Two Winners: This shows the proportion of rounds where both LLMs (or their ensemble output) returned "two winners," meaning they couldn't distinguish which tree was more comprehensible, or they disagreed. The proportion of such inconsistent judgments tends to increase in later rounds as the remaining decision trees become more similar in comprehensibility. In four tests, the LLMs eventually selected a set of decision trees that they could no longer differentiate.

    The consistency of LLM evaluation across different runs (i.e., repeating the entire comprehensibility assessment process) is also analyzed.

The following figure (Figure 10 from the original paper) shows the results of different runs:

Figure 10: Results of Different Runs

  • Figure 10(a) - Overlap Ratio of Winners: This plot indicates the consistency of selection between two different runs of the LLM assessment process. The curves consistently exceed the reference curve $y = 0.6^x$, suggesting over 60% overlap in the selected "winners" at each round. This demonstrates a degree of reproducibility and consistency in LLM judgments.
  • Figure 10(b) - CDF of Ranking Differences: This plot shows the distribution of differences in the final ranking of decision trees between two runs. For 50% of the instances, the final ranking difference is within 10 positions. When the Self-Consistency method is used, 40% of instances show ranking differences within 5 positions. This indicates reasonable stability in the LLMs' comprehensibility rankings; both measures can be computed as in the sketch below.
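
Both consistency measures are straightforward to compute from two runs' per-round winner sets and final rankings; the sketch below assumes these are available as plain Python collections (the variable names are illustrative).

```python
import numpy as np

def overlap_ratios(winners_run1, winners_run2):
    """Per-round overlap |W1 & W2| / |W1 | W2| between two runs' winner sets."""
    return [len(set(w1) & set(w2)) / len(set(w1) | set(w2))
            for w1, w2 in zip(winners_run1, winners_run2)]

def ranking_differences(rank1, rank2):
    """Absolute final-rank differences for trees ranked in both runs."""
    common = rank1.keys() & rank2.keys()
    return np.array([abs(rank1[t] - rank2[t]) for t in common])

# Hypothetical data from two assessment runs.
w1 = [["t1", "t2", "t3"], ["t1", "t2"]]
w2 = [["t1", "t2", "t4"], ["t2", "t3"]]
print(overlap_ratios(w1, w2))

diffs = ranking_differences({"t1": 0, "t2": 3, "t3": 7}, {"t1": 2, "t2": 4, "t3": 20})
print((diffs <= 10).mean())  # fraction of trees within 10 rank positions
```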

6.1.3.2. Decision Basis Analysis of LLMs Assessment

To understand why LLMs judged certain trees as more comprehensible, their reasoning responses (from the first 32 rounds, totaling 296k words) were analyzed.

The following are the results from Table 6 of the original paper:

| Rank | Reason | Percentage (%) |
| --- | --- | --- |
| 1 | Tree size | 29.2 |
| 2 | Feature Count | 23.8 |
| 3 | Feature Selection | 21.1 |
| 4 | Threshold Selection | 18.9 |
| 5 | Node Organization | 7.0 |

Table 6: Basis for Comprehensibility Judgments by LLM (Top 5)

Table 6 reveals the top five factors LLMs cited for their comprehensibility judgments; a toy proxy score combining these factors is sketched after the list:

  1. Tree size (29.2%): Number of code lines, leaf nodes, tree depth. Smaller trees are preferred.

  2. Feature Count (23.8%): Fewer distinct features used.

  3. Feature Selection (21.1%): Preference for intuitive and crucial features (e.g., buffer, last_quality, throughput).

  4. Threshold Selection (18.9%): Simpler, more intuitive thresholds for splits.

  5. Node Organization (7.0%): Well-organized structure, less feature crossing, consistent features within layers.
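
Purely to illustrate how these criteria could be mechanized, the sketch below folds them into a heuristic proxy score weighted by the Table 6 percentages. This is not ComTree's method (the framework uses LLM judgment, not a closed-form score), and the normalization constants are illustrative assumptions.

```python
def comprehensibility_proxy(n_leaves, depth, n_features,
                            intuitive_feature_frac, clean_threshold_frac):
    """Heuristic score in [0, 1]; higher means more comprehensible.

    Weights mirror the Table 6 percentages; the normalizers (20 leaves,
    depth 6, 14 features) are illustrative assumptions, not paper constants.
    """
    size_term = 1.0 - min(n_leaves / 20 + depth / 6, 2.0) / 2.0
    feature_count_term = 1.0 - min(n_features / 14, 1.0)
    return (0.292 * size_term
            + 0.238 * feature_count_term
            + 0.211 * intuitive_feature_frac  # share of intuitive features used
            + 0.189 * clean_threshold_frac    # share of simple split thresholds
            + 0.070 * 1.0)                    # node organization, fixed here

# A ComTree_C-like tree (3 intuitive features, small) vs. a Pitree(P)-like one.
print(comprehensibility_proxy(12, 5, 3, 1.0, 0.8))
print(comprehensibility_proxy(40, 6, 14, 0.5, 0.4))
```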

    This analysis highlights $ComTree_C$ (the most comprehensible tree selected by the LLMs) as superior to ComTree(P) (the performance-optimal tree) and Pitree(P) (the baseline) in these aspects:

  • Tree Size: Pitree(P) was significantly larger than ComTree(P), which in turn was larger than $ComTree_C$.

  • Feature Count: Pitree(P) used 14 features, ComTree(P) used 4, and $ComTree_C$ used only 3 (buffer, last_quality, last_throughput) – the most crucial ABR features.

  • Node Organization: $ComTree_C$ tended to reuse features for initial splits with less feature crossing compared to ComTree(P), contributing to its perceived comprehensibility.

    The following figure (Figure 11 from the original paper) shows the $ComTree_C$ decision tree (the most comprehensible):

    Figure 11: $ComTree_C$ (Some details omitted). The tree splits on features such as throughput, delay, and quality, with conditions like $lq \le 0.23$.

Comparing Figure 11 ($ComTree_C$) with Figure 1 (Pitree, from Section 2.1), $ComTree_C$ is visibly simpler, shallower, and uses fewer distinct features, aligning with the LLMs' judgment criteria.

6.1.4. The Potential for Comprehensibility in ComTree

To validate the practical value of comprehensibility, experiments were conducted to simulate developer enhancement in a novel network environment (5G traces), which significantly differs from the training data. Instead of human developers, LLMs were employed to modify the algorithms, providing a controlled and reproducible environment.
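
A hedged sketch of this LLM-as-developer loop is given below. The prompt wording and the model name are assumptions (the paper's actual prompts and models are not restated in this section); the OpenAI Python client is used only as a familiar stand-in.

```python
from openai import OpenAI  # stand-in client; the paper's actual setup may differ

PROMPT_TEMPLATE = """You are a streaming engineer. Below is a decision-tree
ABR policy that underperforms on 5G traces with very high throughput.
Adjust thresholds or branches so it requests higher bitrates when throughput
and buffer are ample, while keeping rebuffering low. Return only the
modified tree.

{tree_code}
"""

def llm_optimize_tree(tree_code: str, model: str = "gpt-4o") -> str:
    """Ask an LLM to adapt a comprehensible decision tree to a new network
    regime, mirroring the experiment's setup (sketch only)."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(tree_code=tree_code)}],
    )
    return response.choices[0].message.content

# Usage (hypothetical): optimized = llm_optimize_tree(open("comtree_c_tree.py").read())
```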

The following are the results from Table 2 of the original paper, showing $QoE_{lin}$ in FCC and 5G traces:

| Trace | BBA | RobustMPC | Pitree | Pitree_change |
| --- | --- | --- | --- | --- |
| FCC | 0.61 (± 1.28) | 0.83 (± 1.30) | 0.88 (± 1.30) | 0.87 (± 1.30) |
| 5G | 3.99 (± 2e-15) | 4.12 (± 0.19) | 3.81 (± 3e-3) | 3.77 (± 8e-16) |

Table 2: Results of $QoE_{lin}$

Table 2, from the background section, already highlighted how Pitree performed poorly in 5G (3.81) compared to RobustMPC (4.12), and how a naive Pitree_change (a developer's attempt to modify it, as described in Section 2.1) degraded performance further (3.77). This sets the stage for demonstrating the optimization potential of comprehensible trees.

The following figure (Figure 12 from the original paper) shows $QoE_{lin}$ after LLM optimization in 5G:

Figure 12: $QoE_{lin}$ after LLM Optimization in 5G

  • Figure 12 displays $QoE_{lin}$ and its components (bitrate, smoothness penalty, rebuffering) for $ComTree_C$, ComTree(P), and Pitree(P) before and after LLM optimization (indicated by the -L suffix) in the 5G network environment.
  • Correlation with Comprehensibility: The results show a strong positive correlation between an algorithm's comprehensibility and its optimization potential. The performance of the LLM-adjusted algorithms followed the order $ComTree_C$-L > ComTree(P)-L > Pitree(P)-L, exactly matching the LLMs' comprehensibility ranking of the original trees. The most comprehensible tree ($ComTree_C$) yielded the greatest improvement after LLM adjustment, while the least comprehensible (Pitree(P)) showed the least improvement and even degraded further.
  • Practical Value: After LLM optimization, $ComTree_C$-L achieved a $QoE_{lin}$ of 4.16, surpassing the classical RobustMPC algorithm (4.12) in this challenging 5G scenario. This demonstrates that ComTree, by generating comprehensible algorithms, empowers developers (or LLM-simulated engineers) to effectively modify and tune performance, potentially exceeding even highly engineered baselines in new environments.

6.2. Data Presentation (Tables)

6.2.1. Comparison with Parameter Adjustment and Meta-Learning Algorithms

The paper further positions ComTree by comparing it with meta-RL algorithms (MERINA) and parameter adaptation algorithms (Oboe) in Appendix B.

The following are the results from Table 7 of the original paper:

| Dataset | MERINA | ComTree(M) | Oboe |
| --- | --- | --- | --- |
| Norway | 0.97 (± 0.61) | 0.94 (± 0.65) | 0.88 (± 0.68) |
| Oboe | 2.20 (± 1.09) | 2.19 (± 1.10) | 2.19 (± 1.15) |
| Puffer-2110 | 0.77 (± 2.01) | 0.76 (± 1.98) | 0.77 (± 1.89) |
| Puffer-2202 | 0.85 (± 2.98) | 0.82 (± 2.99) | 0.77 (± 3.10) |

Table 7: Performance comparison with parameter adjustment algorithms and meta-learning algorithms

  • ComTree(M) refers to ComTree using MERINA as its teacher network.
  • Table 7 shows that ComTree(M) achieves performance very close to its advanced teacher (MERINA) and generally outperforms Oboe. For example, on the Oboe dataset, ComTree(M) (2.19) matches Oboe and trails MERINA (2.20) by only 0.01. On Puffer-2202, ComTree(M) (0.82) is close to MERINA (0.85) and clearly better than Oboe (0.77).
  • This reinforces the idea that ComTree can effectively distill knowledge from highly performant black-box models while maintaining competitive performance.

6.3. Ablation Studies / Parameter Analysis

While the paper doesn't present formal "ablation studies" in the traditional sense (e.g., removing a component from ComTree to see its impact), the analysis of the Rashomon set characteristics and the impact of prompt engineering on LLM evaluation serves a similar purpose, demonstrating the robustness and design choices of the framework.

  • Rashomon Set Diversity (Figure 7): Shows that ComTree successfully generates a diverse set of models (varying in tree quantity, feature utilization, and organization) that all maintain high accuracy (Figure 7d) and competitive QoE (Figure 8). This validates the design choice of using TreeFarms to explore the Rashomon set, providing a rich pool for comprehensibility selection.
  • LLM Prompt Engineering Impact (Figure 9 & 10): The comparison of LLM performance with and without Few-Shot and Self-Consistency prompt engineering demonstrates their effectiveness. These techniques reduce the number of rounds needed for consensus and improve the consistency of LLM judgments across runs. This validates the importance of careful prompt engineering in leveraging LLMs for subjective evaluation tasks.
  • Comprehensibility Factors (Table 6): The analysis of LLM judgment criteria (tree size, feature count, selection, thresholds, organization) acts as an implicit validation of ComTree's design. The framework implicitly prioritizes these aspects during decision tree generation and explicitly uses them for LLM selection. $ComTree_C$ being smaller and using fewer, more intuitive features than Pitree(P) confirms that ComTree's output aligns with what LLMs (simulating developers) deem comprehensible.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces ComTree, a pioneering framework for generating adaptive bitrate (ABR) algorithms that prioritize comprehensibility alongside performance. Recognizing the limitations of interpretability in fostering true developer understanding and optimization, ComTree first constructs a Rashomon set of high-performing decision trees using feature processing and dynamic programming. It then leverages Large Language Models (LLMs) to evaluate and select the most comprehensible algorithm from this set, employing a pairwise comparison and ensemble assessment strategy. Experimental results demonstrate that ComTree consistently achieves competitive QoE performance compared to existing baselines, often with significantly smaller decision tree structures. Crucially, it establishes a strong correlation between comprehensibility (as judged by LLMs) and the potential for an algorithm to be further optimized and adapted by engineers, highlighting its practical value in dynamic production environments.

7.2. Limitations & Future Work

The authors openly discuss several limitations and suggest future research directions:

  • Optimal Solution for Comprehensibility: While ComTree improves comprehensibility, it might not be the absolute "optimal" solution. A key reason is feature selection. ComTree borrows knowledge from black-box algorithms for feature partitioning to reduce computational overhead. However, feature selection and threshold selection are among the top factors influencing comprehensibility (Table 6). Future work needs to balance computational efficiency with the comprehensibility of the chosen features and thresholds, potentially exploring methods that prioritize human-interpretable features from the outset rather than relying on black-box importance.
  • Capability Boundaries of LLMs: The paper acknowledges that LLMs are better suited for macro-level planning and content understanding rather than specific decision-making. While ComTree adopts methods to enhance the reliability of LLM assessments, the precise gap between LLMs' evaluations and actual human subjective perception of comprehensibility remains an open question. This requires extensive large-scale subjective experiments involving domain-knowledgeable human experts, which is a significant challenge not only for streaming media but for the broader AI community.
  • Generalization to Other Domains: The framework is specifically designed for ABR. Applying it to other domains would require adapting the feature processing, teacher-student learning, and LLM prompting to the specific characteristics of those fields.

7.3. Personal Insights & Critique

This paper presents a truly innovative and pragmatic approach to a critical problem in applying advanced AI to real-world systems. The shift from "interpretability" to "comprehensibility" is profound and highly relevant, especially for engineering domains where continuous iteration, debugging, and adaptation are essential.

Inspirations:

  • Reframing the Problem: The distinction between interpretability and comprehensibility is a crucial insight. Many interpretable AI efforts stop at simply making decisions traceable. This paper highlights that true utility for human experts lies in understanding the why and how to improve, which requires a higher level of cognitive grasp.
  • Creative use of LLMs: Employing LLMs for subjective evaluation of comprehensibility is a novel and powerful application. It's a clever way to bridge the gap between machine-generated algorithms and human preferences, circumventing the immense cost and logistical challenges of large-scale user studies. This opens avenues for LLMs in other subjective quality assessment tasks in diverse fields.
  • The Rashomon Set Advantage: The concept of the Rashomon set is brilliantly utilized. Instead of being stuck with a single, potentially opaque optimal model, ComTree provides a buffet of nearly optimal choices, allowing for explicit selection based on a non-performance criterion (comprehensibility). This emphasizes that "best" isn't always singular or purely performance-driven.
  • Practical Relevance: The demonstration of the LLM-adjusted $ComTree_C$-L outperforming RobustMPC in a novel 5G environment is a compelling validation of the framework's practical utility. It shows that investing in comprehensibility is not just an academic exercise but yields tangible benefits in adaptability and performance optimization.

Potential Issues/Critique:

  • Subjectivity of Comprehensibility: While LLMs simulate human judgment, the ground truth for "comprehensibility" remains inherently subjective and context-dependent. The LLM's "comprehensibility" might align with a specific type of developer (e.g., one who values conciseness and intuitive features) but not universally. The Few-Shot prompt engineering partially addresses this by guiding the LLM, but the definition of what makes an ABR decision tree "comprehensible" is still implicitly defined by the researchers' input to the LLM. Rigorous human studies, as acknowledged by the authors, would be the gold standard to validate the LLM's judgments.

  • LLM Hallucinations/Bias: Although ensemble methods and self-consistency are used to mitigate LLM issues, LLMs can still exhibit biases or "hallucinate" reasoning. The reliability of their explanations for comprehensibility (Table 6) relies on the LLMs genuinely understanding these concepts rather than merely generating plausible-sounding rationales based on their training data.

  • Feature Engineering for Comprehensibility: The reliance on XGBoost's feature importance for initial feature processing (Algorithm 2) is a practical choice for computational reasons. However, XGBoost's importance might prioritize predictive power over human intuition. A truly "comprehensibility-first" approach might involve a more human-driven or constraint-based feature selection process from the very beginning, even if it adds computational complexity. This is an area the authors themselves point to for future work.

  • Scalability of Rashomon Set Generation: While TreeFarms improves efficiency, generating and searching extremely large Rashomon sets (e.g., for very complex ABR scenarios with many possible features and depths) can still be computationally intensive. The fixed maximum depth ($d = 6$) and Rashomon bound ($\epsilon = 0.05$) limit the search space, but pushing these boundaries might reintroduce scalability challenges.

    Overall, ComTree is a forward-thinking contribution that pushes the boundaries of explainable AI towards actionable AI. By integrating LLMs into the loop of algorithm design and evaluation, it sets a precedent for how we might build more human-centered intelligent systems in the future.
