Beyond Interpretability: Exploring the Comprehensibility of Adaptive Video Streaming through Large Language Models
TL;DR Summary
The paper introduces `ComTree`, a bitrate adaptation framework aimed at enhancing comprehensibility in adaptive video streaming. By generating performance-compliant decision trees and leveraging large language models for evaluation, it significantly improves comprehensibility whi
Abstract
Over the past decade, adaptive video streaming technology has witnessed significant advancements, particularly driven by the rapid evolution of deep learning techniques. However, the black-box nature of deep learning algorithms presents challenges for developers in understanding decision-making processes and optimizing for specific application scenarios. Although existing research has enhanced algorithm interpretability through decision tree conversion, interpretability does not directly equate to developers' subjective comprehensibility. To address this challenge, we introduce \texttt{ComTree}, the first bitrate adaptation algorithm generation framework that considers comprehensibility. The framework initially generates the complete set of decision trees that meet performance requirements, then leverages large language models to evaluate these trees for developer comprehensibility, ultimately selecting solutions that best facilitate human understanding and enhancement. Experimental results demonstrate that \texttt{ComTree} significantly improves comprehensibility while maintaining competitive performance, showing potential for further advancement. The source code is available at https://github.com/thu-media/ComTree.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Beyond Interpretability: Exploring the Comprehensibility of Adaptive Video Streaming through Large Language Models
1.2. Authors
-
Lianchen Jia (Tsinghua University)
-
Chaoyang Li (Tsinghua University)
-
Ziqi Yuan (Tsinghua University)
-
Jiahui Chen (Tsinghua University)
-
Tianchi Huang (Tsinghua University)
-
Jiangchuan Liu (Simon Fraser University)
-
Lifeng Sun (Tsinghua University)
The authors are primarily affiliated with Tsinghua University, a prominent research institution in China, with one author from Simon Fraser University in Canada. Their research backgrounds appear to be in computer science, specifically in multimedia, networking, and potentially machine learning, given the paper's focus on adaptive video streaming, deep learning, and large language models.
1.3. Journal/Conference
The paper is published in the Proceedings of the 33rd ACM International Conference on Multimedia (MM '25), October 27-31, 2025, Dublin, Ireland. The ACM International Conference on Multimedia (ACM MM) is a highly reputable and influential conference in the field of multimedia research, known for publishing cutting-edge work across various aspects of multimedia computing, including streaming, processing, and understanding. Its selective nature indicates the high quality and significance of papers accepted.
1.4. Publication Year
2025
1.5. Abstract
The abstract introduces the evolution of adaptive video streaming, largely propelled by deep learning. It highlights a key challenge: the black-box nature of deep learning algorithms hinders developers' understanding of decision-making and optimization. While existing research has improved algorithm interpretability (e.g., through decision tree conversion), the paper argues that interpretability does not directly translate to developers' subjective comprehensibility. To address this, the authors propose ComTree, the first framework for bitrate adaptation algorithm generation that explicitly considers comprehensibility. ComTree operates by first generating a complete set of decision trees that meet performance requirements (a Rashomon set), and then uses Large Language Models (LLMs) to evaluate these trees for developer comprehensibility. The framework ultimately selects solutions that are easier for humans to understand and enhance. Experimental results indicate that ComTree significantly improves comprehensibility while maintaining competitive performance, suggesting its potential for further development. Source code is made available.
1.6. Original Source Link
https://arxiv.org/abs/2508.16448 This is a preprint published on arXiv.
1.7. PDF Link
https://arxiv.org/pdf/2508.16448v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the lack of developer comprehensibility in advanced adaptive video streaming (ABR) algorithms, particularly those leveraging deep learning. While deep learning has significantly improved ABR performance, its black-box nature makes it difficult for developers to understand how decisions are made, why certain strategies are adopted, and critically, how to optimize these algorithms for specific, evolving application scenarios (e.g., new network conditions, different user preferences).
This problem is important because video streaming is a dominant consumer of internet traffic, and ensuring a high Quality of Experience (QoE) is crucial. As ABR algorithms become more complex, especially with deep learning, developers lose the ability to debug, maintain, and adapt them effectively. Previous research focused on interpretability (making decisions transparent, often by converting black-box models into decision trees), but the paper argues that interpretability alone is insufficient. Complex decision trees, even if technically interpretable step-by-step, can still be hard for developers to comprehend and modify.
The paper's entry point or innovative idea is to introduce comprehensibility as a novel optimization objective for ABR algorithm generation. It proposes a framework, ComTree, that moves "beyond interpretability" by specifically seeking algorithms that are not just transparent but also intuitively understandable and amenable to human enhancement. It leverages the Rashomon set concept to find multiple good-performing models and Large Language Models (LLMs) to evaluate their human comprehensibility, a subjective metric traditionally hard to quantify.
2.2. Main Contributions / Findings
The paper makes the following primary contributions:
- Revealing Limitations and Importance of Comprehensibility: It empirically validates the limitations of current interpretable ABR methods in terms of developer
comprehensibilityand highlights the critical importance of improving this aspect for practical deployment and optimization. This is demonstrated through experiments showing the difficulty of modifying a complex existing decision tree (Pitree) for new network conditions. - Introducing ComTree Framework: It proposes
ComTree, the first framework that integratescomprehensibilityinto the generation process of adaptive video streaming algorithms. This framework uses aRashomon setapproach to identify multiple high-performingdecision treesand then employsLLMsto assess and select the most developer-friendly ABR algorithms from this set. - Demonstrating Competitive Performance and Comprehensibility: Through extensive experiments,
ComTreeis shown to generate algorithms that achieve competitive performance (often matching or exceeding existing baselines) while significantly improvingcomprehensibility. Furthermore, the framework demonstrates thatcomprehensibilitycorrelates with an algorithm's potential for further optimization and improvement by engineers (even LLM-simulated engineers).
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
-
Adaptive Bitrate (ABR) Streaming: This is a technology used for streaming multimedia content (like video) over the internet. Instead of sending video at a fixed quality, ABR algorithms dynamically adjust the video quality (bitrate) based on the user's current network conditions (bandwidth, latency) and device capabilities. The goal is to maximize the user's
Quality of Experience (QoE)by delivering the highest possible quality without interruptions (rebuffering) or drastic quality changes. ABR algorithms typically monitor factors like available bandwidth, buffer occupancy, and historical data to make decisions about the next video segment's quality level. -
Quality of Experience (QoE):
QoEis a subjective measure of a user's overall satisfaction with a service. In video streaming, it encompasses several factors beyond just technical performance, including:- Video Quality: Higher resolution and bitrate generally lead to better visual quality.
- Playback Smoothness: Absence of pauses, freezes, or
rebufferingevents. - Quality Stability: Minimizing abrupt changes in video quality, which can be jarring to the viewer.
- Startup Delay: The time it takes for video playback to begin.
The paper uses specific
QoEmetrics, which will be detailed in the experimental setup.
-
Deep Learning (DL): A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from data. DL models have achieved state-of-the-art results in various domains, including image recognition, natural language processing, and control tasks. However, these models are often referred to as
black-boxmodels because their internal workings and decision-making processes are not easily understandable by humans, making it hard to debug or explain their behavior. -
Interpretability vs. Comprehensibility: These two terms are central to the paper's thesis and are carefully distinguished:
-
Interpretability: Refers to making the decision-making process of an algorithm transparent. It answers the question: "How did the model decide?" For example, converting a complex
black-boxmodel into adecision treemakes each decision step traceable and thusinterpretable. -
Comprehensibility: Goes beyond
interpretability. It focuses on whether a human (specifically a developer in this context) can understand the overall design logic of an algorithm, not just individual decisions. It answers: "Why was the model designed this way?" and "Can I easily modify and optimize it?"Comprehensibilityis a higher-level, subjective goal related to a developer's ability to grasp, maintain, and improve the algorithm. An algorithm can beinterpretable(you can trace its path) but still lackcomprehensibilityif its logic is overly complex, fragmented, or counter-intuitive.The following are the results from Table 1 of the original paper:
Feature Focus Hierarchy Example Interpretability Makes decision processes transparent. Addresses:How did the model decide? Foundation for compre-hensibility Black-box model transformed intotraceable decision tree Comprehensibility Focuses on understanding overall design logic. Ad-dresses: Why was the model designed this way? High-level goal relatedto developers' subjec-tive understanding Algorithms with high comprehensi-bility have greater potential for im-provement after adjustment
-
-
Decision Trees: A type of supervised learning algorithm that builds a model in the form of a tree structure. Each internal node represents a "test" on an attribute (e.g., "buffer size > 2 seconds?"), each branch represents the outcome of the test, and each leaf node represents a class label or a decision (e.g., "choose bitrate X").
Decision treesare inherentlyinterpretablebecause their logic can be easily visualized and followed as a series of if-then-else rules. They are often used to explainblack-boxmodels by approximating their behavior. -
Large Language Models (LLMs): A class of deep learning models trained on vast amounts of text data to understand, generate, and process human language.
LLMscan perform a wide range of tasks, including translation, summarization, question answering, and even complex reasoning. The paper leverages their ability to simulate human judgment and understanding, particularly in subjective tasks like evaluatingcomprehensibility, due to their learnedworld knowledgeandhuman preference alignment. -
Rashomon Set: In machine learning, a
Rashomon setrefers to a collection of multiple models that perform nearly identically well on a given task, despite having potentially different internal structures or decision-making processes. The concept is named after Akira Kurosawa's film "Rashomon," where different characters provide equally plausible but conflicting accounts of the same event. In this paper's context, theRashomon setconsists of manydecision treesthat achieve similarQoEperformance for ABR, allowing the selection of one that is also highlycomprehensible.
3.2. Previous Works
The paper categorizes previous ABR algorithms based on their approach to performance and comprehensibility:
-
Traditional Heuristic Algorithms (e.g., BBA, RobustMPC):
- Description: These algorithms use predefined rules and mathematical models to make bitrate adaptation decisions. Examples include
Buffer-Based Approach (BBA)[19], which primarily uses buffer occupancy, andRobustMPC[59], which uses model predictive control. - Characteristics: They have high
comprehensibilitybecause their rules are explicit and human-designed. However, their rigid rules make them less adaptable to the highly dynamic and unpredictable nature of real-world network conditions, often resulting in suboptimalQoEperformance in complex environments. - Formula Example (General QoE Metric - commonly used in ABR papers, though specific coefficients vary):
The paper uses the following
QoEmetric, which is a common form in ABR research: Where:- : Index for the video segment.
- : Represents the video quality (utility) obtained when the -th video segment is played at bitrate level . This function maps a bitrate to a quality score.
- : The total accumulated quality over all played segments.
- : A penalty coefficient for
rebuffering. - : The duration of the
rebufferingevent that occurs before playing the -th segment. - : The total penalty due to
rebufferingevents. - : The absolute difference in quality between consecutive segments. This term penalizes frequent or large quality changes, promoting
smoothness. - : The total penalty for quality variations, ensuring
smoothness. This formula aims to maximize perceived video quality while minimizing annoyingrebufferingand disruptive quality switches.
- Description: These algorithms use predefined rules and mathematical models to make bitrate adaptation decisions. Examples include
-
Black-Box Methods (e.g., Pensieve, MERINA, Genet, NetLLM):
- Description: These algorithms often employ deep reinforcement learning (DRL) or other advanced machine learning techniques (like
LLMsfor networking) to learn optimal bitrate adaptation policies directly from data or through interaction with simulated environments. Examples includePensieve[32] (DRL-based),MERINA[26] (meta-reinforcement learning),Genet[56] (curriculum learning), andNetLLM[55] (leveragingLLMs). - Characteristics: They typically achieve state-of-the-art
QoEperformance by automatically adapting to complex and dynamic network conditions. However, their complex internal structures (e.g., deep neural networks) result in very lowcomprehensibility. Developers find it extremely challenging to understand their internal decision logic, debug issues in unforeseen scenarios (bad cases), or make targeted improvements.
- Description: These algorithms often employ deep reinforcement learning (DRL) or other advanced machine learning techniques (like
-
Interpretability-Focused ABR (e.g., Pitree):
-
Description: These methods aim to bridge the gap between high-performing
black-boxmodels and human understanding. They often involvepost-hoc analysisormodel distillationtechniques to convert complexblack-boxmodels (like DRL agents) intowhite-boxmodels, such asdecision trees.Pitree[36] is a notable example that transformsblack-boxABR models intodecision treesfor deployment. -
Characteristics: These algorithms improve
interpretabilityby making each decision step traceable. However, as the paper demonstrates withPitree(Figure 1), the resultingdecision treescan still be highly complex (deep, many nodes, many features), making them difficult for developers to comprehend and effectively modify for new scenarios.The following figure (Figure 1 from the original paper) shows an example of a complex decision tree generated by
Pitree:
Figure 1: Decision Tree Generated by Pitree (Some details omitted)
-
-
Hybrid Methods (e.g., Oboe):
- Description: These methods attempt to balance performance and
comprehensibilityby using parameterized rule tables or simpler, structured models.Oboe[4] is an example that uses auto-tuning for ABR algorithms by adapting parameters to network conditions. - Characteristics: They offer a middle ground, often providing better
comprehensibilitythanblack-boxmodels and better performance than purely heuristic approaches. However, their structured nature might still limit their ultimate performance potential compared to fully adaptiveblack-boxmodels, and theircomprehensibilitymight not be as high as purely simple heuristic rules.
- Description: These methods attempt to balance performance and
3.3. Technological Evolution
The evolution of ABR algorithms has progressed through several stages:
-
Early Heuristics (e.g., BBA, MPC): Focused on simple, rule-based logic primarily using buffer status or simplified network models. These were highly
comprehensiblebut limited in adaptability. -
Model-Based Approaches (e.g., RobustMPC): Incorporated more sophisticated control theory or predictive models to make decisions, improving performance over simple heuristics but still relying on explicit models.
-
Deep Reinforcement Learning (DRL) (e.g., Pensieve): Leveraged the power of deep learning to learn complex policies directly from interaction with environments, leading to significant
QoEimprovements. This introduced theblack-boxproblem. -
Interpretability-focused Methods (e.g., Pitree): Addressed the
black-boxissue by converting DRL policies intointerpretableforms likedecision trees, aiming for transparency. -
Comprehensibility-focused Methods (ComTree): This paper marks a new stage by specifically targeting
comprehensibilitybeyond mereinterpretability. It recognizes that eveninterpretablemodels can be too complex for human understanding and optimization, proposing methods to actively select for human-friendly structures.This paper's work (
ComTree) fits within the latest stage, aiming to provide ABR algorithms that are not only performant andinterpretablebut also genuinelycomprehensibleand amenable to developer-driven enhancements.
3.4. Differentiation Analysis
Compared to the main methods in related work, ComTree offers several core differences and innovations:
- Beyond Interpretability to Comprehensibility: While
Pitreealso usesdecision treesforinterpretability,ComTreeexplicitly defines and optimizes forcomprehensibility, a subjective metric reflecting a developer's ability to understand and modify the algorithm. This is a fundamental shift from merely explaining how a decision is made to understanding why the overall algorithm was designed that way and how to improve it. - Leveraging Rashomon Sets:
ComTreedoes not just find an optimaldecision treebut explores theRashomon set– a collection of nearly equally performingdecision trees. This allows for a choice based oncomprehensibilityamong high-performing alternatives, a capability not present in methods that yield a single "best" model. - Utilizing Large Language Models (LLMs) for Subjective Evaluation:
ComTreeinnovatively employsLLMsto evaluate the subjectivecomprehensibilityofdecision trees. This is a novel application ofLLMsin ABR, simulating human judgment in a way that traditional objective metrics cannot. Previous methods rely on structural metrics (like tree depth or node count) that may not fully capture human understanding. - Enhanced Generative Process: Unlike
Pitreewhich uses greedy algorithms to generatedecision trees,ComTreeemploys a dynamic programming algorithm (TreeFarms) with optimality guarantees, allowing it to generate a more diverse and globally optimizedRashomon setof trees within specified performance bounds. It also incorporates afeature processingstep to handle high-dimensional ABR state spaces, making optimal tree generation feasible. - Focus on Developer Potential: The ultimate goal of
ComTreeis to generate algorithms that are easier for developers to modify and optimize in new or unforeseenbad cases. This directly addresses a critical practical challenge in production environments thatblack-boxmodels and even complexinterpretablemodels struggle with.
4. Methodology
4.1. Principles
The core idea behind ComTree is to overcome the limitations of current ABR algorithms (especially deep learning-based ones) that are highly performant but lack comprehensibility, hindering developer understanding and optimization. ComTree addresses this by embracing comprehensibility as a key design objective. Its principle is to:
- Generate a diverse set of high-performing ABR policies: Instead of seeking a single optimal policy,
ComTreeaims to find aRashomon set– a collection of manydecision treesthat achieve nearly the same highQoEperformance. This set provides options. - Evaluate comprehensibility using LLMs: Acknowledge that
comprehensibilityis a subjective human judgment.ComTreeemploysLarge Language Models (LLMs)to simulate developer assessment, choosing the most understandable and modifiabledecision treefrom theRashomon set. - Ensure practical feasibility: The framework incorporates techniques like
feature processingto make the generation of complexRashomon setscomputationally tractable in the high-dimensional ABR state space.
4.2. Core Methodology In-depth
The ComTree framework comprises two main modules: the Rashomon Set Construction Module and the LLM Assessment Module.
4.2.1. Rashomon Set Construction Module
This module is responsible for generating a collection of high-performing, yet structurally diverse, decision trees that meet specific accuracy requirements.
4.2.1.1. Feature Processing
The first step in constructing the Rashomon set is to address the high dimensionality of ABR state spaces, which makes optimal decision tree generation computationally intractable. ABR states can have many features, and when binarized for decision tree algorithms, the number of potential splits becomes enormous. ComTree tackles this through a feature processing step that reduces the number of relevant features using column elimination based on black-box model knowledge.
The process is as follows:
-
Initial Ensemble Tree Generation: An
XGBoostmodel (a parallelized gradient boostingdecision treealgorithm) is trained on the original dataset (S, A) to serve as a referenceblack-boxalgorithm. This provides an initial accuracy (acc_ori) and feature importances (importance). -
Iterative Feature Elimination:
- Features with
importanceless than a predefined threshold are filtered out. - The feature with the lowest importance among the remaining features is identified (
c_min). - This least important feature is temporarily removed from the dataset.
- A new
XGBoostensemble tree is trained on the reduced dataset, and its accuracy (acc_new) is compared to theacc_oriand areference accuracy (acc_rec). - If
acc_newis not lower than the minimum ofacc_oriandacc_rec, it means the feature can be eliminated without significantly impacting accuracy. The elimination is kept, and the process continues. - If
acc_newdrops too much, the last eliminated feature (c_min) is reverted, as its removal negatively affected accuracy.
- Features with
-
Output: This iterative process significantly reduces the number of encoded features, making
dynamic programmingfordecision treegeneration feasible.The
Feature Processing Algorithmis detailed below:
Algorithm 2: Feature Processing Algorithm
Input: state-action dataset (S, A), importance threshold δ, reference accuracy acc_rec
Output: new dataset (S_guess, A)
1 // Obtain initial accuracy and column importance
2 acc_ori, importance := Xgboost(S, A)
3 S_guess = S.copy, acc_new = acc_ori //init dataset
4 /*Column elimination while preserving accuracy
5 while acc_new >= min(acc_ori, acc_rec) and len(S_guess.column) > 1 do
6 // Column elimination based on importance threshold
7 S_guess = fliter_importance(S_guess, importance, δ)
8 // Column elimination based on least important columns
9 c_min = min(importance)
10 S_guess = S_guess.delete(c_min)
11 acc_new, importance = Xgboost(S_guess, A)
12 // Revert column elimination that degrades accuracy
13 if c_min != empty then
14 L S_guess = S_guess.add(c_min)
15 return (S_guess, A)
Explanation of Algorithm 2:
- Line 1: Comment indicating the purpose of the following line: to get initial accuracy and feature importances.
- Line 2:
acc_ori, importance := Xgboost(S, A): AnXGBoostmodel is trained on the initialstate-action dataset(, ). This yields the baseline accuracyacc_oriand a measure ofimportancefor each feature (column) in . - Line 3:
S_guess = S.copy, acc_new = acc_ori: InitializesS_guessas a copy of the original state features (this will be the dataset where columns are eliminated) and setsacc_newto the initial accuracy. - Line 4: Comment indicating the start of the column elimination loop.
- Line 5:
while acc_new >= min(acc_ori, acc_rec) and len(S_guess.column) > 1 do: This loop continues as long as thenew accuracy(acc_new) remains above a certain threshold (either the original accuracy or a predefinedreference accuracy) and there is more than one column left inS_guess. - Line 6: Comment for filtering by importance threshold.
- Line 7:
S_guess = filter_importance(S_guess, importance, δ): Filters out features fromS_guesswhoseimportance(from the previousXGBoostrun) is below the . - Line 8: Comment for eliminating the least important column.
- Line 9:
c_min = min(importance): Identifies the feature with the minimumimportanceamong the currently considered features. - Line 10:
S_guess = S_guess.delete(c_min): Removes the featurec_minfrom theS_guessdataset. - Line 11:
acc_new, importance = Xgboost(S_guess, A): RetrainsXGBooston the modifiedS_guess(withc_minremoved) and , obtaining the new accuracyacc_newand updated featureimportancevalues. - Line 12: Comment for reverting if accuracy degrades.
- Line 13: : Checks if a column was actually removed in the current iteration.
- Line 14:
S_guess = S_guess.add(c_min): If theacc_newcondition in thewhileloop (Line 5) was not met, meaning removingc_mindegraded accuracy too much, thenc_minis added back toS_guessto revert the change. - Line 15: : Returns the final
feature-processed dataset(S_guess, ).
4.2.1.2. Generate the Rashomon Set
After feature processing reduces the state space, ComTree generates the Rashomon set. This involves two sub-steps: first, collecting a dataset that reflects realistic ABR playback probabilities using a teacher-student learning framework, and second, using a dynamic programming algorithm (TreeFarms) to find all decision trees within a specified performance error range.
a. Teacher-Student Learning Framework:
- Purpose: To create a training dataset
(S, A)where represents states encountered during actual video playback and represents optimal actions (bitrate choices) for those states. This avoidscascading errorsthat can occur if a student model is trained on data generated solely by itself, leading to unrealistic state distributions. - Process:
- An initial
black-box teacher network(e.g.,Pensieve) is rolled out in a simulated environment to generate an initial dataset ofstate-action pairs. - An initial
student decision treeis trained using this dataset. - The
student decision treeis then deployed in the simulated video playback environment and interacts with a virtual player under various network conditions. This generatesstudent network execution states() andactions(). - For each
student state, theblack-box teacher networkis queried to generate theteacher's optimal action() for that state. - The pair is then added to the training dataset. This process "corrects" the student's actions with the teacher's optimal actions, ensuring the dataset reflects optimal behavior in realistically encountered states. This iterative refinement helps the student learn a policy that mimics the teacher's high performance.
- An initial
b. Rashomon Set Generation using TreeFarms:
-
TreeFarms: This is a
dynamic programmingalgorithm (based onGOSDT[30]) designed to find allsparse decision treesthat are nearly optimal. Unlike greedy approaches that stop at the first good solution,TreeFarmsexplores the solution space more thoroughly. -
GOSDT Objective:
GOSDToptimizes a combination ofmisclassification lossand asparsity penaltyon the number of leaves. The objective function is: Where:obj: The objective value to be minimized.- : The
misclassification loss(orQoEloss in this context) of thedecision tree. This measures how well the tree predicts the correct actions. - : A
regularization parameterthat controls the trade-off between minimizinglossandsparsity. A higher encourages simpler trees with fewer leaves. - : The number of leaves in the
decision tree. This term penalizes complexity, promotingsparse(simpler) trees.
-
Rashomon Set Bound ():
TreeFarmsdoesn't just find the single optimal tree (). It defines a bound to include all trees whose objective value is within a small error margin of the optimal: Where:- : The maximum allowed objective value for a tree to be included in the
Rashomon set. - : The objective value of the single optimal
decision treefound byGOSDT. - : A small positive value (e.g., 0.05) defining the acceptable performance degradation from the optimal. All trees with an
objvalue less than or equal to are considered part of theRashomon set.
- : The maximum allowed objective value for a tree to be included in the
-
Dynamic Programming with Pruning:
TreeFarmskeeps track of upper and lower bounds for the objective of subproblems. It only removes subproblems whose lower bound is greater than , effectively pruning branches that cannot lead to a tree within theRashomon set. This allows it to efficiently store and return all models within the definederror range.The
Rashomon Set Construction Algorithmis detailed below:
Algorithm 3: Rashomon Set Construction Algorithm
Input: ABR Algorithm π*, maximum iteration number M, regularization parameter λ, Rashomon set bounds parameter ε, max depth d
Output: Rashomon set R_set
1 //*Construct a dataset that conforms to real occurrence probabilities using the teacher-student learning framework
2 (S, A) = VirtualPlay(π*) //init dataset
3 for i ∈ [1, ..., M] do
4 // Reduce feature number via column elimination
5 (S_guess, A) = Feature_Processing(S, A)
6 // Train decision tree via the reduced dataset
7 π_i = TrainDT(S_guess, A, d)
8 // Update dataset through the teacher-student learning
9 (S_s, A_s) = VirtualPlay(π_i)
10 A_t = Predict(π*, S_s)
11 (S, A) = (S, A) ∪ (S_s, A_t)
12 /*Generate Rashomon set via TreeFarms
13 (S_guess, A) = Feature_Processing(S, A)
14 // Get the best decision tree via GOSDT algorithm
15 obj_opt = GOSDT(S_guess, A, λ, d)
16 // Set the threshold of the Rashomon set
17 θ_ε = obj_opt * (1 + ε)
18 // Generate Rashomon set via TreeFarms
19 R_set = TreeFarms(S_guess, A, λ, θ_ε, d)
20 return R_set
Explanation of Algorithm 3:
- Line 1: Comment indicating the first major phase: constructing the dataset for
teacher-student learning. - Line 2: : Initializes the dataset by performing a virtual playback using the
teacher ABR algorithm() to collect initialstate-action pairs(S, A). - Line 3:
for i ∈ [1, ..., M] do: Loop for iterations of theteacher-student learningprocess. - Line 4: Comment for
feature processing. - Line 5:
(S_guess, A) = Feature_Processing(S, A): AppliesAlgorithm 2 (Feature Processing)to the current dataset (S, A) to reduce the feature space. - Line 6: Comment for training a
decision tree. - Line 7:
π_i = TrainDT(S_guess, A, d): Trains adecision tree() on thefeature-processed dataset(S_guess, A) with a maximum depth . - Line 8: Comment for updating the dataset via
teacher-student learning. - Line 9: : The trained
student decision tree() is used in a virtual playback to generate newstudent states() and its correspondingactions(). - Line 10:
A_t = Predict(π*, S_s): Theteacher ABR algorithm() predicts the optimal actions () for thestudent states(). - Line 11: : The new
state-teacher action pairsare added to the training dataset, correcting the student's actions with the teacher's. - Line 12: Comment indicating the second major phase: generating the
Rashomon set. - Line 13:
(S_guess, A) = Feature_Processing(S, A): Appliesfeature processingone more time to the final, refined dataset. - Line 14: Comment for getting the best
decision treeviaGOSDT. - Line 15:
obj_opt = GOSDT(S_guess, A, λ, d): Runs theGOSDTalgorithm on thefeature-processed datasetto find the single optimaldecision treeand its objective valueobj_opt(considering andmax depth d). - Line 16: Comment for setting the
Rashomon setthreshold. - Line 17: : Calculates the
Rashomon setupper bound based on the optimal objective and the error parameter . - Line 18: Comment for generating the
Rashomon setviaTreeFarms. - Line 19:
R_set = TreeFarms(S_guess, A, λ, θ_ε, d): UsesTreeFarmsto generate the completeRashomon set() ofdecision treesthat meet the performance criteria (within ). - Line 20:
return R_set: Returns the generatedRashomon set.
4.2.2. LLM Assessment Module
This module takes the Rashomon set of decision trees and evaluates their comprehensibility using Large Language Models (LLMs).
4.2.2.1. Pairwise Comparison
- Rationale:
LLMsare trained withhuman preference alignment(e.g.,Reinforcement Learning from Human Feedback - RLHF), meaning they learn to mimic human subjective judgments by comparing alternatives. Therefore, askingLLMsto performpairwise comparisons(e.g., "Which of these two trees is more comprehensible?") aligns better with their training paradigm and leverages their strengths in relative ranking rather than absolute scoring.
4.2.2.2. Model Ensemble
- Rationale: To enhance the reliability and reduce bias or randomness from a single
LLM,ComTreeemploys anensemble strategy. - Process: Multiple
LLMs(e.g.,GPT-4oandClaude-3.7-Sonnet) are used to compare thecomprehensibilityof twodecision trees. Aless comprehensibletree is only eliminated from theRashomon setif allLLMsin theensemblereach a consensus on which tree is less comprehensible. This consensus-based approach increases the credibility of the evaluation.
4.2.2.3. Two-Phase Comparison
- Rationale: This mechanism addresses challenges like
LLMinconsistencies and handlingdecision treeswith very similarcomprehensibility. It aims to find the most comprehensible tree with the theoretical minimum number of comparisons, ensuring bothconnectivity(all trees are compared) andminimality(no redundant comparisons, similar to finding aminimum spanning treein a graph). - Phase 1 (Grouped Pairwise Comparison):
- The current
Rashomon set() is shuffled (). Decision treesare grouped into pairs (e.g., , ).LLMs(inensemble) evaluate the relativecomprehensibilityfor each pair.- If a consensus is reached, the less comprehensible tree is removed. If no consensus or the trees are deemed indistinguishable, both trees might remain.
- The current
- Phase 2 (Alternative Comparison Order):
-
If Phase 1 fails to establish any relative relationships (i.e., the set of remaining trees is identical to the input ), Phase 2 is initiated.
-
LLMsperform comparisons with analternative grouping(e.g., , ). This changes the comparison pairs to break potential local optima or resolve ambiguities from the first grouping. -
If after both phases,
decision treesremain indistinguishable, they are considered to have sufficiently similarcomprehensibilityand are grouped into the sameequivalence class. The algorithm continues until only one (or an indistinguishable set) remains.The
Comprehensibility Assessment Algorithmis detailed below:
-
Algorithm 1: Comprehensibility Assessment Algorithm
Input: Rashomon set R_set, LLMs f_GPT, f_Claude
Output: Most comprehensible decision tree t_opt
1 while |R_set| > 1 do
2 R_set' = shuffle(R_set)
3 //* Pairwise comparison phase */
4 for i ∈ [0, ⌊(|R_set'| - 1) / 2⌋] do
5 T_i = R_set'[2i], T_j = R_set'[2i + 1]
6 // Ensemble evaluation using both LLMs
7 r_1 = f_GPT(T_i, T_j), r_2 = f_Claude(T_i, T_j)
8 if r_1 == r_2 then
9 Remove less comprehensible tree from R_set'
10 //* Alternative comparison order */
11 if R_set == R_set' then
12 for i ∈ [1, ⌊(|R_set'| / 2⌋)] do
13 T_i = R_set'[i * 2], T_j = R_set'[i * 2 - 1]
14 r_1 = f_GPT(T_i, T_j), r_2 = f_Claude(T_i, T_j)
15 if r_1 == r_2 then
16 Remove less comprehensible tree from R_set'
17 // Local optimum detection
18 if R_set == R_set' then
19 return R_set
20 R_set = R_set'
21 return t_opt
Explanation of Algorithm 1:
- Line 1: : The main loop continues as long as there is more than one
decision treeleft in theRashomon setto compare. - Line 2:
R_set' = shuffle(R_set): The currentRashomon setis shuffled to createR_set'for randomized pairing. - Line 3: Comment for the first
pairwise comparison phase. - Line 4: : Loop through
R_set'to form pairs of trees. - Line 5: : Selects two trees to compare.
- Line 6: Comment for
ensemble evaluation. - Line 7:
r_1 = f_GPT(T_i, T_j), r_2 = f_Claude(T_i, T_j): BothLLMs(f_GPT,f_Claude) are used to compare thecomprehensibilityof and . Their responses (, ) indicate their preference (which tree is more comprehensible or if they are equal). - Line 8: : Checks if both
LLMsagree on which tree is less comprehensible. - Line 9:
Remove less comprehensible tree from R_set': IfLLMsagree, the less comprehensible tree is removed from the temporary setR_set'. - Line 10: Comment for the
alternative comparison orderphase. - Line 11:
if R_set == R_set' then: This condition checks if any trees were removed in the first pairwise comparison phase. IfR_set(before shuffling) is still equal toR_set'(after the first phase), it means no consensus was reached for any pair, or all pairs were considered equally comprehensible. In this case, analternative comparison orderis attempted. - Line 12: : Loop through
R_set'with a shifted pairing. - Line 13: : Selects two trees with an
offsetcompared to the first phase's pairing. - Line 14-16: Similar
ensemble evaluationand removal process as in Lines 7-9. - Line 17: Comment for
local optimum detection. - Line 18:
if R_set == R_set' then: This checks if, after both phases,R_set'is still identical toR_set(meaning no trees were removed in either phase). If so, it implies the remaining trees are indistinguishable incomprehensibilityby theLLMs. - Line 19:
return R_set: In thislocal optimumcase, the algorithm returns the current (indistinguishable)R_setas the most comprehensible. - Line 20:
R_set = R_set': If trees were removed, theR_setis updated with the reducedR_set'for the next round of comparisons. - Line 21:
return t_opt: After the loop terminates (only one tree or an indistinguishable group remains), the most comprehensible tree (or set) is returned ast_opt.
Prompt Engineering (from Appendix D):
The LLMs are provided with decision trees represented as Python code. To optimize their performance in comprehensibility assessment, two prompt engineering methods are used:
- Few-Shot Learning: Before evaluation,
LLMsare given prior knowledge on how developers understanddecision treesin ABR, specifying aspects like fewer layers, intuitive features (e.g.,last_quality,buffer,tput), and well-organized node structure as indicators ofcomprehensibility. This guides their judgment. - Self-Consistency: In each round of comparison, the query is repeated to each
LLMmultiple times (e.g., 3 times). The preference that appears most frequently (e.g., > , > , or == ) is taken as theLLM's final answer for that round, improving stability.
5. Experimental Setup
5.1. Datasets
-
Video Sample: The experiments use the "EnvivoDash3" video from the MPEG-DASH reference videos.
- Length: 193 seconds.
- Segmentation: Divided into 4-second segments.
- Bitrates: Six discrete bitrate levels are available: 300, 750, 1200, 1850, 2850, 4300 kbps.
-
Network Trace Datasets: These datasets simulate varying network conditions (bandwidth, latency) for training and testing.
- FCC [48]: Used for training the ABR algorithms. It represents low-bandwidth scenarios (average 1.31 Mbps, std 1.00 Mbps).
- Norway [49], Oboe [4], Puffer [58] (two datasets: Oct. 17-21 and Feb. 18-22): Used for testing the performance of the generated algorithms in simulation. These represent diverse network conditions, including varying bandwidths and instability.
- 5G [38]: Used specifically for evaluating the
comprehensibilitypotential. This dataset differs significantly from the training trace (FCC), representing high-bandwidth environments (average 347.46 Mbps, std 378.16 Mbps), which serves as a novel environment to test the adaptability and modifiability of algorithms.
5.2. Evaluation Metrics
The primary metric used is Quality of Experience (QoE). The general formula for QoE is provided as:
Where:
-
: Index for the video segment being played.
-
: Represents the video quality (utility) when the -th video segment selects bitrate level . This is a function that maps bitrates to a quality score.
-
: The sum of quality values for all played segments, aiming to maximize overall video quality.
-
: A penalty coefficient for
rebuffering. -
: The
rebufferingduration (in seconds) that occurs before the -th segment is played. -
: The total penalty incurred due to
rebufferingevents, aiming to minimize playback interruptions. -
: The absolute difference in quality between the -th and -th segments. This term penalizes sudden or frequent changes in video quality, promoting
smoothness. -
: The total penalty for quality variations, aiming to minimize disruptive quality switches.
The paper uses two specific instantiations of
q(R)for differentQoEmetrics:
-
(Linear QoE): This is one of the most commonly used evaluation indicators.
- Conceptual Definition: typically assigns quality scores that are directly proportional to the bitrate, making it a linear utility function. It provides a straightforward measure of how much bitrate is delivered to the user, balanced against rebuffering and smoothness penalties.
- Mathematical Formula: (where is the bitrate)
- Symbol Explanation:
- : The selected bitrate for a segment.
- Penalty Parameter: (for the rebuffering penalty).
-
(High-Definition QoE): This metric uses a step-wise function for quality, reflecting that certain bitrate thresholds might correspond to significant jumps in perceived quality (e.g., reaching an HD resolution).
-
Conceptual Definition: uses discrete quality levels for specific bitrates, which might better reflect human perception of quality where quality gains are not always linear with bitrate increases. For example, moving from a standard definition bitrate to a high definition bitrate might provide a large jump in perceived quality.
-
Mathematical Formula: (Here, the values are likely normalized or mapped to specific quality scores, with the bitrates provided in Mbps or similar, corresponding to the video bitrates used in the experiment).
-
Symbol Explanation:
0.3, 0.75, ..., 4.3: These are the bitrate levels (likely in Mbps, corresponding to the 300, 750,..., 4300 kbps options).1, 2, ..., 20: These are the assigned quality scores for each respective bitrate level.
-
Penalty Parameter: (for the rebuffering penalty).
The following are the results from Table 3 of the original paper:
QE q(R) = R μ=4.3 QoEhd q(R) = {0.3 : 1, 0.75 : 2, 1.2 : 3, 1.85 : 12, 2.85 : 15, 4.3 : 20} μ=8
-
Additionally, the paper uses QoE Improvement Ratio (QoE_{impro}) to compare performance differences:
Where:
- : The percentage improvement in
QoErelative to a baseline. QoE: TheQoEachieved by the algorithm being evaluated.- : The
QoEachieved by a referencebaseline algorithm. - : The absolute value of the
baseline QoE, used in the denominator to handle cases wherebaseline QoEmight be negative.
5.3. ABR Baselines
ComTree is compared against a comprehensive set of representative ABR algorithms from various design paradigms:
-
Heuristic Approaches:
BBA (Buffer-Based Approach)[19]: A classic buffer-based algorithm that primarily uses the current buffer occupancy to decide the next bitrate. Simple and widely used.Bola (Buffer-Occupancy-based Lyapunov-Approach)[53]: Another buffer-based algorithm that uses Lyapunov optimization to achieve near-optimal bitrate adaptation.RobustMPC[59]: A model-based approach using Model Predictive Control, which forecasts future network conditions and optimizes decisions over a time horizon.
-
Learning-Based Methods:
Pensieve[32]: A pioneering deep reinforcement learning (DRL) based ABR algorithm that learns policies through neural networks. Often used as a high-performance benchmark.Genet[56]: An ABR algorithm utilizingcurriculum learningto improve adaptation.NetLLM[55]: An emerging approach that leveragesLarge Language Modelsfor networking tasks, specifically ABR.
-
Interpretability-Focused Method:
Pitree[36]: An algorithm that transformsblack-boxmodels (likePensieve) intodecision treesto improveinterpretability.Pitree(P)specifically refers toPitreeusingPensieveas its teacher network.
5.4. Implementation Details of ComTree
- Parameters:
- : 0.0005 (for
GOSDTobjective function). - : 0.05 (means trees within 5% of the optimal objective value are included).
Maximum Tree Depth (d): 6 (limits the complexity ofdecision trees).
- : 0.0005 (for
- Hardware: Experiments are conducted on Ubuntu 18 with dual AMD EPYC 7742 processors.
- Naming Conventions:
-
ComTree(P): Refers to the optimaldecision treegenerated byComTreeusingPensieveas the teacher network. -
: Refers to the most comprehensible tree selected from the
Rashomon setby theLLM Assessment Module. -
'-L' suffix: Indicates an algorithm that has beenadjusted by LLMs(e.g., ).The following are the results from Table 4 of the original paper:
ComTree(P) The optimal tree generated by teacher Pensieve ComTree_C The most comprehensible tree in Rashomon '-L' suffx Adjusted by LLM
-
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Performance of The Optimal Tree (ComTree(P))
6.1.1.1. Trace-driven Simulation Experiment
The ComTree(P) algorithm, which represents the optimal decision tree generated by the framework (using Pensieve as the teacher), was evaluated across various network traces.
The following figure (Figure 4 from the original paper) illustrates the CDF curves of for different algorithms in various network traces:
该图像是图表,展示了不同追踪数据下的 Q o E _ { l i n } 的累积分布函数 (CDF)。图中有四个子图,分别为 (a) Norway, (b) Oboe, (c) Puffer-Oct.17-21 和 (d) Puffer-Feb.18-22,比较了多种自适应视频流算法的性能,包括 ComTree。
Figure 4: CDF of in different traces
The CDF (Cumulative Distribution Function) plots in Figure 4 show the distribution of values over many playback sessions. A curve shifted to the right generally indicates better performance (higher QoE). ComTree(P) consistently shows high QoE distributions, often overlapping with or surpassing other top-performing algorithms like Pensieve and Pitree(P).
The following are the results from Table 5 of the original paper:
| ComTree(P) | Pitree(P) | RobustMPC | NetLLM | Bola | Pensieve | Genet | BBA | |
| Norway | 0.93 (± 0.58) | 0.94* (± 0.57) | 0.85 (± 0.64) | 0.87 (± 0.52) | 0.83 (± 0.47) | 0.92‡ (± 0.56) | 0.46 (± 0.25) | 0.64 (± 0.65) |
| Oboe | 2.11 (± 1.06) | 2.08 (± 1.10) | 2.16* (± 1.15) | 1.81 (± 0.90) | 1.80 (± 0.93) | 2.10† (± 1.08) | 0.95 (± 0.62) | 1.84 (± 1.12) |
| Puffer-2110 | 0.76`‡}` (± 2.34) | 0.72 (± 2.36) | 0.68 (± 1.85) | 0.83 (± 2.12) | 0.79* (± 1.81) | -0.13 (± 12.67) | 0.63 (± 1.30) | -0.18 (± 2.92) |
| Puffer-2202 | 0.78* (± 2.98) | 0.73† (± 3.05) | 0.67‡ (± 2.92) | 0.66 (± 3.22) | 0.60 (± 3.01) | 0.29 (± 10.47) | 0.25 (± 2.85) | 0.25 (± 3.18) |
| Average | 1.15* (1st) | 1.11 (2nd) | 1.10‡ (3rd) | 1.04 (4th) | 1.00 (5th) | 0.71 (6th) | 0.58 (7th) | 0.58 (8th) |
*Note: * indicates the best performance, † the second best, and ‡ the third best in each row (dataset).
Table 5 presents the mean and standard deviation of for all algorithms across four testing datasets.
-
Competitive Performance:
ComTree(P)consistently maintains top-three performance across various datasets. It achieves the overall best average performance (1.15) and ranks first inPuffer-2202, and third inNorwayandPuffer-2110. -
Improvements over Baselines: On average,
ComTree(P)shows improvements ranging from to compared to classical baseline algorithms (BBA, Bola, RobustMPC). -
Comparison to Pitree(P):
ComTree(P)often performs similarly or slightly better thanPitree(P), which is significant becauseComTreeachieves this with substantially fewer nodes (as discussed later, typically one-fifth). -
Teacher Network Performance:
ComTree(P)generally performs on par with its teacher networkPensievein stable traces likeNorwayandOboe, but significantly outperformsPensievein more challenging Puffer traces, suggesting effective knowledge distillation.The following figure (Figure 5 from the original paper) shows the
QoE Improvement RatioofComTreecompared toPensieveandPitree(P)under different metrics and network traces:
Figure 5: Improvement Ratio of ComTree on Different ABR Algorithms, Network Traces and QoE Metrics -
Figure 5(a) compares
ComTree(P)withPensieve(its teacher), and Figure 5(b) comparesComTree(P)withPitree(P). The upper box plot regions extending beyond the lower regions indicate thatComTreegenerally performs better. -
The analysis using confirms
ComTree(P)'s advantages across different network tracks andQoEmetrics ( and ). This indicates that by using dynamic programming (TreeFarms) instead of greedy algorithms,ComTreecan extract key decision logic from complexblack-boxmodels and achieve smoother decision boundaries, leading to better generalization.
6.1.1.2. Real-world Experiments
Real-world experiments were conducted by integrating algorithms into dash.js and using Selenium for automated browser testing over a public WiFi network.
The following figure (Figure 6 from the original paper) shows the and its components in real-world scenarios:
该图像是图表,展示了 ComTree 与其他自适应比特率算法在 QoE(体验质量)、比特率、重缓冲惩罚和平滑惩罚等指标上的改进情况。不同算法以不同颜色和样式表示,横坐标为各指标,纵坐标为 QoE 值。
Figure 6: and its Components in Real-World
ComTree(P)demonstrates strong performance in real-world conditions, achieving the highest and bitrate.- It also outperforms its teacher algorithm (
Pensieve) in terms ofstall(rebuffering) andsmoothnessmetrics, highlighting the benefits of distilling theblack-boxknowledge into a refineddecision treeform.
6.1.2. Characteristics and Performance of the Rashomon Set
The Rashomon set generated by ComTree contains a diverse collection of decision trees that are all nearly optimal in performance.
6.1.2.1. Characteristics of the Rashomon Set
An analysis of 64 distinct instances from a Rashomon set (containing 1.6e5 trees, sorted by objective values) reveals its properties.
The following figure (Figure 7 from the original paper) illustrates the characteristics of ABR algorithms within the Rashomon set:
Figure 7: Characteristics of ABR in the Rashomon Set
- Figure 7(a) - Distribution of Tree Quantities: Shows that the number of
decision treesvaries significantly among different instances of theRashomon set, with some instances containing over 5,000 trees. This highlights the richness and diversity of the set. - Figure 7(b) - Feature Utilization Distribution: Indicates that feature utilization rates vary widely. Some features appear in less than 1% of trees, while others are used multiple times within individual trees on average. This diversity in feature usage provides options for selecting
comprehensibletrees based on feature relevance. - Figure 7(c) - Leaf Node Counts Distribution: Over 95% of the
decision treesin theRashomon setcontain between 19 and 21 leaves. This suggests that despite structural diversity, the models maintain a relatively consistent level of structural complexity in terms of leaf nodes. - Figure 7(d) - Accuracy Rates: The accuracy rates on both training and testing sets show minimal variation, ranging from 0.906 to 0.909 across all trees. This confirms that all trees in the
Rashomon setindeed meet the performance requirements within the specified bounds.
6.1.2.2. in the Rashomon Set
While accuracy is maintained, the primary focus for ABR algorithms is QoE. The evaluation of all Rashomon set instances using four network datasets confirms their competitive QoE performance.
The following figure (Figure 8 from the original paper) shows the distribution of ABR algorithms within the Rashomon set:
该图像是图表,展示了不同场景下ComTree和其他算法在上的累计分布函数(CDF)。图中包含四个子图:(a) Norway, (b) Oboe, (c) Puffer-Oct.17-21, 和(d) Puffer-Feb.18-22,各自显示了不同情况下的性能比较。
Figure 8: of ABR in the Rashomon Set
- Figure 8 illustrates the distribution of across the
Rashomon setfor different network traces, with the positions ofPensieve(teacher) andPiTree(P)(baseline) marked. - The variance in within the
Rashomon setis dataset-dependent, ranging from 0.06 (Norway) to 0.12 (Puffer-Oct. 17-21). - Even the lowest-performing instances within the
Rashomon setremain competitive, often outperforming several baseline algorithms. For example, in the Puffer-Oct. 17-21 dataset, the worstRashomon setinstance still outperforms all algorithms exceptRobustMPCandPiTree(P). - Over 60% of instances in the
Rashomon setsurpassPiTree(P)'s performance on the Oboe and Puffer-Feb. 18-22 datasets. This demonstrates that the entireRashomon setoffers strong competitiveQoEperformance, providing a rich pool from which to select forcomprehensibility.
6.1.3. Comprehensibility Analysis of ComTree
6.1.3.1. Process Analysis of Comprehensibility Assessment
The LLM Assessment Module uses pairwise comparisons with ensemble evaluation and two-phase comparison to select the most comprehensible tree.
The following figure (Figure 9 from the original paper) shows the results of different rounds of LLM comparison:
Figure 9: Results of Different Rounds
-
Figure 9(a) - Number of Remaining Decision Trees: This plot shows how the number of
decision treesin theRashomon setdecreases over multiple rounds ofLLMcomparisons. The eight experiments required 12 to 18 rounds to reach a final selection. BothSelf-ConsistencyandFew-Shotprompt engineeringmethods helped reduce the number of required rounds (from 18 to 12-14), indicating improvedLLMefficiency in reaching consensus. -
Figure 9(b) - Proportion of Two Winners: This shows the proportion of rounds where both
LLMs(or theirensembleoutput) returned "two winners," meaning they couldn't distinguish which tree was more comprehensible, or they disagreed. The proportion of such inconsistent judgments tends to increase in later rounds as the remainingdecision treesbecome more similar incomprehensibility. In four tests, theLLMseventually selected a set ofdecision treesthat they could no longer differentiate.The consistency of
LLMevaluation across different runs (i.e., repeating the entirecomprehensibility assessmentprocess) is also analyzed.
The following figure (Figure 10 from the original paper) shows the results of different runs:
Figure 10: Results of Different Runs
- Figure 10(a) - Overlap Ratio of Winners: This plot indicates the consistency of selection between two different runs of the
LLMassessment process. The curves consistently exceed the reference curve , suggesting over 60% overlap in the selected "winners" at each round. This demonstrates a certain degree of reproducibility and consistency inLLMjudgments. - Figure 10(b) - CDF of Ranking Differences: This plot shows the distribution of differences in the final ranking of
decision treesbetween two runs. For 50% of the instances, the final ranking difference is within 10 positions. When theSelf-Consistencymethod is used, 40% of instances show ranking differences within 5 positions. This indicates reasonable stability in theLLMs'ability to rankcomprehensibility.
6.1.3.2. Decision Basis Analysis of LLMs Assessment
To understand why LLMs judged certain trees as more comprehensible, their reasoning responses (from the first 32 rounds, totaling 296k words) were analyzed.
The following are the results from Table 6 of the original paper:
| Rank | Reason | Percentage (%) |
| 1 | Tree size | 29.2 |
| 2 | Feature Count | 23.8 |
| 3 | Feature Selection | 21.1 |
| 4 | Threshold Selection | 18.9 |
| 5 | Node Organization | 7.0 |
Table 6: Basis for Comprehensibility Judgments by LLM (Top 5)
Table 6 reveals the top five factors LLMs cited for their comprehensibility judgments:
-
Tree size (29.2%): Number of code lines, leaf nodes, tree depth. Smaller trees are preferred.
-
Feature Count (23.8%): Fewer distinct features used.
-
Feature Selection (21.1%): Preference for intuitive and crucial features (e.g.,
buffer,last_quality,throughput). -
Threshold Selection (18.9%): Simpler, more intuitive thresholds for splits.
-
Node Organization (7.0%): Well-organized structure, less feature crossing, consistent features within layers.
This analysis highlights (the most comprehensible tree selected by
LLMs) as superior toComTree(P)(the optimal tree) andPitree(P)(baseline) in these aspects:
-
Tree Size:
Pitree(P)was significantly larger thanComTree(P), which was larger than . -
Feature Count:
Pitree(P)used 14 features,ComTree(P)used 4, and used only 3 (buffer,last_quality,last_throughput) – the most crucial ABR features. -
Node Organization: tended to reuse features for initial splits with less
feature crossingcompared toComTree(P), contributing to its perceivedcomprehensibility.The following figure (Figure 11 from the original paper) shows the decision tree (most comprehensible):
该图像是插图,展示了ComTree算法生成的决策树结构,用于比特率适应。图中包含的变量有通量、延迟、质量等,并通过条件判断(如)展示决策路径,以优化视频流的质量。
Figure 11: ComTree_C (Some details omitted)
Comparing Figure 11 () with Figure 1 (Pitree from Section 2.1), is visibly simpler, shallower, and uses fewer distinct features, aligning with the LLMs' judgment criteria.
6.1.4. The Potential for Comprehensibility in ComTree
To validate the practical value of comprehensibility, experiments were conducted to simulate developer enhancement in a novel network environment (5G traces), which significantly differs from the training data. Instead of human developers, LLMs were employed to modify the algorithms, providing a controlled and reproducible environment.
The following are the results from Table 2 of the original paper, showing in FCC and 5G traces:
| BBA | RobustMPC | Pitree | Pitree_change | |
| FCC | 0.61 (± 1.28) | 0.83(± 1.30) | 0.88 (± 1.30) | 0.87 (± 1.30) |
| 5G | 3.99 (± 2e-15) | 4.12 (± 0.19) | 3.81 (± 3e-3) | 3.77 (± 8e-16) |
Table 2: Results of
Table 2 from the background section already highlighted how Pitree performed poorly in 5G (3.81) compared to RobustMPC (4.12), and how a naive Pitree_change (a developer's attempt to modify it, as described in Section 2.1) further degraded performance (3.77). This sets the stage for demonstrating the optimization potential of comprehensible trees.
The following figure (Figure 12 from the original paper) shows the after LLM optimization in 5G:
该图像是图表,展示了不同算法在 QoE、比特率和流畅惩罚方面的性能比较。图中包含了对 ComTree 和 Pitree 算法的评估,以及各种判断基础的频率数据。这些数据为理解算法的表现和优化方向提供了重要参考。
Figure 12: after LLM Optimization in 5G
- Figure 12 displays the and its components (bitrate, smooth penalty, rebuffering) for ,
ComTree(P), andPitree(P)before and afterLLMoptimization (indicated by the-Lsuffix) in the5Gnetwork environment. - Correlation with Comprehensibility: The results show a strong positive correlation between an algorithm's
comprehensibilityand itsoptimization potential. The performance of theLLM-adjustedalgorithms followed the order: >ComTree(P)-L>Pitree(P)-L. This exactly matches theLLMs'comprehensibilityranking of the original trees. The most comprehensible tree () yielded the greatest improvement afterLLMadjustment, while the least comprehensible (Pitree(P)) showed the least improvement, and even degraded further. - Practical Value: After
LLMoptimization, achieved a of4.16, successfully surpassing the classicalRobustMPCalgorithm (4.12) in this challenging5Gscenario. This demonstrates thatComTree, by generatingcomprehensiblealgorithms, empowers developers (orLLM-simulated engineers) to effectively modify and tune performance, potentially exceeding even highly engineered baselines in new environments.
6.2. Data Presentation (Tables)
6.2.1. Comparison with Parameter Adjustment and Meta-Learning Algorithms
The paper further positions ComTree by comparing it with meta-RL algorithms (MERINA) and parameter adaptation algorithms (Oboe) in Appendix B.
The following are the results from Table 7 of the original paper:
| Dataset | MERINA | ComTree(M) | Oboe |
| Norway | 0.97(± 0.61) | 0.94(± 0.65) | 0.88(± 0.68) |
| Oboe | 2.20(± 1.09) | 2.19(± 1.10) | 2.19(± 1.15) |
| Puffer-2110 | 0.77(± 2.01) | 0.76(± 1.98) | 0.77(± 1.89) |
| Puffer-2202 | 0.85(± 2.98) | 0.82(± 2.99) | 0.77(± 3.10) |
Table 7: Performance comparison with parameter adjustment algorithms and meta-learning algorithms
ComTree(M)refers toComTreeusingMERINAas its teacher network.- Table 7 shows that
ComTree(M)achieves performance very close to its advanced teacher (MERINA) and generally outperformsOboe. For example, in theOboedataset,ComTree(M)matchesMERINAandOboeat2.19. InPuffer-2202,ComTree(M)(0.82) is very close toMERINA(0.85) and significantly better thanOboe(0.77). - This reinforces the idea that
ComTreecan effectively distill knowledge from highly performantblack-boxmodels while maintaining competitive performance.
6.3. Ablation Studies / Parameter Analysis
While the paper doesn't present formal "ablation studies" in the traditional sense (e.g., removing a component from ComTree to see its impact), the analysis of the Rashomon set characteristics and the impact of prompt engineering on LLM evaluation serves a similar purpose, demonstrating the robustness and design choices of the framework.
- Rashomon Set Diversity (Figure 7): Shows that
ComTreesuccessfully generates a diverse set of models (varying in tree quantity, feature utilization, and organization) that all maintain high accuracy (Figure 7d) and competitiveQoE(Figure 8). This validates the design choice of usingTreeFarmsto explore theRashomon set, providing a rich pool forcomprehensibilityselection. - LLM Prompt Engineering Impact (Figure 9 & 10): The comparison of
LLMperformance with and withoutFew-ShotandSelf-Consistencyprompt engineeringdemonstrates their effectiveness. These techniques reduce the number of rounds needed for consensus and improve the consistency ofLLMjudgments across runs. This validates the importance of carefulprompt engineeringin leveragingLLMsfor subjective evaluation tasks. - Comprehensibility Factors (Table 6): The analysis of
LLMjudgment criteria (tree size, feature count, selection, thresholds, organization) acts as an implicit validation ofComTree's design. The framework implicitly prioritizes these aspects duringdecision treegeneration and explicitly uses them forLLMselection. being smaller and using fewer, more intuitive features compared toPitree(P)confirms thatComTree's output aligns with whatLLMs(simulating developers) deem comprehensible.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces ComTree, a pioneering framework for generating adaptive bitrate (ABR) algorithms that prioritize comprehensibility alongside performance. Recognizing the limitations of interpretability in fostering true developer understanding and optimization, ComTree first constructs a Rashomon set of high-performing decision trees using feature processing and dynamic programming. It then leverages Large Language Models (LLMs) to evaluate and select the most comprehensible algorithm from this set, employing a pairwise comparison and ensemble assessment strategy. Experimental results demonstrate that ComTree consistently achieves competitive QoE performance compared to existing baselines, often with significantly smaller decision tree structures. Crucially, it establishes a strong correlation between comprehensibility (as judged by LLMs) and the potential for an algorithm to be further optimized and adapted by engineers, highlighting its practical value in dynamic production environments.
7.2. Limitations & Future Work
The authors openly discuss several limitations and suggest future research directions:
- Optimal Solution for Comprehensibility: While
ComTreeimprovescomprehensibility, it might not be the absolute "optimal" solution. A key reason isfeature selection.ComTreeborrows knowledge fromblack-boxalgorithms forfeature partitioningto reduce computational overhead. However,feature selectionandthreshold selectionare among the top factors influencingcomprehensibility(Table 6). Future work needs to balance computational efficiency with thecomprehensibilityof the chosen features and thresholds, potentially exploring methods that prioritize human-interpretable features from the outset rather than relying onblack-boximportance. - Capability Boundaries of LLMs: The paper acknowledges that
LLMsare better suited for macro-level planning and content understanding rather than specific decision-making. WhileComTreeadopts methods to enhance the reliability ofLLMassessments, the precise gap betweenLLMs'evaluations and actual human subjective perception ofcomprehensibilityremains anopen question. This requires extensivelarge-scale subjective experimentsinvolving domain-knowledgeable human experts, which is a significant challenge not only for streaming media but for the broader AI community. - Generalization to Other Domains: The framework is specifically designed for ABR. Applying it to other domains would require adapting the
feature processing,teacher-student learning, andLLMprompting to the specific characteristics of those fields.
7.3. Personal Insights & Critique
This paper presents a truly innovative and pragmatic approach to a critical problem in applying advanced AI to real-world systems. The shift from "interpretability" to "comprehensibility" is profound and highly relevant, especially for engineering domains where continuous iteration, debugging, and adaptation are essential.
Inspirations:
- Reframing the Problem: The distinction between
interpretabilityandcomprehensibilityis a crucial insight. Manyinterpretable AIefforts stop at simply making decisions traceable. This paper highlights that true utility for human experts lies in understanding the why and how to improve, which requires a higher level of cognitive grasp. - Creative use of LLMs: Employing
LLMsfor subjective evaluation ofcomprehensibilityis a novel and powerful application. It's a clever way to bridge the gap between machine-generated algorithms and human preferences, circumventing the immense cost and logistical challenges oflarge-scale user studies. This opens avenues forLLMsin other subjective quality assessment tasks in diverse fields. - The Rashomon Set Advantage: The concept of the
Rashomon setis brilliantly utilized. Instead of being stuck with a single, potentially opaque optimal model,ComTreeprovides a buffet of nearly optimal choices, allowing for explicit selection based on a non-performance criterion (comprehensibility). This emphasizes that "best" isn't always singular or purely performance-driven. - Practical Relevance: The demonstration of
LLM-adjusted outperformingRobustMPCin a novel5Genvironment is a compelling validation of the framework's practical utility. It shows that investing incomprehensibilityis not just an academic exercise but yields tangible benefits in adaptability and performance optimization.
Potential Issues/Critique:
-
Subjectivity of Comprehensibility: While
LLMssimulate human judgment, the ground truth for "comprehensibility" remains inherently subjective and context-dependent. TheLLM's "comprehensibility" might align with a specific type of developer (e.g., one who values conciseness and intuitive features) but not universally. TheFew-Shotprompt engineering partially addresses this by guiding theLLM, but the definition of what makes an ABRdecision tree"comprehensible" is still implicitly defined by the researchers' input to theLLM. Rigorous human studies, as acknowledged by the authors, would be the gold standard to validate theLLM's judgments. -
LLM Hallucinations/Bias: Although
ensemble methodsandself-consistencyare used to mitigateLLMissues,LLMscan still exhibit biases or "hallucinate" reasoning. The reliability of their explanations forcomprehensibility(Table 6) relies on theLLMsgenuinely understanding these concepts rather than merely generating plausible-sounding rationales based on their training data. -
Feature Engineering for Comprehensibility: The reliance on
XGBoost'sfeature importancefor initialfeature processing(Algorithm 2) is a practical choice for computational reasons. However,XGBoost'simportancemight prioritize predictive power over human intuition. A truly "comprehensibility-first" approach might involve a more human-driven or constraint-basedfeature selectionprocess from the very beginning, even if it adds computational complexity. This is an area the authors themselves point to for future work. -
Scalability of Rashomon Set Generation: While
TreeFarmsimproves efficiency, generating and searching extremely largeRashomon sets(e.g., for very complex ABR scenarios with many possible features and depths) can still be computationally intensive. The fixedmax depth (d=6)and parameters limit the search space, but pushing these boundaries might reintroduce scalability challenges.Overall,
ComTreeis a forward-thinking contribution that pushes the boundaries of explainable AI towards actionable AI. By integratingLLMsinto the loop of algorithm design and evaluation, it sets a precedent for how we might build more human-centered intelligent systems in the future.
Similar papers
Recommended via semantic vector search.