Large Language Models as Generalist Policies for Network Optimization

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

To ensure robust network services, designing control policies is crucial. Existing methods rely on specialized rules or deep learning, limiting generalization. This paper introduces Trailblazer, a framework leveraging large language models to create generalist network policies that generalize across diverse tasks and environments with minimal adaptation.

Abstract

Designing control policies to ensure robust network services is essential to modern digital infrastructure. However, the dominant paradigm for network optimization relies on designing specialist policies based on handcrafted rules or deep learning models, leading to poor generalization across diverse tasks and environments. In contrast, large language models (LLMs), pretrained on Internet-scale corpora, provide a rich and unified knowledge base that encodes fundamental networking principles. Combined with their emergent abilities in generalization to unseen scenarios, LLMs offer a transformative foundation for generalist network policies that can generalize across diverse tasks and environments with minimal adaptation. In this paper, we present Trailblazer, the first systematic framework to realize such a generalist policy for networking. Trailblazer incorporates a network alignment scheme to ground the LLM in specific networking tasks, and an adaptive policy collaboration mechanism that offloads simple control cases from the LLM to a lightweight policy for computational efficiency. Through extensive simulations and large-scale real-world online evaluation on Douyin (the Chinese version of TikTok), Trailblazer, powered by a single LLM, demonstrates stronger cross-task and cross-environment generalization than conventional specialist policies. Our results validate LLMs as the foundation for generalist network policies, and position Trailblazer as the first step toward the generalist-driven paradigm that enables strong generalization with minimal efforts in policy design.


1. Bibliographic Information

1.1. Title

Large Language Models as Generalist Policies for Network Optimization

1.2. Authors

Duo Wu, Linjia Kang, Zhimin Wang (Shenzhen International Graduate School, Tsinghua University, Shenzhen, Guangdong, China); Fangxin Wang (School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Shenzhen, Guangdong, China); Wei Zhang, Xuefeng Tao (Bytedance, Shenzhen, Guangdong, China); Wei Yang (Bytedance, Hangzhou, Zhejiang, China); Le Zhang (Bytedance, San Jose, California, USA); Peng Cui (Department of Computer Science and Technology, Tsinghua University, Beijing, China); Zhi Wang (Shenzhen International Graduate School, Tsinghua University, Shenzhen, Guangdong, China).

The corresponding author is Zhi Wang (wangzhi@sz.tsinghua.edu.cn). Wei Zhang, Xuefeng Tao, Wei Yang, and Le Zhang are affiliated with Bytedance. The authors represent a collaboration between academic institutions (Tsinghua University, The Chinese University of Hong Kong, Shenzhen) and a major technology company (Bytedance), indicating a blend of theoretical research and industrial application.

1.3. Journal/Conference

This paper is published as a preprint on arXiv. While the specific journal or conference is not listed in the provided information, its topic, "Network Optimization" and "Large Language Models," suggests it targets top-tier conferences or journals in computer networking (e.g., ACM SIGCOMM, USENIX NSDI, IEEE/ACM Transactions on Networking) or machine learning (e.g., NeurIPS, ICML). The involvement of Bytedance and the real-world deployment on Douyin point towards a strong emphasis on practical systems and applications, often valued in systems-oriented conferences.

1.4. Publication Year

2025 (the preprint metadata indicates a submission date of 2025-12-03).

1.5. Abstract

The paper addresses the critical need for robust control policies in modern digital infrastructure to ensure reliable network services. It highlights that the current dominant approach for network optimization relies on specialist policies, which are typically designed based on handcrafted rules or deep learning models. This specialist paradigm suffers from poor generalization across diverse tasks and environments, requiring substantial effort for policy redesign. In contrast, large language models (LLMs), pretrained on vast Internet-scale corpora, are presented as a promising foundation for generalist network policies. LLMs inherently encode fundamental networking principles and exhibit emergent abilities in generalization to unseen scenarios, enabling them to adapt across diverse tasks and environments with minimal effort.

The authors introduce Trailblazer, the first systematic framework to realize such a generalist policy for networking. Trailblazer consists of two main components:

  1. A network alignment scheme (NIOKA) that grounds the LLM in specific networking tasks by adapting its input/output modalities and injecting domain-specific knowledge.

  2. An adaptive policy collaboration mechanism (APC) that enhances computational efficiency by offloading simple control cases from the LLM to a lightweight policy.

    Through extensive simulations and a large-scale real-world online evaluation on Douyin (the Chinese version of TikTok), Trailblazer, powered by a single LLM, demonstrates superior cross-task and cross-environment generalization compared to conventional specialist policies. The results validate LLMs as a powerful foundation for generalist network policies and position Trailblazer as a pioneering step towards a generalist-driven paradigm that achieves strong generalization with minimal policy design effort.

https://arxiv.org/abs/2512.11839 (Preprint) PDF Link: https://arxiv.org/pdf/2512.11839v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inefficiency and lack of generalization in designing control policies for network optimization. Modern digital infrastructure, ranging from video streaming to cloud computing, heavily relies on robust network services. However, delivering these services reliably is challenging due to inherent network limitations such as infrastructure disparities and dynamic network environments.

The current dominant paradigm for network optimization relies on specialist policies. These policies are either based on handcrafted rules (requiring domain experts to manually devise rules for each new scenario) or deep learning models (requiring task-specific architecture tuning and trained on limited task-specific data). This specialist-driven paradigm leads to two significant generalization gaps:

  1. Cross-task generalization: Specialist policies struggle to transfer knowledge across diverse networking tasks, leading to high human effort for policy redesign for each new task.

  2. Cross-environment generalization: They perform poorly in previously unseen network environments, even within the same task, failing to adapt to dynamic or out-of-distribution (OOD) conditions.

    The importance of this problem stems from the increasing complexity of network systems and the growing demand for seamless online experiences. The limitations of specialist policies translate into high operational costs, slow adaptation to new challenges, and ultimately, frustrating user experiences (e.g., blurry visuals, delayed responses, outages).

The paper's entry point or innovative idea is to leverage Large Language Models (LLMs) as a transformative foundation for generalist network policies. LLMs, pretrained on vast Internet-scale text corpora (including technical documents and textbooks), implicitly encode a rich, unified knowledge base of fundamental networking principles. Their emergent abilities in pattern recognition and generalization to unseen scenarios offer a path to overcome the dual generalization gaps of the specialist paradigm. The innovative idea is to adapt these powerful general-purpose models for real-time network control to achieve cross-task and cross-environment generalization with minimal adaptation effort, thereby shifting from a specialist-driven to a generalist-driven paradigm.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  1. First Systematic Framework for LLM-based Generalist Network Policies (Trailblazer): The paper introduces Trailblazer, the first systematic framework that enables LLMs to function as generalist policies for network optimization. This framework addresses fundamental challenges in integrating LLMs with networking tasks.
  2. Network Input-Output-Knowledge Alignment (NIOKA) Scheme: Trailblazer proposes NIOKA to bridge the misalignment between LLMs (text-centric) and networking (multi-modal inputs, deterministic actions, fine-grained control). NIOKA includes:
    • A network state encoder to project non-textual network data into the LLM's semantic feature space.
    • A network action decoder to translate LLM outputs into actionable network control decisions.
    • An offline reinforcement fine-tuning algorithm (using Decision Transformer or Contextual Imitation Learning) to inject domain-specific networking knowledge and enable the LLM to learn high-performance policies from diverse experience datasets.
  3. Adaptive Policy Collaboration (APC) Mechanism: To address the high inference latency of LLMs and stringent real-time requirements of network systems, Trailblazer introduces APC. This mechanism features a lightweight, rule-based scheduler that selectively routes complex requests (under poor network conditions) to the LLM for intelligent control, while offloading simpler, stable cases to a conventional lightweight policy for efficient processing. This selective invocation strategy significantly improves system efficiency.
  4. Extensive Validation of Generalization Capabilities: Through comprehensive simulations on two distinct networking tasks (Adaptive Bitrate Streaming (ABR) and Cluster Job Scheduling (CJS)), Trailblazer demonstrates significantly stronger cross-task and cross-environment generalization compared to state-of-the-art specialist policies.
  5. Large-Scale Real-World Online Deployment and Validation: The framework was deployed in Douyin's (TikTok's Chinese version) real-time Congestion Control (CC) service for three weeks of A/B testing, serving over 150,000 users. Trailblazer consistently outperformed VICC (Douyin's highly optimized specialist CC policy) across key industrial performance metrics like video stall rates, proving its reliability and effectiveness in production-grade, latency-sensitive environments.
  6. Key Insights into LLM Behavior in Networking: The paper identifies two critical insights:
    • Early Saturation: LLM performance in network optimization tasks saturates rapidly beyond a certain model scale (e.g., 1B parameters), meaning smaller LLMs can achieve competitive performance efficiently. This contrasts with the typical scaling laws observed in natural language processing (NLP).

    • Selective Invocation: Only invoking LLMs for complex, challenging network conditions, while using lightweight policies for common, stable scenarios, is crucial for balancing efficiency and performance in real-time systems.

      These findings collectively validate LLMs as a robust foundation for generalist network policies and propose a practical blueprint for their deployment, paving the way for a new generalist-driven paradigm that reduces design effort and enhances generalization across heterogeneous network tasks and environments.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Large Language Models (LLMs)

Large Language Models (LLMs) are a class of artificial intelligence models, typically based on the Transformer architecture, that are trained on vast amounts of text data (Internet-scale corpora). Their primary function is to understand, generate, and process human language.

  • Transformer Architecture: This is a neural network architecture introduced by Vaswani et al. (2017) that relies heavily on a mechanism called self-attention. Unlike traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs) for sequence data, Transformers process input sequences in parallel, making them highly efficient for long sequences and capable of capturing long-range dependencies.
    • Self-Attention: The core of the Transformer. For each word in an input sentence, self-attention allows the model to weigh the importance of all other words in the sentence relative to that word, creating a contextual representation for each word. The mechanism computes three vectors for each word: Query ($Q$), Key ($K$), and Value ($V$). The attention output is computed as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where $Q$, $K$, and $V$ are the matrices of query, key, and value vectors, and $d_k$ is the dimension of the key vectors (used for scaling to prevent very large dot products that push the softmax into regions with extremely small gradients). A minimal code sketch of this computation follows the list below.
  • Pretraining: LLMs undergo an unsupervised pretraining phase where they learn to predict missing words in sentences or the next word in a sequence. This process allows them to develop a deep understanding of language, grammar, facts, and even reasoning capabilities.
  • Emergent Abilities: After pretraining on massive datasets, LLMs often exhibit "emergent abilities" - capabilities not explicitly programmed or present in smaller models, such as in-context learning (performing tasks based on instructions and examples within the prompt) and zero-shot generalization (performing tasks without any specific training examples). These abilities are key to their potential as generalist policies.
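For concreteness, here is a minimal NumPy sketch (not from the paper) of the scaled dot-product attention formula above; the matrix shapes and toy inputs are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention as defined above.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                              # pairwise similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                           # weighted sum of value vectors

# toy example: 4 tokens with d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)               # (4, 8)
```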

3.1.2. Network Optimization

Network optimization refers to the process of designing, configuring, and managing computer networks to maximize performance, efficiency, reliability, and security while minimizing costs. This often involves making decisions about resource allocation, routing, congestion control, and data transmission rates.

  • Control Policies: These are the rules or algorithms that dictate how network components (e.g., routers, servers, clients) should behave to achieve optimization goals. They can be rule-based (predefined heuristics) or learning-based (models trained on data).
  • Generalization: The ability of a control policy to perform well on unseen data or in novel scenarios.
    • Cross-task generalization: Applying a policy trained on one networking task (e.g., ABR) to a different task (e.g., CJS) without significant redesign.
    • Cross-environment generalization: Applying a policy trained in one network environment (e.g., stable bandwidth) to a significantly different one (e.g., dynamic bandwidth fluctuations).

3.1.3. Adaptive Bitrate Streaming (ABR)

Adaptive Bitrate Streaming (ABR) is a technology used in online video platforms (like TikTok/Douyin) that dynamically adjusts the quality (bitrate) of a video stream in real-time based on the user's current network conditions (e.g., available bandwidth, latency, buffer occupancy). The goal is to provide a smooth, high-quality viewing experience by minimizing rebuffering (video stalls) and maximizing video quality (bitrate).

  • QoE (Quality of Experience): A metric used to quantify user satisfaction with a video streaming service. It often considers factors like average video bitrate, frequency and duration of rebuffering events, and stability of bitrate changes.

3.1.4. Cluster Job Scheduling (CJS)

Cluster Job Scheduling (CJS) is the process of efficiently distributing computational tasks (jobs) across multiple nodes (servers) in a distributed computing cluster. This is crucial for high-performance computing, big data analytics, and cloud-based machine learning. Jobs are often represented as Directed Acyclic Graphs (DAGs), where nodes are execution stages and edges represent dependencies.

  • JCT (Job Completion Time): A key metric in CJS, measuring the total time from a job's arrival to its successful completion. Lower JCT indicates better scheduling performance.

3.1.5. Congestion Control (CC)

Congestion Control (CC) is a fundamental mechanism in computer networks that aims to prevent network congestion (when too much data is sent into a network, exceeding its capacity). It does this by regulating the rate at which senders transmit data, usually by inferring the available bottleneck link bandwidth. Effective CC is vital for improving network resource utilization and ensuring reliable data transmission, especially for real-time applications like video calls.

  • Round-Trip Time (RTT): The time it takes for a signal to go from the sender to the receiver and back. A key indicator of network latency.
  • Packet Loss Rate: The percentage of data packets that fail to reach their destination. High packet loss often indicates congestion.

3.1.6. Reinforcement Learning (RL)

Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards for desired actions and penalties for undesired ones, iteratively learning a policy (a mapping from states to actions) that maximizes cumulative reward.

  • Offline RL: A variant of RL where the agent learns a policy solely from a fixed dataset of previously collected interactions (experiences) without further interaction with the environment. This is crucial when online interaction is costly, dangerous, or impractical.

3.2. Previous Works

The paper frames prior network optimization approaches as specialist policies, which fall into two main categories:

  1. Handcrafted Rule-Based Policies: These policies rely on manually defined rules by domain experts.

    • TCP Congestion Control (e.g., Slow Start) [7]: A classic example where rules govern how TCP connections increase their sending rate, detect congestion, and recover.
    • BBA (Buffer-Based Approach) for ABR [32]: Heuristically considers buffer occupancy to control bitrate.
    • MPC (Model Predictive Control) for ABR [33]: Optimizes QoE over a future time window using throughput estimates and buffer occupancy.
    • FIFO (First-In-First-Out) and Fair Scheduling for CJS [34]: Basic queueing policies for job scheduling.
    • VICC (Douyin's Production Congestion Control): A highly-optimized specialist policy in production, incorporating multiple adaptive mechanisms like congestion response, bandwidth probing, and packet loss detection.
    • Background: These methods are transparent and interpretable but require significant human effort and domain expertise to design and tune for each specific task and environment. They often lack adaptability to unseen conditions.
  2. Deep Learning (DL) Based Policies: These policies use neural networks to learn control strategies from data.

    • Pensieve for ABR [12]: One of the pioneering works applying deep RL to ABR, training a neural network to select bitrates.
    • GENET for ABR [30]: Combines curriculum learning and RL to train a small neural model for video streaming adaptation.
    • Decima for CJS [9]: An RL model for job scheduling that uses a Graph Neural Network (GNN) to process job DAG information.
    • Background: DL-based methods can learn complex patterns and potentially achieve better performance than rule-based approaches in trained scenarios. However, they often require extensive task-specific model architecture tuning and large amounts of task-specific training data. Their generalization to out-of-distribution (OOD) environments is often limited, and their decision-making can be opaque (black-box).

3.2.1. Decision Transformer (DT)

The Decision Transformer (DT) [43] is an important concept for understanding Trailblazer's offline reinforcement fine-tuning. DT reformulates Reinforcement Learning (RL) as a sequence modeling problem. Instead of learning a policy that maps states to actions, DT learns to predict future actions based on a sequence of past states, actions, and returns-to-go. The returns-to-go specify the desired cumulative reward the agent aims to achieve from a given point in time. This approach leverages the power of Transformer models (like LLMs) for sequence generation.

  • Formula Intuition: Given a desired return, a sequence of past states, and a sequence of past actions, the DT predicts the next action that would lead to that desired return.
  • Significance: This allows an LLM, inherently a sequence prediction model, to be fine-tuned for RL tasks by simply modifying its input and output format. It's particularly useful when near-optimal actions (expert demonstrations) are not directly available, but historical trajectories with associated cumulative rewards (returns) can be reconstructed.

3.2.2. Contextual Imitation Learning (CIL)

Imitation Learning (IL) [46, 47] is a machine learning paradigm where an agent learns a policy by observing and imitating expert demonstrations. Contextual Imitation Learning (CIL), as described in this paper, extends this by maintaining a long context window of historical states and actions.

  • Significance: CIL directly learns to map states to expert actions. It is particularly suitable when high-quality oracle supervision (i.e., ground-truth optimal or near-optimal actions) is available during training, such as bottleneck bandwidths inferred in network simulations. This simplifies the learning objective by removing the need for reward estimation or return-to-go encoding, potentially leading to faster inference.

3.3. Technological Evolution

Networking policies have evolved from simple static rules to increasingly complex adaptive mechanisms.

  • Early Stages (Rule-based): Dominated by handcrafted heuristics (e.g., TCP slow start, FIFO scheduling). These were easy to understand and implement but rigid and often suboptimal in dynamic, complex environments.
  • Mid Stages (Control Theory & Optimization): Integration of mathematical control theory (e.g., MPC for ABR) to make policies more adaptive and robust by predicting future states and optimizing over time windows. Still often relied on simplified models of the network.
  • Recent Stages (Deep Learning & Reinforcement Learning): Emergence of learning-based approaches (e.g., Pensieve, GENET, Decima) leveraging deep neural networks and reinforcement learning. These offered the potential for highly adaptive policies that could learn from data without explicit rule-coding. However, they faced issues with generalization (cross-task and cross-environment), data hunger, interpretability, and computational cost.
  • Current Frontier (LLM-based Generalist Policies - Trailblazer): This paper represents a shift towards generalist policies using Large Language Models. The idea is to leverage the pre-trained knowledge and emergent generalization abilities of LLMs as a universal foundation. This aims to overcome the generalization gap of specialist DL models and reduce the human effort required by rule-based systems, while also addressing practical deployment challenges like latency.

3.4. Differentiation Analysis

Trailblazer differentiates itself from previous works in several key ways:

  1. Generalist vs. Specialist Paradigm:

    • Prior Work: Focused on specialist policies (either rule-based or deep learning-based) tailored for specific tasks (e.g., ABR) and often brittle to new environments. This requires high human effort for policy redesign.
    • Trailblazer: Proposes a generalist-driven paradigm using a single LLM to generalize across diverse tasks (ABR, CJS, CC) and environments (OOD conditions) with minimal adaptation.
  2. Leveraging LLM Pretrained Knowledge:

    • Prior Learning-Based Work: Typically trains neural networks from scratch on task-specific data, lacking a broad, shared knowledge base.
    • Trailblazer: Harnesses the rich, unified knowledge base encoded in LLM pretrained parameters (from Internet-scale corpora including networking texts), which contains fundamental networking principles. This pretrained knowledge is crucial for its generalization.
  3. Addressing LLM-Networking Misalignment:

    • Prior LLM-based Networking Efforts (e.g., NetLLM, 6G-oriented multi-modal models): Some studies have explored LLMs, but often confined to specific layers or tasks.
    • Trailblazer: Systematically addresses the input/output modality and domain knowledge misalignment through its NIOKA scheme (network state encoder, action decoder, offline reinforcement fine-tuning), enabling a truly unified interface for diverse networking tasks.
  4. Efficient Real-World Deployment:

    • Prior LLM-based Work: Often limited to simulation-level evaluations or applications with second-level latency tolerance (e.g., chat, Q&A), due to high LLM inference latency.
    • Trailblazer: Introduces Adaptive Policy Collaboration (APC) with a lightweight, rule-based scheduler and batched inference. This selective invocation mechanism is specifically designed to meet millisecond-level latency requirements for real-time network control in production systems, a critical distinction.
  5. Empirical Validation Scale:

    • Prior Work: Largely relies on simulations.
    • Trailblazer: Beyond extensive simulations, it provides large-scale real-world online A/B tests on a production system (Douyin's CC service), demonstrating industrial improvements in service quality and reliability at scale, which is a significant validation of its practical applicability.
  6. Insights into LLM Scaling and Efficiency:

    • Prior LLM Research: Often emphasizes a scaling law where performance consistently improves with model size.

    • Trailblazer: Identifies early saturation in network optimization, suggesting that small-scale LLMs (e.g., 0.5B, 1B parameters) can achieve competitive performance. This insight, combined with selective invocation, provides practical principles for efficient LLM deployment in latency-sensitive domains.

      In essence, Trailblazer moves beyond merely applying LLMs to networking by providing a holistic framework that systematically adapts LLMs, addresses their practical limitations for real-time control, and rigorously validates their generalization and production-readiness in ways not seen in prior work.

4. Methodology

The Trailblazer framework is designed to transform Large Language Models (LLMs) into generalist policies for network optimization, aiming to achieve strong cross-task and cross-environment generalization while satisfying stringent real-time requirements. It comprises two complementary modules: Network Input-Output-Knowledge Alignment (NIOKA) and Adaptive Policy Collaboration (APC).

4.1. Principles

The core idea behind Trailblazer is to leverage the pretrained knowledge and emergent generalization abilities of LLMs by adapting them to the specific modalities and knowledge requirements of networking tasks, and then deploying them efficiently in real-world systems.

  • NIOKA's theoretical basis: Bridge the modality gap (text vs. non-textual network data) and knowledge gap (abstract LLM knowledge vs. fine-grained network control) by introducing specialized encoders and decoders, and by fine-tuning the LLM with domain-specific experience.
  • APC's theoretical basis: Address computational inefficiency by recognizing that not all network control decisions require the full reasoning power of an LLM. A scheduler intelligently routes "difficult" cases to the LLM and "simple" cases to a lightweight policy, thereby balancing performance and latency.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Network Input-Output-Knowledge Alignment (NIOKA)

The NIOKA scheme is designed to adapt LLMs for networking by addressing the fundamental misalignments in input modalities, output contents, and domain knowledge.

4.2.1.1. Modality Alignment (Input & Output)

LLMs primarily process text and generate probabilistic tokens, while network control requires multi-modal inputs (e.g., numerical statistics, graph structures) and deterministic actions (e.g., specific transmission rates).

  • Network State Encoder:

    • Purpose: To enable the LLM to interpret non-linguistic network information by transforming raw network statistics into a semantic feature space that is compatible with the LLM's internal representations.
    • Functionality: It first extracts features from raw network data (which can be vectorized, scalar, or graph-structured) using a feature encoder. Then, it maps these extracted features into the same semantic feature space as language tokens using a linear projection layer. Both the feature encoder and the linear projection layer are trainable components, learning optimal projection functions during fine-tuning.
    • Implementation Details (Task-Specific):
      • For ABR: Given that ABR states often comprise vectorized data (e.g., historical throughputs, download times) and scalar data (e.g., buffer size, last bitrate), a simple 1D convolutional neural network (CNN) is used for vectorized data and a fully connected layer for scalar data to extract features.
      • For CJS: As CJS tasks involve job DAGs (Directed Acyclic Graphs), a graph neural network (GNN) is employed as the feature encoder to capture the structural and feature information from these graph-structured states.
      • For CC: The state information in CC consists of a set of time-varying network metrics (scalars like loss, RTT, sender/receiver info, delay, queue delay, packet info, jitter, request rate). A 1D CNN is used to extract features from these metrics.
  • Network Action Decoder:

    • Purpose: To translate the high-dimensional feature vectors produced by the LLM into actionable network control decisions.
    • Functionality: It replaces the LLM's original prediction head (which typically predicts language tokens) with a dedicated decoder. This decoder takes the LLM's output feature vectors and transforms them into specific network actions. This decoder is also trainable.
    • Implementation Details (Task-Specific):
      • For ABR: The action space is discrete (candidate bitrates). The action decoder is designed as a linear layer that computes a probability distribution over these candidate bitrates. The bitrate with the highest probability is then selected for video download.
      • For CJS: The action space is multi-component: selecting the target execution stage and determining the number of executors. The action decoder uses two parallel linear layers: one to predict the next stage and another to determine the number of executors.
      • For CC: The action is a continuous bandwidth prediction. A linear regression layer is used as the action decoder for this continuous value output.
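To make the modality-alignment design above concrete, the following PyTorch-style sketch shows what an ABR state encoder (1D CNN for vectorized data, a fully connected layer for scalar data, and a linear projection into the LLM token space) and a bitrate action decoder could look like. It is a simplified illustration, not the paper's published architecture: the module names, feature widths, and the hidden size `llm_dim` are assumptions.

```python
import torch
import torch.nn as nn

class NetworkStateEncoder(nn.Module):
    """Hypothetical sketch: project vectorized + scalar ABR state into the LLM token space."""
    def __init__(self, hist_len: int = 10, llm_dim: int = 896):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=2, out_channels=32, kernel_size=3, padding=1)  # e.g. throughputs, download times
        self.scalar_fc = nn.Linear(2, 32)                        # e.g. buffer size, last bitrate
        self.project = nn.Linear(32 * hist_len + 32, llm_dim)    # map features into the LLM embedding space

    def forward(self, vec_state, scalar_state):
        h = torch.relu(self.conv(vec_state)).flatten(1)          # (B, 32 * hist_len)
        s = torch.relu(self.scalar_fc(scalar_state))             # (B, 32)
        return self.project(torch.cat([h, s], dim=-1)).unsqueeze(1)  # (B, 1, llm_dim) pseudo-token

class BitrateActionDecoder(nn.Module):
    """Hypothetical sketch: map the LLM's output feature vector to a distribution over candidate bitrates."""
    def __init__(self, llm_dim: int = 896, num_bitrates: int = 6):
        super().__init__()
        self.head = nn.Linear(llm_dim, num_bitrates)

    def forward(self, llm_hidden):
        return torch.softmax(self.head(llm_hidden), dim=-1)      # probabilities; the argmax bitrate is selected
```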

4.2.1.2. Knowledge Alignment (Domain-Specific Knowledge Injection)

While LLMs possess abstract network knowledge from pretraining, this might be insufficient for fine-grained, complex control logic. NIOKA addresses this by fine-tuning the LLM with domain-specific networking knowledge using an offline reinforcement fine-tuning algorithm.

  • Experience Dataset Collection:

    • Process: Conventional non-LLM network policies (both rule-based and learning-based) are evaluated across diverse network environments. Their interactions with these environments are collected to form an experience dataset. This dataset comprises network state-action pairs and associated rewards or near-optimal actions.
    • Purpose: This dataset serves as the training data for fine-tuning the LLM. By learning from the behaviors (good and bad) of existing policies, the LLM can distill effective control strategies.
  • Offline Reinforcement Fine-Tuning Algorithm: The choice of algorithm depends on the availability of near-optimal expert demonstrations and reward signals.

    • For ABR and CJS (using Decision Transformer - DT):

      • Context: For ABR, rewards (QoE scores) can be easily quantified, but near-optimal actions (e.g., the truly optimal bitrate choice) are harder to define precisely without an oracle. For CJS, negative JCT serves as a reward. In these scenarios, DT is preferred.
      • Mechanism: DT reformulates RL as a sequence modeling problem, which naturally aligns with the LLM's capabilities. The LLM is trained to predict the next action conditioned on a sequence of historical returns-to-go, network states, and network actions.
      • Formula: The LLM takes the historical sequence as input to predict the next action: $ LLM(\hat{a}_i \mid R_{i-w}, s_{i-w}, a_{i-w}, \cdots, R_i, s_i) $ Where:
        • $\hat{a}_i$: The predicted action at time step $i$.
        • $R_i = \sum_{t \geq i} r_t$: The return-to-go at time step $i$, i.e., the cumulative reward expected from state $s_i$ onwards.
        • $s_{i-w}, \cdots, s_i$: The sequence of network states from $w$ steps in the past up to the current state.
        • $a_{i-w}, \cdots, a_{i-1}$: The sequence of past network actions over the same window.
        • $w$: The length of the historical context window, empirically set to 10 for ABR and 20 for CJS.
      • Training Loss: The LLM is trained with a cross-entropy loss that minimizes the discrepancy between the predicted action and the ground-truth action from the experience dataset: $ \mathcal{L}_i = \mathrm{CE}(\hat{a}_i, a_i) $ Where $\mathrm{CE}$ denotes the cross-entropy loss.
      • Inference: During inference, a sufficiently high target return is specified to guide the LLM to generate high-quality actions.
    • For CC (using Contextual Imitation Learning - CIL):

      • Context: For CC, ground-truth bottleneck bandwidths (which serve as near-optimal expert demonstrations) can be accurately inferred in controlled simulation environments. Thus, CIL is preferred.

      • Mechanism: CIL leverages these ground-truth bottleneck bandwidths as expert actions to supervise the LLM directly. The LLM predicts the next action conditioned on a context window of historical states and actions.

      • Formula: The LLM predicts the next action based on the historical context: $ LLM(\hat{a}_i \mid s_{i-w}, a_{i-w}, \cdots, s_{i-1}, a_{i-1}, s_i) $ Where:

        • $\hat{a}_i$: The predicted action (bottleneck bandwidth) at time step $i$.
        • $s_{i-w}, \cdots, s_i$: The sequence of network states from $w$ steps in the past up to the current state.
        • $a_{i-w}, \cdots, a_{i-1}$: The sequence of past network actions over the same window.
        • $w$: The length of the context window.
      • Training Loss: The LLM is trained by minimizing the mean squared error (MSE) between the predicted action and the expert action: $ \mathcal{L}_i = \mathrm{MSE}(\hat{a}_i, a_i^e) $ Where $a_i^e$ is the expert action (the ground-truth bottleneck bandwidth).

      • Distinction from DT: CIL does not require encoding return-to-go information, simplifying the input sequence and context length, which can lead to faster inference. It directly learns to imitate optimal behavior when explicit expert actions are available.

        The combination of pretrained knowledge from the LLM and domain-specific knowledge injected through fine-tuning enables the LLM to serve as an effective generalist policy.
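The two fine-tuning objectives can be summarized with a short, hedged sketch. The code below assumes a hypothetical `policy` wrapper around the adapted LLM (state encoder, backbone, and action decoder) that returns action logits or bandwidth predictions; only the loss definitions and the return-to-go construction mirror the formulas above.

```python
import torch
import torch.nn.functional as F

def returns_to_go(rewards):
    """R_i = sum of rewards from step i onward, computed per trajectory (rewards: shape (T,))."""
    return torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])

def dt_loss(policy, rtg, states, actions):
    """Decision-Transformer-style objective (ABR/CJS): predict each action from the
    interleaved (return-to-go, state, action) history; cross-entropy against logged actions.
    `actions` is a LongTensor of discrete action indices with shape (B, T)."""
    logits = policy(rtg, states, actions)                 # (B, T, num_actions), hypothetical call
    return F.cross_entropy(logits.flatten(0, 1), actions.flatten())

def cil_loss(policy, states, past_actions, expert_actions):
    """Contextual-imitation objective (CC): regress the expert bottleneck bandwidth
    from the state/action context; mean squared error against the oracle actions."""
    pred = policy(states, past_actions)                   # (B, T), hypothetical call
    return F.mse_loss(pred, expert_actions)
```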

The following figure (Figure 2a from the original paper) illustrates the Network Input-Output-Knowledge Alignment (NIOKA) in Trailblazer:

Fig. 2 An overview of our proposed framework Trailblazer. a, The NIOKA in Trailblazer to address the misalignment between the LLM and networking. The network state encoder is introduced to project non-textual network information into the same feature space as language tokens for the LLM, while the network action decoder is used to map the LLM output feature vectors into specific network actions. Based on the proposed offline reinforcement fine-tuning algorithm, the LLM is fine-tuned over an offline experience dataset collected by evaluating conventional network policies across diverse network environments, with rewards or near-optimal actions as the guiding signals. b, The APC in Trailblazer for efficient LLM deployment, where the fine-tuned LLM collaborates with a conventional policy for intelligent and efficient network control. The heart of APC is a scheduler for adaptive flow request routing. The scheduler evaluates the network conditions of each request (e.g., latency). Requests under poor conditions are deemed difficult cases and allocated to the LLM for intelligent control, while those under stable conditions are handled by a conventional policy for fast processing. To reduce the per-request processing latency, the LLM processes requests in batches.

4.2.2. Adaptive Policy Collaboration (APC)

The APC mechanism addresses the practical challenge of LLM's high inference latency in real-time network systems with strict latency constraints.

4.2.2.1. Scheduler for Adaptive Flow Request Routing

  • Core Idea: Instead of invoking the LLM for every control decision, APC uses a scheduler to selectively offload flow requests from the LLM to a lightweight rule-based policy for fast processing. This is termed selective invocation.
  • Mechanism: For each incoming flow request, the scheduler evaluates its network conditions.
    • Requests under poor network conditions are deemed difficult cases and are routed to the LLM for intelligent control.
    • Requests under stable conditions are handled efficiently by a conventional lightweight policy for fast processing.
  • Scheduler Design: A set of heuristic, deterministic rules is adopted for the scheduler. This rule-based design ensures minimal overhead and fast processing speed, which is critical for routing a large volume of concurrent requests in real-world systems.
    • Example Rules (for CC task): A request is classified as experiencing good network conditions if all of the following criteria are met:
      1. Its last round-trip time (RTT) is below a threshold $\alpha_1$ (e.g., 50 ms).
      2. The packet loss rate is below a threshold $\alpha_2$ (e.g., 0.05).
      3. The last sending rate exceeds $\alpha_3 \times rate^{req}$ (e.g., 0.95 times the application-specific requested rate).
    • Flows meeting all three criteria are routed to the lightweight policy; all others go to the LLM. A minimal routing sketch follows this list.
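The sketch below illustrates such a routing rule in Python; the function name and argument layout are illustrative, while the thresholds follow the example values above.

```python
def route_request(last_rtt_ms, loss_rate, last_send_rate, rate_req,
                  alpha1=50.0, alpha2=0.05, alpha3=0.95):
    """Rule-based scheduler sketch for the CC task.

    Returns 'lightweight' when all three stability criteria hold, otherwise 'llm'."""
    stable = (last_rtt_ms < alpha1
              and loss_rate < alpha2
              and last_send_rate > alpha3 * rate_req)
    return "lightweight" if stable else "llm"

# e.g. a lossy, high-RTT flow is escalated to the LLM
print(route_request(last_rtt_ms=120, loss_rate=0.08, last_send_rate=0.6, rate_req=1.0))  # "llm"
```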

4.2.2.2. Lightweight Policy

  • Purpose: To handle the majority of simple/stable network cases efficiently, reducing the load on the LLM.
  • Design (for CC task): The lightweight policy simply sets the sending rate directly to the request rate ($rate^{req}$). While this might induce congestion under truly poor conditions, it performs reliably in stable scenarios where $rate^{req}$ is often below the actual bottleneck capacity.

4.2.2.3. Batch Processing for LLM

  • Purpose: To further improve system efficiency and reduce per-request processing latency.

  • Mechanism: The LLM processes requests in batches. By grouping multiple requests, the overhead of LLM inference can be amortized across several decisions, leading to a lower effective latency per request. For instance, in the CC task, a batch size of 64 is used, achieving an average inference latency of 37.1 ms, which meets the 100 ms response latency requirement.

    The collaboration between the LLM and the conventional policy via the APC scheduler creates a robust network system. It leverages the LLM's strong capabilities for complex scenarios while ensuring overall efficiency by offloading simple requests.
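Returning to the batch processing described above, the following sketch illustrates, under assumptions, how such a batching loop might be organized; the queue-based structure and the `llm_batch_infer` callable are hypothetical and not taken from the paper.

```python
import time
from queue import Queue, Empty

def serve_llm_requests(request_queue: Queue, llm_batch_infer, batch_size=64, max_wait_s=0.02):
    """Illustrative batching loop: drain up to `batch_size` pending requests, run one
    batched LLM forward pass, and hand each CC decision back to its requester.
    `llm_batch_infer` is a hypothetical callable mapping a list of contexts to decisions."""
    while True:
        batch, deadline = [], time.time() + max_wait_s
        while len(batch) < batch_size and time.time() < deadline:
            try:
                batch.append(request_queue.get(timeout=max_wait_s))
            except Empty:
                break
        if not batch:
            continue
        decisions = llm_batch_infer([req["context"] for req in batch])  # one forward pass amortizes overhead
        for req, decision in zip(batch, decisions):
            req["reply"](decision)   # hypothetical callback returning the decision to the scheduler
```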

The following figure (Figure 2b from the original paper) illustrates the Adaptive Policy Collaboration (APC) in Trailblazer:

Fig. 2b (see the full Fig. 2 caption above) The APC in Trailblazer for efficient LLM deployment: the fine-tuned LLM collaborates with a conventional policy, a rule-based scheduler adaptively routes flow requests according to their network conditions, and the LLM processes the difficult requests in batches to reduce per-request latency.

5. Experimental Setup

5.1. Datasets

The paper uses a combination of real-world and synthetic datasets across three networking tasks to evaluate Trailblazer's generalization capabilities.

5.1.1. Adaptive Bitrate Streaming (ABR)

  • Network Dynamics (Bandwidth Traces):

    • FCC broadband measurement dataset [48]: Primary real-world source. Captures real-world variability across USA consumers.
      • Scale: 485 traces selected (235 for training, 150 for validation, 100 for testing), totaling over 324,000 seconds.
      • Characteristics: Represents typical broadband network conditions.
    • SynthTrace: A synthetic dataset created to evaluate generalization under more challenging network dynamics.
      • Scale: 100 traces.
      • Characteristics: Broader bandwidth ranges and more dynamic fluctuation patterns than FCC traces, designed to be out-of-distribution (OOD).
  • Video Content (Video Manifest Files):

    • Envivio-Dash3 [49]: Default widely used real-world video.
    • SynthVideo: A synthetic video with larger chunk sizes, introduced for additional performance evaluation.
  • Choice Justification: The combination of real-world and synthetic bandwidth traces, along with different video characteristics, allows for comprehensive evaluation of cross-environment generalization under diverse and challenging conditions.

    The following are the environment settings for generalization evaluation in ABR simulation from Extended Data Table 1 of the original paper:

    Environment              | Video         | Bandwidth Traces
    Training Environment     | Envivio-Dash3 | FCC
    Default Test Environment | Envivio-Dash3 | FCC
    OOD Environment 1        | Envivio-Dash3 | SynthTrace
    OOD Environment 2        | SynthVideo    | FCC
    OOD Environment 3        | SynthVideo    | SynthTrace

5.1.2. Cluster Job Scheduling (CJS)

  • Workload Traces:

    • TPC-H benchmark [51]: Primary source of workload traces.
      • Characteristics: Consists of a suite of business-oriented computational jobs with large data volumes and high processing complexity, making it a widely adopted benchmark for job scheduling.
  • Cluster Resources: The number of executor resources in the simulated cluster is varied to generate diverse experimental environments.

  • Choice Justification: TPC-H is a standard for evaluating job scheduling, and varying resources allows testing performance under different workload intensities and resource availability, crucial for cross-environment generalization.

    The following are the environment settings for generalization evaluation in CJS simulation from Extended Data Table 2 of the original paper:

    Environment              | Number of Job Requests | Number of Executors (k)
    Training Environment     | 200                    | 50
    Default Test Environment | 200                    | 50
    OOD Environment 1        | 200                    | 30
    OOD Environment 2        | 450                    | 50
    OOD Environment 3        | 450                    | 30

5.1.3. Congestion Control (CC)

  • Experience Dataset Construction (Douyin internal development platform):
    • Source: Real-world video sessions established using six mobile devices and nine types of media content (e.g., singing, gaming), with diverse and complex network conditions imposed using the enterprise-grade network emulator HoloWAN [52].
    • Scale: Over 30,000 sessions and more than 10 million data samples.
    • Characteristics: Device, media, and network configurations are highly aligned with Douyin's online settings, ensuring realism. CC decisions were determined by one of four randomly selected rule-based policies from Douyin during collection.
    • Split: 95% training subset for LLM fine-tuning, 5% test set for model scale selection and selective invocation evaluation.
  • Choice Justification: This dataset, generated under highly controlled yet realistic conditions, provides high-quality ground-truth bottleneck bandwidths (used as expert demonstrations for CIL) and allows for robust offline evaluation before online deployment.

5.2. Evaluation Metrics

5.2.1. Quality of Experience (QoE) for ABR

  • Conceptual Definition: Quality of Experience (QoE) is a subjective measure of a user's satisfaction with a video streaming service. It aims to quantify how good the user perceives their video watching experience to be, considering factors that impact visual quality and playback smoothness. Higher QoE indicates a better user experience.
  • Mathematical Formula: $ QoE _ { i } = bitrate _ { i } - \lambda _ { 1 } \times rebuf _ { i } - \lambda _ { 2 } \times | \Delta bitrate _ { i } | $
  • Symbol Explanation:
    • $QoE_i$: The Quality of Experience score for chunk $i$.
    • $bitrate_i$: The bitrate (in Mbps) of video chunk $i$. Higher bitrates generally mean better visual quality.
    • $rebuf_i$: The rebuffering time (in seconds) incurred while downloading chunk $i$. Rebuffering causes video stalls, negatively impacting user experience.
    • $|\Delta bitrate_i| = |bitrate_i - bitrate_{i-1}|$: The absolute change in bitrate between chunk $i$ and the previous chunk $i-1$. Frequent or large bitrate changes can be visually jarring for users.
    • $\lambda_1$ and $\lambda_2$: Coefficients (weights) that control the trade-offs among these factors. Based on prior work [12], they are set to $\lambda_1 = 4.3$ and $\lambda_2 = 1$, reflecting higher user sensitivity to rebuffering.
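As a worked example, a small Python helper (ours, not the paper's) that computes the per-chunk QoE with the coefficients above:

```python
LAMBDA_1, LAMBDA_2 = 4.3, 1.0   # rebuffering and smoothness penalties from the paper

def qoe(bitrates_mbps, rebuf_s):
    """Per-chunk QoE as defined above; returns the list of QoE_i values."""
    scores = []
    for i, (b, r) in enumerate(zip(bitrates_mbps, rebuf_s)):
        smooth_penalty = abs(b - bitrates_mbps[i - 1]) if i > 0 else 0.0
        scores.append(b - LAMBDA_1 * r - LAMBDA_2 * smooth_penalty)
    return scores

# a 1.5 s stall on the second chunk dominates its score (illustrative numbers)
print(qoe([1.2, 1.2, 2.85], [0.0, 1.5, 0.0]))
```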

5.2.2. Job Completion Time (JCT) for CJS

  • Conceptual Definition: Job Completion Time (JCT) measures the total duration it takes for a computational job to be fully executed within a cluster. It's a critical metric for assessing the efficiency of job scheduling policies, as shorter completion times mean faster processing of workloads and better resource utilization. Lower JCT indicates better performance.
  • Mathematical Formula: $ JCT = t _ { e } - t _ { s } $
  • Symbol Explanation:
    • JCT: The Job Completion Time.
    • $t_s$: The job arrival time.
    • $t_e$: The job finishing time.

5.2.3. Mean Absolute Percentage Error (MAPE) for CC (Offline)

  • Conceptual Definition: Mean Absolute Percentage Error (MAPE) is a measure of prediction accuracy for a forecasting method. It expresses accuracy as a percentage of the actual value, making it intuitive to understand. In the context of congestion control, it quantifies the relative discrepancy between the predicted bottleneck bandwidth (or application request rate) and the true available bandwidth, indicating how accurately the policy estimates network capacity. Lower MAPE indicates better prediction performance.
  • Mathematical Formula: $ MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\min(b_i^p, rate_i^{req}) - \min(b_i^t, rate_i^{req})}{\min(b_i^t, rate_i^{req})} \right| $
  • Symbol Explanation:
    • $MAPE$: The Mean Absolute Percentage Error.
    • $n$: The total number of samples in the validation dataset.
    • $b_i^p$: The predicted bandwidth for sample $i$.
    • $rate_i^{req}$: The application-specific request rate for sample $i$, representing the upper bound on the sending rate required by the application.
    • $b_i^t$: The true simulated bottleneck bandwidth for sample $i$.
    • $\min(X, Y)$: The minimum of $X$ and $Y$. Both the predicted and true bandwidths are capped at the application's request rate, since sending faster than the application needs yields no benefit and may even cause unnecessary congestion.
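A short Python sketch of this capped MAPE computation (illustrative inputs, not reported data):

```python
def cc_mape(pred_bw, true_bw, rate_req):
    """MAPE as defined above, with both bandwidths capped at the request rate."""
    total = 0.0
    for bp, bt, rq in zip(pred_bw, true_bw, rate_req):
        capped_pred, capped_true = min(bp, rq), min(bt, rq)
        total += abs(capped_pred - capped_true) / capped_true
    return total / len(true_bw)

print(cc_mape(pred_bw=[4.0, 2.0], true_bw=[5.0, 2.5], rate_req=[3.0, 3.0]))  # ≈ 0.1
```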

5.2.4. Request Processing Delay for CC (Offline)

  • Conceptual Definition: Request processing delay measures the time interval from when a flow request arrives at the system until the corresponding congestion control (CC) decision is received. It's a crucial metric for evaluating the real-time responsiveness and efficiency of a CC system, especially in latency-sensitive applications.
  • Components: This delay includes LLM queuing delay (time spent waiting for the LLM to become available) and LLM inference latency (time taken by the LLM to process the request). Under high loads, queuing delay becomes dominant.

5.2.5. Video Stall Rate for CC (Online)

  • Conceptual Definition: Video stall rate is a key industrial metric that directly reflects the smoothness of video playback and overall user experience. It is calculated as the ratio of the total duration of video stalls (rebuffering) to the total video playback duration. A lower stall rate indicates fewer interruptions and a superior quality of service, which is critical for user retention and engagement on platforms like Douyin.
  • Mathematical Formula: $ stall.rate = \frac { \sum _ { i = 1 } ^ { N } d _ { i } ^ { s } } { \sum _ { i = 1 } ^ { N } d _ { i } ^ { p } } $
  • Symbol Explanation:
    • $stall.rate$: The calculated video stall rate.
    • $N$: The total number of flows (video sessions).
    • $d_i^s$: The cumulative stall duration of flow $i$, continuously reported by a monitor.
    • $d_i^p$: The total playback duration of flow $i$.
  • Relative Reduction (Online A/B Tests): $ reduction = \frac{stall.rate_{VICC} - stall.rate_{Trailblazer}}{stall.rate_{VICC}} $ This formula quantifies the percentage improvement of Trailblazer over the VICC baseline.
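A small Python sketch of both quantities (the numbers in the example are made up, not reported results):

```python
def stall_rate(stall_durations, playback_durations):
    """Ratio of total stall time to total playback time across all flows."""
    return sum(stall_durations) / sum(playback_durations)

def relative_reduction(stall_rate_vicc, stall_rate_trailblazer):
    """Relative stall-rate reduction of Trailblazer over the VICC baseline."""
    return (stall_rate_vicc - stall_rate_trailblazer) / stall_rate_vicc

print(relative_reduction(0.020, 0.017))  # ≈ 0.15, i.e. a 15% reduction
```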

5.3. Baselines

5.3.1. Adaptive Bitrate Streaming (ABR)

  • GENET [30]: A learning-based (Reinforcement Learning) policy that combines curriculum learning to train a small neural model for optimizing video streaming. It is representative of state-of-the-art DL-based ABR.
  • BBA (Buffer-Based Approach) [32]: A rule-based heuristic policy that prioritizes maintaining a desired playback buffer occupancy level as a critical signal for bitrate control. It's a classic and widely used ABR algorithm.
  • MPC (Model Predictive Control) [33]: A rule-based policy that uses throughput estimates and buffer occupancy to choose bitrates by optimizing a given QoE metric over a future time window. It represents a more sophisticated heuristic approach.

5.3.2. Cluster Job Scheduling (CJS)

  • Decima [9]: A learning-based (Reinforcement Learning) model for job scheduling that utilizes a Graph Neural Network (GNN) to process job Directed Acyclic Graph (DAG) information. It's a state-of-the-art RL scheduler.
  • FIFO (First-In-First-Out) [34]: A rule-based common scheduling policy that processes jobs in the order of their arrival and allocates requested resources. It's a simple, widely used baseline.
  • Fair scheduling [34]: A rule-based scheduling policy that distributes resources in a "round robin" fashion to ensure each job receives a roughly equal share of cluster resources. Another common heuristic.

5.3.3. Congestion Control (CC)

  • VICC: A production-grade, highly optimized specialist policy currently deployed in Douyin's real-time services. It's an adaptive policy designed to balance bandwidth utilization and latency using multiple adaptive mechanisms (congestion response, bandwidth probing, packet loss detection, jitter resilience). VICC serves as a strong, real-world industrial benchmark.

5.4. Experience Dataset Construction

5.4.1. For ABR

The GENET baseline is used for experience collection. Diverse network environments are simulated, GENET interacts with these environments, and the resulting states, actions, and rewards (QoE scores) are collected to form the training dataset for LLM fine-tuning via the Decision Transformer framework.

5.4.2. For CJS

The Decima model is used to interact with diverse simulated training environments. The resulting state-action-reward tuples (where reward is negative JCT) are collected as the training data to fine-tune the LLM based on Trailblazer using the Decision Transformer framework.

5.4.3. For CC

An experience dataset is constructed using Douyin's internal development platform. This involves:

  • Establishing real-world video sessions with six mobile devices and nine media content types.
  • Imposing diverse and complex network conditions (bandwidth, delay, packet loss) using the HoloWAN network emulator [52].
  • During data collection, CC decisions are made by one of four randomly selected rule-based policies from Douyin.
  • This process yields over 30,000 sessions and more than 10 million data samples, split into 95% for LLM fine-tuning (using Contextual Imitation Learning) and 5% for testing. The ground-truth bottleneck bandwidths inferred in the controlled simulation environment serve as near-optimal expert demonstrations.

5.5. Online Deployment (for CC)

  • Platform: Douyin's online CC service (underpins its social video call application).
  • Duration: 3 weeks of large-scale A/B tests.
  • Users: Over 150,000 users across more than 100 cities.
  • Playback Time: Over 1,200 days of video playback time accumulated.
  • Infrastructure: Six media servers responsible for media sessions and CC logic. Incoming flows are randomly routed.
    • Three servers run the VICC policy.
    • Three servers run Trailblazer.
  • Trailblazer Components Deployment:
    • Scheduler and lightweight CC policy: Deployed locally on the media servers.
    • LLM (Qwen2.5-0.5B): Hosted on a dedicated GPU server.
  • LLM Configuration:
    • Model: Qwen2.5-0.5B (chosen due to early saturation insight, balancing performance and efficiency).
    • Inference Batch Size: 64.
    • GPU Memory: Approximately 4.5 GB.
    • Inference Latency: About 30 ms per inference (for a batch of 64 requests).
  • Communication: Scheduler and LLM communicate via WebSocket API. When the scheduler identifies flows under poor network conditions, it forwards their contextual information to the LLM. The LLM returns control decisions to the scheduler for execution.
  • Latency Target: End-to-end request processing latency consistently below 100 ms. Trailblazer achieves this by combining a small-scale LLM, batched inference, and collaborative control.
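For illustration only, a hedged sketch of how the scheduler on a media server might forward a difficult flow to the GPU-hosted LLM over WebSocket; the endpoint, message schema, and field names are hypothetical, and the third-party `websockets` package is used here purely for illustration, since the paper only states that the two components communicate via a WebSocket API.

```python
import asyncio
import json
import websockets

async def request_llm_decision(flow_context, uri="ws://llm-server:8765"):
    """Hypothetical sketch: send a difficult flow's contextual information to the
    LLM server and receive the congestion-control decision for execution."""
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({"flow_context": flow_context}))
        reply = json.loads(await ws.recv())
    return reply["target_rate"]

# example invocation (field names are illustrative):
# asyncio.run(request_llm_decision({"rtt_ms": 120, "loss_rate": 0.08, "last_rate": 0.6}))
```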

6. Results & Analysis

This section details the experimental results, comparing Trailblazer's performance against specialist baselines in simulated and real-world environments, and provides insights into the factors contributing to its success.

6.1. Generalization Comparison of Generalist and Specialist Policies

The following figure (Figure 3 from the original paper) presents a comprehensive comparison between the generalist approach Trailblazer and specialist baselines on heterogeneous networking tasks and environments:

Fig. 3 Comprehensive comparison between the generalist approach Trailblazer and specialist baselines on heterogeneous networking tasks and environments. For ABR, we benchmark Trailblazer against the learning-based policy GENET [30] and rule-based policies BBA [32] and MPC [33]. For CJS, we compare it against Decima [9], a learning-based policy, as well as two rule-based policies First-In-First-Out (FIFO) [34] and Fair scheduling [34]. a, Performance comparison of cross-task generalization. Results are averaged over three random seeds, with the mean and standard deviation reported. Policies that are not applicable on the specific task are marked with ×. b, Performance comparison of cross-environment generalization under more challenging OOD test settings. Scatter points and boxes represent the distribution of performance, while triangles denote mean values.

6.1.1. Cross-Task Generalization

  • Experiment: Trailblazer (powered by a single LLM) is compared against specialist policies on two heterogeneous tasks: Adaptive Bitrate Streaming (ABR) and Cluster Job Scheduling (CJS).
  • Results (Figure 3a):
    • ABR: Trailblazer consistently outperforms all baselines, achieving 14.5%-36.6% higher QoE.
      • Trailblazer (QoE: ~0.96) is significantly better than GENET (~0.83), BBA (~0.7), and MPC (~0.65).
    • CJS: Trailblazer reduces JCT by 6.8%-41.3%.
      • Trailblazer (JCT: ~0.78) is better than Decima (~0.8), Fair (~1.2), and FIFO (~1.3).
  • Analysis: Conventional rule-based or learning-based policies, which rely on task-specific designs, fail to generalize across different tasks (e.g., GENET for ABR and Decima for CJS are not applicable to the other task, marked with ×). In contrast, Trailblazer, with its single LLM, successfully generalizes across these heterogeneous tasks. This demonstrates that LLMs can serve as a unified foundation for generalist network policies, effectively breaking the task-isolation barrier of the specialist paradigm.

6.1.2. Cross-Environment Generalization

  • Experiment: Trailblazer is evaluated across various challenging out-of-distribution (OOD) test environments that differ substantially from training conditions (e.g., more dynamic bandwidth fluctuations, different video chunk sizes, varying job request numbers/executor counts).
  • Results (Figure 3b): Trailblazer consistently outperforms all baselines in terms of average values and distributions across all OOD cases for both ABR and CJS.
    • ABR (QoE improvement): Compared to rule-based baselines (BBA, MPC), Trailblazer improves mean QoE by 3.9%-24.8%. Compared to learning-based GENET, it improves mean QoE by 1.5%-44.3%.
    • CJS (JCT reduction): Compared to rule-based baselines (FIFO, Fair), Trailblazer reduces mean JCT by 2.5%-6.8%. Compared to learning-based Decima, it reduces mean JCT by 10.5%-41.6%.
  • Analysis: The consistent advantage of Trailblazer in OOD environments highlights a key strength of the generalist paradigm. By leveraging the strong generalization capabilities of LLMs, it adapts robustly to previously unseen conditions where specialist approaches often fail due to their reliance on static priors or limited training data.

6.2. Effects of Knowledge and LLM Model Scale in Networking

The following figure (Figure 4 from the original paper) investigates the success of LLM in networking by studying the importance of pretrained knowledge, domain knowledge, and the impact of LLM model scale:

Fig. 4 Study of the success of LLM in networking. a, Investigation on the importance of pretrained knowledge of the LLM and domain knowledge injected by Trailblazer. b, Investigation on the impact of LLM model scale on task performance.

6.2.1. Insight 1: Pretrained knowledge enables LLMs to function as generalist policies for networking.

  • Experiment: A variant of Trailblazer where the LLM's pretrained weights are discarded, and it is reinitialized and trained from scratch on each downstream task (ABR and CJS).
  • Results (Figure 4a): This variant suffers from significant performance degradation across both tasks.
    • ABR: QoE drops from ~0.96 (Trailblazer) to ~0.76 (Trailblazer w/o pretrain).
    • CJS: JCT increases from ~0.78 (Trailblazer) to ~0.98 (Trailblazer w/o pretrain).
  • Analysis: This validates that the pretrained knowledge of the LLM implicitly encodes transferable and abstract network knowledge, which is a critical prerequisite for the LLM to generalize effectively as a generalist policy. Training from scratch without this foundation severely limits performance.

6.2.2. Insight 2: Domain-specific knowledge is essential to unlock the full generalist potential of LLMs in networking.

  • Experiment: A variant where the LLM backbone is frozen (retaining its pretrained weights), and only the network state encoder and action decoder are fine-tuned. This prevents the LLM from directly acquiring task-specific networking knowledge during training (a minimal sketch of this setup follows this list).
  • Results (Figure 4a): Despite retaining pretrained knowledge, this variant (Trailblazer w/o domain knowledge) fails to generalize across different tasks or performs worse than the full Trailblazer.
    • ABR: QoE drops slightly from ~0.96 to ~0.92.
    • CJS: JCT increases significantly from ~0.78 to ~1.2.
  • Analysis: Pretrained knowledge alone is insufficient without domain alignment. Trailblazer's NIOKA scheme (specifically the offline reinforcement fine-tuning) bridges the gap between abstract pretrained knowledge and fine-grained domain expertise. The differing sensitivity (smaller drop in ABR, larger drop in CJS) suggests that some tasks might be more amenable to abstract reasoning from pretrained knowledge, while others require more specific fine-tuning. Jointly integrating both types of knowledge is key for effective generalization.
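
To make the frozen-backbone variant concrete, here is a minimal PyTorch-style sketch under assumed module names (state_encoder, action_decoder) and a HuggingFace-style backbone interface; it is not the paper's actual code.

```python
# Hypothetical sketch of the "w/o domain knowledge" ablation: keep the
# pretrained LLM weights but freeze them, and fine-tune only the network
# state encoder and action decoder around the backbone.
import torch
from torch import nn


class GeneralistPolicy(nn.Module):
    def __init__(self, llm_backbone: nn.Module, state_dim: int, num_actions: int, hidden: int = 768):
        super().__init__()
        self.state_encoder = nn.Linear(state_dim, hidden)    # network state -> LLM embedding space
        self.llm = llm_backbone                               # pretrained transformer backbone
        self.action_decoder = nn.Linear(hidden, num_actions)  # LLM features -> control decision logits

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.state_encoder(state).unsqueeze(1)  # (batch, 1, hidden) pseudo-token
        # Assumes a HuggingFace-style backbone that accepts inputs_embeds and
        # returns an object with last_hidden_state.
        h = self.llm(inputs_embeds=h).last_hidden_state[:, -1]
        return self.action_decoder(h)


def freeze_backbone(policy: GeneralistPolicy) -> torch.optim.Optimizer:
    """Freeze the LLM weights; optimize only the encoder/decoder parameters."""
    for p in policy.llm.parameters():
        p.requires_grad = False
    trainable = [p for p in policy.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)
```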

6.2.3. Insight 3: LLMs exhibit early saturation in network optimization with increasing model scale.

  • Experiment: Investigation of LLM model scale on ABR task performance using the OPT model family [35] (0.35B, 1B, 2.7B, 6.7B parameters).

  • Results (Figure 4b):

    • OPT-0.35B performs worse than baselines (QoE ~0.8).
    • All OPT variants with 1B parameters or more outperform all baselines (QoE > 0.9).
    • Performance saturates rapidly beyond 1B parameters, with larger models yielding only marginal gains. For example, OPT-1B achieves a QoE of ~0.92, while OPT-6.7B reaches ~0.95.
  • Analysis: This phenomenon, termed early saturation, contrasts with the typical scaling law in NLP where performance consistently improves with model scale. It reveals that the model scale of LLM required for effective network optimization is relatively small. This is an important insight for practical deployment: small LLMs can achieve competitive performance while meeting stringent low-latency requirements of real-world network systems, without the need for excessively large models.

    The following figure (Extended Data Fig. 1 from the original paper) shows a study of the performance of different LLM model families, with the ABR task as the example:

    Extended Data Fig. 1 Bar chart of mean QoE scores for Trailblazer variants built on different LLM backbones (OPT, Mistral, LLaVa, Llama2) and the GENET baseline on the ABR task; all LLM-powered variants reach mean QoE scores close to 1.

  • Analysis: This figure shows that Trailblazer is robust across diverse LLM backbones (Llama2, OPT, Mistral, LLaVa), all standardized to 7B parameters. All LLM-powered Trailblazer variants significantly outperform the GENET baseline. This further validates the feasibility of Trailblazer as a universal framework for aligning LLMs to networking tasks, demonstrating that the framework's design is effective regardless of the specific LLM family used.

6.3. Evaluation in Real-World Network Environments

The following figure (Extended Data Fig. 2 from the original paper) shows a study of the selection of LLM model scale on the CC task:

Extended Data Fig. 2 Study of LLM model scale (Qwen2.5 at 0.5B, 1.5B, 3B, and 7B parameters) on the CC task: a, MAPE (%); b, runtime per batch and per sample (ms). MAPE fluctuates only slightly with scale, while runtime rises markedly.

  • Analysis: This figure explores LLM model scale for Congestion Control (CC) using the Qwen2.5 model family (0.5B, 1.5B, 3B, 7B).

    • Performance (Extended Data Fig. 2a): MAPE (Mean Absolute Percentage Error; see the definition after this list) remains consistently around 36.5% for all Qwen2.5 models from 0.5B parameters upwards, and all significantly outperform VICC (dashed line).
    • Runtime (Extended Data Fig. 2b): Computational cost (runtime per batch and per sample) rises sharply with model size.
  • Conclusion: Given the early saturation in performance beyond 0.5B parameters and the sharp increase in computational cost, the Qwen2.5-0.5B model is chosen as the backbone for the CC task. This choice optimizes the trade-off between performance and efficiency, aligning with the early saturation insight.
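
For reference, the MAPE values reported in these CC experiments follow the standard definition below; reading the prediction and ground truth as the estimated versus true bottleneck bandwidth is our interpretation of the setup, not a quote from the paper:

$$
\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{\hat{y}_i - y_i}{y_i} \right|
$$

where $y_i$ is the ground-truth value for sample $i$, $\hat{y}_i$ is the policy's prediction, and lower values indicate better performance.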

    The following figure (Extended Data Fig. 3 from the original paper) shows a study of the impacts of LLM inference batch size on the CC task:

    Extended Data Fig. 3 Study of LLM inference batch size on the CC task: a, MAPE (%) of the LLM and the VICC baseline across batch sizes (roughly 19.10%-19.21% for the LLM); b, runtime per batch and per sample as the batch size increases.

  • Analysis: This figure investigates the impact of LLM inference batch size on the CC task using Qwen2.5-0.5B.

    • Performance (Extended Data Fig. 3a): MAPE remains stable (around 19.1-19.2%) across varying batch sizes, slightly outperforming VICC (dashed line). This indicates that batching does not compromise task accuracy.
    • Runtime (Extended Data Fig. 3b): Increasing batch size significantly reduces the amortized per-request inference latency, even though per-batch latency increases. For example, at a batch size of 64, the average inference latency is 37.1 ms, which satisfies the 100 ms response latency requirement for CC in Douyin while leaving a safety margin (a back-of-envelope illustration follows this list).
  • Conclusion: Batch processing is an effective strategy to reduce per-request inference latency without compromising performance, crucial for real-time LLM deployment.
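
As a rough back-of-envelope reading of these numbers (ignoring queueing and the time to accumulate a batch), a single batched forward pass amortizes the GPU cost over all requests in the batch:

$$
\text{amortized GPU cost per request} \approx \frac{37.1\ \text{ms per batch}}{64\ \text{requests}} \approx 0.58\ \text{ms}
$$

while each individual request is still answered within roughly one batch latency (about 37 ms), leaving ample headroom under the 100 ms budget.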

6.3.1. Validation of the Effectiveness of Scheduler

The following figure (Figure 5 from the original paper) compares Trailblazer with and without the scheduler to enable efficient collaboration:

Fig. 5 Comparison between Trailblazer with and without the scheduler to enable efficient collaboration. a, Performance analysis of the two variants, evaluated across different proportions (p) of requests under poor network conditions. The performance of VICC and the rule-based CC policy used to collaborate with the LLM is also reported. b, Performance analysis of the two variants under different numbers of peak requests when p = 20% and p = 80%.

  • Experiment: An ablation study comparing Trailblazer with and without the Adaptive Policy Collaboration (APC) scheduler under varying network conditions (proportion p of requests under poor conditions) and system loads (peak requests). Metrics are MAPE (task performance) and request processing delay (system efficiency).
  • Results (Figure 5a - Performance Analysis):
    • Trailblazer (with scheduler) significantly outperforms VICC across all proportions (p) of requests under poor network conditions.
    • Even when collaborating with a simple rule-based CC policy (whose MAPE grows rapidly with pp), Trailblazer maintains high performance.
    • Trailblazer incurs at most 3.08% higher MAPE than Trailblazer w/o scheduler. This performance gap narrows at higher p as more requests are routed to the LLM.
  • Results (Figure 5b - Efficiency Analysis):
    • Trailblazer w/o scheduler (where all requests go to the LLM) exhibits severe performance degradation with significantly higher processing delay as request peak volume increases. For instance, at p = 20% and 2,000 peak requests, its delay is ~345 ms.
    • In contrast, Trailblazer (with scheduler) demonstrates strong resilience and superior efficiency. At p = 20% and a peak load of 2,000 requests, it reduces the average delay from 345 ms (w/o scheduler) to 61 ms (below Douyin's 100 ms CC requirement).
    • This efficiency gain is achieved with only a negligible 2.66% increase in MAPE compared to Trailblazer w/o the scheduler at that specific load.

6.3.2. Insight 4: Selective invocation significantly improves system efficiency without compromising performance.

  • Analysis: These results reveal an important principle: selective invocation. Invoking the LLM for network control only when necessary (i.e., for difficult cases under poor network conditions), rather than for every single request (per-request control), is key. This strategy significantly improves system efficiency by offloading the majority of simple, stable cases to a lightweight policy, effectively addressing the LLM's high inference latency while maintaining high task performance. It is the key to efficiently deploying LLM-based generalist policies in real-time, latency-sensitive network systems; a minimal sketch of such a routing rule is given below.
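
Below is a minimal sketch of such a selective-invocation routing rule. The condition features and thresholds are hypothetical placeholders, not the paper's actual scheduler rules.

```python
# Hypothetical sketch: route only flows under poor network conditions to the
# LLM policy; let a lightweight rule-based policy handle the rest locally.
from dataclasses import dataclass


@dataclass
class FlowStats:
    loss_rate: float      # packet loss ratio over the last window
    rtt_ms: float         # smoothed round-trip time
    bw_variation: float   # coefficient of variation of recent throughput


# Hypothetical thresholds separating "good" from "poor" conditions.
LOSS_THRESH = 0.02
RTT_THRESH_MS = 150.0
BW_VAR_THRESH = 0.3


def is_poor_condition(s: FlowStats) -> bool:
    """Deterministic heuristic deciding whether a flow needs the LLM."""
    return (s.loss_rate > LOSS_THRESH
            or s.rtt_ms > RTT_THRESH_MS
            or s.bw_variation > BW_VAR_THRESH)


def control(s: FlowStats, lightweight_policy, llm_policy) -> float:
    """Return a control decision (e.g., a target sending rate in Mbps)."""
    if is_poor_condition(s):
        return llm_policy(s)          # invoked only for the difficult cases
    return lightweight_policy(s)      # common, stable cases stay local


# Example usage with trivial stand-in policies:
if __name__ == "__main__":
    easy = FlowStats(loss_rate=0.001, rtt_ms=40.0, bw_variation=0.1)
    hard = FlowStats(loss_rate=0.05, rtt_ms=300.0, bw_variation=0.6)
    rule = lambda s: 5.0              # fixed-rate placeholder policy
    llm = lambda s: 2.5               # placeholder for an LLM inference call
    print(control(easy, rule, llm), control(hard, rule, llm))
```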

6.3.3. Effectiveness of Real-World Deployment

The following figure (Figure 6 from the original paper) displays the results of large-scale online A/B tests between the generalist Trailblazer and specialist VICC within Douyin's CC service:

Fig. 6 Results of large-scale online A/B tests between the generalist Trailblazer and specialist VICC within Douyin's CC service. a, Relative reduction of Trailblazer over VICC on different video stall rates. b,c, Relative reduction in video stall rates of Trailblazer over VICC across different client OS and distances to the server. d,e, Client statistics in the A/B tests.

  • Online A/B Test Setup: Trailblazer (using Qwen2.5-0.5B and APC) deployed in Douyin's CC service for 3 weeks, compared against VICC (Douyin's production specialist policy). Served over 150,000 users, accumulated over 1,200 days of video playback. Primary metric: video stall rate.

6.3.3.1. Overall Video Stall Rate Reduction

  • Results (Figure 6a): Trailblazer consistently improves video stall rates across different interruption durations compared to VICC.
    • Reduces 100ms video stall rates by 0.92%.
    • Reduces 200ms video stall rates by 1.28%.
    • Reduces 500ms video stall rates by 0.76%.
  • Analysis: Even seemingly modest gains in stall rates can yield substantial business benefits on a large-user platform like Douyin, translating to improved user retention and higher engagement. This strong empirical validation shows Trailblazer's ability to operate reliably in production environments and deliver measurable industrial improvements.

6.3.3.2. Scalability on Client Heterogeneity (OS)

  • Client Statistics (Figure 6d): Android is the dominant OS (~65%), followed by iOS (~30%), with "Other" OS platforms (mainly HarmonyOS) being a smaller but significant portion (~5%).
  • Results (Figure 6b):
    • Android: Trailblazer performs on par with VICC.
    • iOS: Trailblazer reduces video stall rates by 3.58%.
    • Other OS: Trailblazer significantly outperforms VICC by 24.45%.
  • Analysis: VICC, being a rule-based specialist policy, might not be fully optimized for emerging or less common platforms like HarmonyOS. Trailblazer, as an LLM-based generalist approach, enables rapid OS adaptation without manual tuning, exhibiting strong generalization across heterogeneous and evolving device systems. This highlights its suitability for real-world scenarios with a wide range of devices and operating systems.

6.3.3.3. Impact of Geographical Distance (Network Variability)

  • Client Statistics (Figure 6e): Clients are distributed across various distance categories from the server.
  • Results (Figure 6c): Trailblazer consistently outperforms VICC in reducing video stall rates across all geographical regions, with relative reductions ranging from 0.85% to 10.29%.
    • Notably, the performance gain increases with distance.
  • Analysis: Geographical distance serves as a proxy for network complexity (inter-domain routing, path stability). The increased gain with distance suggests that Trailblazer is particularly effective in complex network environments. This demonstrates its strong generalization and robustness to network variability, enabling large-scale deployment without region-specific customization.

6.4. Data Presentation (Tables)

The following are the results from Extended Data Table 1 of the original paper (training and OOD test environment configurations for the ABR task):

| Environment | Video | Bandwidth Traces |
| --- | --- | --- |
| Training Environment | Envivio-Dash3 | FCC |
| Default Test Environment | Envivio-Dash3 | FCC |
| OOD Environment 1 | Envivio-Dash3 | SynthTrace |
| OOD Environment 2 | SynthVideo | FCC |
| OOD Environment 3 | SynthVideo | SynthTrace |

The following are the results from Extended Data Table 2 of the original paper (training and OOD test environment configurations for the CJS task):

| Environment | Number of Job Requests | Number of Executors (k) |
| --- | --- | --- |
| Training Environment | 200 | 50 |
| Default Test Environment | 200 | 50 |
| OOD Environment 1 | 200 | 30 |
| OOD Environment 2 | 450 | 50 |
| OOD Environment 3 | 450 | 30 |

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Trailblazer, the first systematic framework that successfully grounds Large Language Models (LLMs) as generalist policies for network optimization. By integrating a Network Input-Output-Knowledge Alignment (NIOKA) scheme and an Adaptive Policy Collaboration (APC) mechanism, Trailblazer effectively addresses the challenges of applying LLMs to networking: modality and knowledge misalignment, and computational inefficiency. Through extensive simulations on Adaptive Bitrate Streaming (ABR) and Cluster Job Scheduling (CJS), and crucial large-scale real-world online A/B tests on Douyin's Congestion Control (CC) service, Trailblazer demonstrates superior cross-task and cross-environment generalization compared to conventional specialist policies. The framework not only proves reliable and effective in a production environment, delivering measurable improvements in key industrial performance metrics like video stall rate, but also reveals two fundamental insights: early saturation of LLM performance in networking (small LLMs are sufficient) and the importance of selective invocation for efficiency. These findings establish LLMs as a powerful foundation for a new generalist-driven paradigm in network policy design, promising strong generalization with minimal human effort.

7.2. Limitations & Future Work

The authors acknowledge a key limitation:

  • Explainability: The internal decision logic of LLMs remains difficult to interpret. While Trailblazer is empirically effective, the "why" behind its decisions is not transparent.

    The suggested future work focuses on addressing this limitation:

  • Enhancing Explainability: Future research will concentrate on making the LLM's reasoning process more understandable, for example, by mapping its decisions to explicit representations of its decision logic. This is crucial for fully comprehending the capabilities and identifying areas for improvement in LLM-based generalist policies.

7.3. Personal Insights & Critique

This paper presents a compelling vision for the future of network policy design, shifting from specialist to generalist approaches. The integration of LLMs into a domain as critical and complex as networking is a significant step forward.

Inspirations drawn from this paper:

  1. Generalization Power of LLMs: The paper strongly validates the hypothesis that LLMs possess inherent generalization capabilities that extend beyond natural language processing to complex control tasks in networking. This suggests a broader applicability for LLMs as "generalist agents" across various engineering domains.
  2. Bridging the Modality Gap: The NIOKA scheme offers a clear blueprint for adapting text-centric LLMs to non-textual data, a common challenge when applying LLMs to fields like robotics, control systems, or scientific discovery. The use of task-specific encoders/decoders and offline fine-tuning is a practical and effective strategy.
  3. Balancing Performance and Efficiency: The Adaptive Policy Collaboration (APC) and the insights on early saturation and selective invocation are highly valuable for practical LLM deployment. They highlight that simply scaling up models or blindly invoking them for every decision is not always optimal. Intelligent resource allocation and strategic model invocation are crucial for real-time, latency-sensitive applications. This pragmatic approach is essential for moving LLM research from academic curiosity to industrial utility.
  4. The "Knowledge Base" Angle: The idea that LLMs implicitly compress "fundamental networking principles" from their training data is profound. It suggests that LLMs aren't just pattern matchers, but can assimilate high-level domain knowledge, which can then be grounded in specific tasks. This elevates them beyond traditional deep reinforcement learning agents that often learn from scratch.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  1. Explainability as a Hurdle: While the authors acknowledge explainability as a limitation, its absence is a major concern for mission-critical systems like networks. Operators need to understand why a decision was made to debug issues, ensure compliance, or manually intervene. Future work on mapping internal logic is vital, but the fundamental black-box nature of large neural networks remains a challenge.
  2. Scheduler Robustness: The current APC scheduler relies on simple heuristic, deterministic rules. While efficient, these rules might become brittle in extremely novel or adversarial network conditions that fall between the "good" and "poor" thresholds. A learning-based scheduler could potentially be more adaptive, though this would reintroduce inference overhead. A hybrid, lightweight learning approach for the scheduler could be an interesting future direction.
  3. Data Dependence for Fine-tuning: The NIOKA relies on offline experience datasets collected from conventional policies. The quality and diversity of this dataset are crucial. If the baseline policies are poor or the collected environments are not sufficiently diverse, the LLM's fine-tuned performance might be limited by the quality of its "teachers." Generating truly optimal expert demonstrations for complex, dynamic networking tasks is itself a hard problem.
  4. Computational Cost of Fine-tuning: While inference for small LLMs can be efficient, the offline reinforcement fine-tuning process (especially for large LLMs) can still be computationally intensive. The paper doesn't detail the training time or resource requirements for fine-tuning, which could be a practical barrier for smaller organizations or faster iteration cycles.
  5. LLM Security and Robustness: The paper doesn't discuss potential security vulnerabilities (e.g., adversarial attacks on input states) or robustness issues (e.g., sensitivity to noisy or manipulated network measurements) when using LLMs for control. This is a critical area for any AI-driven control system.
  6. "Early Saturation" Generalizability: While early saturation is observed for the tasks studied, it's an empirical finding. It's not guaranteed to hold for all networking tasks or all LLM architectures. Further theoretical or empirical work is needed to understand the underlying reasons for this phenomenon.

Transferability and Applications: The methods and conclusions of this paper are highly transferable to other domains involving real-time control, sensing, and decision-making where generalization across tasks and environments is challenging and latency is critical. Examples include:

  • Robotics: Controlling robot movements or interactions where LLMs could interpret high-level goals and sensor data, with APC handling routine tasks and the LLM managing complex, novel situations.

  • Autonomous Driving: LLMs could act as a high-level policy for route planning or unusual scenario handling, collaborating with specialized, low-latency controllers for basic vehicle dynamics.

  • Smart Grids: Optimizing energy distribution or responding to grid anomalies, where LLMs could interpret system states and APC could manage routine load balancing.

  • Resource Management in Cloud Computing: Beyond job scheduling, LLMs could optimize resource allocation (CPU, memory, storage) for various microservices, adapting to changing workloads and application requirements.

    Overall, Trailblazer represents a significant stride towards creating adaptable and efficient AI-driven control policies. While challenges remain, particularly in explainability, the demonstrated real-world performance and insights into LLM behavior make this a landmark paper in the application of LLMs beyond their traditional linguistic boundaries.
