NetLLM: Adapting Large Language Models for Networking
TL;DR Summary
This study introduces the NetLLM framework, which adapts large language models to efficiently solve networking tasks, reducing engineering costs and improving generalization. In three specific applications, NetLLM outperforms existing state-of-the-art algorithms.
Abstract
Many networking tasks now employ deep learning (DL) to solve complex prediction and optimization problems. However, current design philosophy of DL-based algorithms entails intensive engineering overhead due to the manual design of deep neural networks (DNNs) for different networking tasks. Besides, DNNs tend to achieve poor generalization performance on unseen data distributions/environments. Motivated by the recent success of large language models (LLMs), this work studies the LLM adaptation for networking to explore a more sustainable design philosophy. With the powerful pre-trained knowledge, the LLM is promising to serve as the foundation model to achieve "one model for all tasks" with even better performance and stronger generalization. In pursuit of this vision, we present NetLLM, the first framework that provides a coherent design to harness the powerful capabilities of LLMs with low efforts to solve networking problems. Specifically, NetLLM empowers the LLM to effectively process multimodal data in networking and efficiently generate task-specific answers. Besides, NetLLM drastically reduces the costs of fine-tuning the LLM to acquire domain knowledge for networking. Across three networking-related use cases - viewport prediction, adaptive bitrate streaming and cluster job scheduling, we showcase that the NetLLM-adapted LLM significantly outperforms state-of-the-art algorithms.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
NetLLM: Adapting Large Language Models for Networking
1.2. Authors
Duo Wu, Xianda Wang, Yaqi Qiao, Zhi Wang, Junchen Jiang, Shuguang Cui, Fangxin Wang. The authors are primarily affiliated with the School of Science and Engineering (SSE) and the Future Network of Intelligence Institute (FNii) at The Chinese University of Hong Kong, Shenzhen, with additional affiliations from the Shenzhen International Graduate School (SIGS) of Tsinghua University and The University of Chicago. This indicates a collaborative effort involving researchers from prominent institutions in computer science and networking.
1.3. Journal/Conference
ACM SIGCOMM 2024 Conference (ACM SIGCOMM '24), August 4-8, 2024, Sydney, NSW, Australia. SIGCOMM is a highly prestigious and influential conference in the field of computer networking. Publication at SIGCOMM signifies that the research is considered to be of high quality, impactful, and at the forefront of networking advancements.
1.4. Publication Year
First published (UTC): 2024-02-04T04:21:34.000Z as an arXiv preprint. The official conference publication appears at ACM SIGCOMM in August 2024.
1.5. Abstract
The paper addresses the limitations of current deep learning (DL)-based algorithms in networking, specifically their intensive engineering overhead due to manual deep neural network (DNN) design for diverse tasks, and their poor generalization to unseen data. Motivated by the success of large language models (LLMs), this work proposes to adapt LLMs for networking tasks, envisioning them as "foundation models" capable of "one model for all tasks" with improved performance and generalization. The authors present NetLLM, the first framework designed to efficiently harness LLMs for networking problems. NetLLM enables LLMs to process multimodal networking data effectively and generate task-specific answers efficiently. It also significantly reduces the costs of fine-tuning LLMs for domain knowledge acquisition. Through evaluations on three networking use cases—viewport prediction, adaptive bitrate streaming, and cluster job scheduling—NetLLM demonstrates significant outperformance over state-of-the-art algorithms.
1.6. Original Source Link
Original Source Link: https://arxiv.org/abs/2402.02338 PDF Link: https://arxiv.org/pdf/2402.02338v3.pdf Publication Status: Preprint (published on arXiv), accepted for ACM SIGCOMM 2024.
2. Executive Summary
2.1. Background & Motivation
The paper addresses critical challenges in applying deep learning (DL) to solve complex prediction and optimization problems in networking. While DL-based algorithms have shown promise, they suffer from two major limitations:
- High Model Engineering Costs: Designing specialized deep neural network (DNN) architectures for each distinct networking task is labor-intensive and difficult. The "one model for one task" paradigm leads to significant engineering overhead. Even with approaches like Transformers, manual tuning and specialized designs remain costly.
- Poor Generalization: DNNs often perform poorly when encountering unseen data distributions or environments. This lack of generalization hinders their widespread deployment in real-world network systems, as operators question their superiority over conventional rule-based algorithms in production.
The paper identifies a crucial gap: the success of large language models (LLMs) in natural language processing (NLP) and other domains (robotics, chip design) suggests their potential as "foundation models" that can generalize across tasks with minimal handcrafting. LLMs possess emergent abilities like planning, pattern mining, problem-solving, and strong generalization, which could be highly beneficial for networking tasks.
The paper's entry point is to explore whether and how LLMs can be adapted to solve various networking tasks efficiently and effectively, thereby pioneering a "one model for all tasks" philosophy in networking, reducing handcraft costs, and enhancing generalization.
2.2. Main Contributions / Findings
The paper presents NetLLM, the first framework designed to efficiently adapt Large Language Models (LLMs) for networking tasks. Its primary contributions and key findings are:
- Identification of Key Challenges for LLM Adaptation in Networking: The authors rigorously analyze and articulate three specific challenges:
- Large Modality Gap: Networking data (time-series, graphs, images, scalars) significantly differs from the text-based native input of LLMs.
- Inefficiency of Token-based Answer Generation: LLMs' autoregressive token generation can lead to physically invalid answers (hallucination) and high latency, unsuitable for real-time network systems.
- High Adaptation Costs: Fine-tuning large LLMs, especially for Reinforcement Learning (RL)-based decision-making tasks requiring extensive environment interaction, is prohibitively expensive.
- Design of NetLLM Framework with Three Core Modules:
- Multimodal Encoder: Addresses the modality gap by automatically projecting diverse networking inputs into the LLM's token-like feature space, leveraging existing modality-specific feature encoders (e.g., ViT for images, 1D-CNN for time-series, GNN for graphs) and trainable linear projections.
- Networking Head: Replaces the default Language Modeling (LM) head, enabling efficient and reliable answer generation. It generates task-specific answers directly within a valid range in a single inference, preventing hallucination and reducing latency.
- Data-Driven Low-Rank Networking Adaptation (DD-LRNA): Drastically reduces fine-tuning costs. It employs a data-driven pipeline for both supervised learning (SL) and RL tasks (eliminating costly environment interaction in RL by using pre-collected experience datasets) and introduces low-rank matrices for parameter-efficient fine-tuning (reducing trainable parameters to 0.31%).
- Demonstrated Superior Performance and Generalization:
- Across Three Diverse Use Cases: The NetLLM-adapted LLM (Llama2-7B) was evaluated on viewport prediction (VP), adaptive bitrate (ABR) streaming, and cluster job scheduling (CJS), which cover both prediction and decision-making tasks, centralized and distributed control, and diverse input modalities.
- Significant Performance Gains: NetLLM consistently and significantly outperforms state-of-the-art learning-based and rule-based algorithms: 10.1-36.6% MAE reduction for VP, 14.5-36.6% QoE improvement for ABR, and 6.8-41.3% JCT reduction for CJS in environments similar to training.
- Stronger Generalization: NetLLM shows superior performance in unseen environments, where conventional learning-based algorithms often struggle or even perform worse than rule-based ones. This highlights the LLM's extensive pre-trained knowledge and emergent abilities (e.g., pattern mining, planning) as crucial for generalization.
- Properties and Transferability: NetLLM exhibits compatibility with different LLMs, ensures reliability by guaranteeing valid answers, offers efficiency in adaptation and answer generation, and holds potential for transferability to other domains (medicine, finance, telecommunication) due to its handling of prediction and decision-making tasks.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand NetLLM, a basic grasp of the following concepts is essential:
- Deep Learning (DL): A subfield of machine learning that uses artificial neural networks with multiple layers (deep neural networks or DNNs) to learn complex patterns from data. DL has achieved remarkable success in areas like image recognition, natural language processing, and speech recognition.
- Deep Neural Networks (DNNs): Neural networks characterized by having many layers between the input and output layers. These layers extract hierarchical features from data, allowing them to learn highly abstract representations.
- Supervised Learning (SL): A machine learning paradigm where an algorithm learns from a labeled dataset. For each input, there's a corresponding correct output (label). The goal is for the model to learn a mapping from inputs to outputs, allowing it to predict outputs for new, unseen inputs. Examples include classification (predicting categories) and regression (predicting continuous values).
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent performs actions, receives rewards (or penalties) from the environment, and adjusts its policy to maximize cumulative reward over time. It's often used for decision-making problems.
- Large Language Models (LLMs): These are advanced deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data (billions of words). They learn to understand, generate, and process human language, acquiring extensive "world knowledge" and emergent abilities like reasoning, planning, and generalization. Examples include ChatGPT, Llama2, and Falcon.
- Transformer Architecture: The dominant neural network architecture for sequence modeling tasks, particularly in NLP. It introduced the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element.
- Attention Mechanism: At its core, the attention mechanism allows the model to focus on different parts of the input sequence when generating an output. For a query $Q$, keys $K$, and values $V$, the output of scaled dot-product attention is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $, where $Q$ represents the query (what we are looking for), $K$ represents the keys (what features are available), $V$ represents the values (the information associated with the features), and $d_k$ is the dimension of the keys, used for scaling. The softmax function normalizes the attention scores (a small computational sketch is given after this list).
- Tokenization: The process of breaking down raw text into smaller units called "tokens." These tokens can be words, subwords, or characters. For LLMs, subword tokenization (e.g., "awesome" -> "aw", "esome") is common, allowing the model to handle a large vocabulary while keeping the total number of unique tokens manageable.
- Embeddings: Numerical representations (vectors) of words, tokens, or other data types. In LLMs, each token is mapped to a high-dimensional vector space, where tokens with similar meanings are located closer together. These embeddings are what the neural network processes.
- Language Modeling (LM) Head: The final layer of a traditional LLM that predicts the probability distribution of the next token in a sequence. It takes the LLM's internal representation (hidden states), maps it to a vocabulary-sized vector, and then applies softmax to get probabilities for each possible next token. This is how LLMs generate text token by token.
- Fine-tuning: The process of taking a pre-trained model (like an LLM) and further training it on a smaller, task-specific dataset. This adapts the model's general knowledge to a particular domain or task, typically by updating some or all of its parameters.
- Parameter-Efficient Fine-Tuning (PEFT): Techniques designed to fine-tune large models without updating all their parameters. Instead, they introduce a small number of new, trainable parameters (e.g., low-rank matrices) while keeping most of the pre-trained model frozen. This significantly reduces computational costs and memory requirements. Low-Rank Adaptation (LoRA) is a prominent example of such a technique, where a frozen weight matrix $W$ is updated by adding a low-rank decomposition $AB$, such that the effective weight becomes $W + AB$.
- Networking Tasks:
- Viewport Prediction (VP): In immersive video (e.g., 360° video), predicting where a user will look next (their "viewport") is crucial to stream only the relevant high-quality content, saving bandwidth.
- Adaptive Bitrate (ABR) Streaming: Dynamically adjusting the quality (bitrate) of video segments based on current network conditions (throughput, delay) and playback buffer status to optimize user experience (Quality of Experience, QoE).
- Cluster Job Scheduling (CJS): Optimizing the allocation of resources (e.g., CPU, memory) and scheduling of tasks (job stages) within a distributed computing cluster to minimize job completion times and maximize resource utilization.
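To make the attention formula above concrete, here is a minimal, self-contained PyTorch sketch of scaled dot-product attention. It is purely illustrative (toy shapes, no masking or multi-head logic) and is not code from the paper.

```python
# Illustrative sketch (not from the paper): scaled dot-product attention
# computed exactly as the formula above, using PyTorch.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: tensors of shape (batch, seq_len, d_k) / (batch, seq_len, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # normalize attention scores
    return weights @ V                              # weighted sum of the values

# Toy usage: one sequence of 5 tokens with 16-dimensional embeddings.
Q = torch.randn(1, 5, 16)
K = torch.randn(1, 5, 16)
V = torch.randn(1, 5, 16)
out = scaled_dot_product_attention(Q, K, V)   # shape: (1, 5, 16)
```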
3.2. Previous Works
The paper contextualizes its work by reviewing existing approaches in networking:
- Rule-Based Algorithms: Historically, network optimization relied on hand-crafted control rules. Examples include Copa [6] for congestion control (adjusts sending rates based on queueing delay) and PANDA [53] for ABR (switches bitrates based on estimated bandwidth).
- Limitations: These algorithms are labor-intensive due to the significant human effort required for rule engineering, implementation, and validation.
- Learning-Based Algorithms (DL/DNN-based): More recently, deep learning has been applied to networking tasks.
- Supervised Learning (SL) for Prediction: DNNs are trained for tasks like traffic classification [54, 73] and bandwidth prediction [9, 64]. Yan et al. [105] trained a DNN for future bandwidth prediction in video clients.
- Reinforcement Learning (RL) for Decision-Making: DNNs are used for congestion control [1, 106], ABR [44, 62], and CJS [63, 78]. Mao et al. [63] used RL for resource allocation in distributed computing clusters.
- Limitations:
- High Model Engineering Costs: Require manual design of specialized DNN architectures for each task, shifting the engineering burden from rules to models. Even with Transformers, specific architecture tuning or specialized tokenization/attention mechanisms are needed [36, 99].
- Poor Generalization: DNNs often struggle on unseen data distributions or environments, sometimes performing worse than rule-based algorithms [44, 105]. This limits their practical deployment.
- Domain-Adapted LLMs (in other fields): The paper draws motivation from successful LLM adaptations outside NLP.
- Robotics: PaLM-E [23] adapts PaLM [15] for robotic manipulation by generating step-by-step control commands and adapting to environmental changes.
- Biology: ESMFold [55] uses LLMs to predict atomic-level protein structures.
- Chip Design: ChipNeMo [59] develops domain-adapted LLMs for chip design.
- Networking (limited prior work): Some studies showed LLMs generating technical documents for networking (e.g., digital twin descriptions [51]), but none provided an in-depth investigation into adapting LLMs to solve core networking tasks.
3.3. Technological Evolution
The field of network optimization has evolved through distinct phases:
- Hand-crafted Rule-based Systems (Decades ago to present): Early and still prevalent systems rely on human experts designing explicit rules and heuristics (e.g., if buffer drops below X, reduce bitrate). These are robust in known conditions but brittle and costly to adapt to new scenarios.
- Traditional Machine Learning (ML) (Pre-2010s): Simple ML models for prediction tasks (e.g., linear regression for bandwidth prediction) began to emerge, but were limited by their expressiveness for complex network dynamics.
- Deep Learning (DL) based Systems (2010s-present): With advancements in computational power and data availability, DNNs offered powerful function approximation capabilities. SL-based methods tackled prediction (e.g., traffic classification), while RL-based methods addressed decision-making (e.g., ABR, congestion control). This marked a shift from rule engineering to model engineering.
- Large Language Model (LLM) Adaptation (Emerging, 2020s-present): Motivated by the "foundation model" paradigm, this phase seeks to leverage the pre-trained knowledge and emergent abilities of LLMs to create highly generalized and adaptable solutions across diverse networking tasks, aiming to reduce the model engineering burden and improve robustness significantly. NetLLM represents a pioneering work in this emerging phase for networking.
3.4. Differentiation Analysis
Compared to prior work, NetLLM offers several core innovations and differentiators:
- Addressing End-to-End LLM Adaptation for Networking: Unlike previous work that only explored LLMs for networking documentation [51], NetLLM is the first comprehensive framework to adapt LLMs to solve core networking prediction and decision-making problems.
- Overcoming the Modality Gap:
- Prior DL: DNNs are typically designed for specific modalities (e.g., CNNs for images, LSTMs for time-series) or require complex, task-specific feature engineering.
- Prompt Learning (LLMs): A natural alternative for LLMs, but NetLLM demonstrates empirically that prompt learning (transforming diverse data into text) is sub-optimal and infeasible for complex modalities (e.g., images, graphs).
- NetLLM's Multimodal Encoder: Directly addresses this by using modality-specific encoders (reusing existing, effective ones) and projecting features into the LLM's token space. This is more effective than textual prompts for non-textual data.
- Efficient and Reliable Answer Generation:
- Prior DL: Outputs are typically numerical predictions or actions, directly mapped from DNN outputs.
- Standard LLMs (LM Head): Autoregressive token generation can lead to "hallucinations" (physically invalid answers) and high latency due to multiple inference steps, which is detrimental in real-time networking.
- NetLLM's Networking Head: Replaces the LM head with a task-specific linear layer, ensuring direct generation of valid answers within a single inference, thus enhancing reliability and reducing latency significantly.
- Cost-Effective Adaptation for LLMs:
- Prior DL: Fine-tuning DNNs can be costly, especially for RL tasks requiring extensive environment interaction.
- Full Fine-tuning LLMs: Prohibitively expensive due to large parameter counts (high GPU memory, long training times) and risks disrupting pre-trained knowledge.
- NetLLM's DD-LRNA:
- Data-Driven RL: Eliminates time-consuming environment interaction for RL tasks by leveraging pre-collected experience datasets.
- Low-Rank Adaptation: Introduces a small set of trainable low-rank matrices (similar to LoRA) while freezing the LLM's main parameters. This drastically reduces memory (60.9% less GPU memory) and training time (15.1% less training time) while preserving pre-trained knowledge for multi-task sharing.
- "One Model for All Tasks" with Enhanced Generalization:
- Prior DL: Typically "one model for one task," requiring specialized DNN design for each; such models struggle to generalize to unseen environments.
- NetLLM: Uses a single LLM as a foundation model across diverse networking tasks (VP, ABR, CJS) without modifying its core architecture. The framework allows the LLM to learn domain knowledge via adapters, demonstrating superior generalization compared to specialized SOTA DNNs.
In essence, NetLLM systematically tackles the unique challenges of bridging LLMs with networking, providing a holistic and empirically validated solution that surpasses existing paradigms in performance, efficiency, and generalization.
4. Methodology
The NetLLM framework is designed to efficiently adapt Large Language Models (LLMs) for networking tasks by addressing the challenges of multimodal input, efficient answer generation, and high adaptation costs. It comprises three main building blocks: a multimodal encoder, a networking head, and a Data-Driven Low-Rank Networking Adaptation (DD-LRNA) scheme.
The overall architecture of NetLLM is depicted in Figure 5 (from the original paper). During fine-tuning, the parameters of the core LLM are frozen to preserve its extensive pre-trained knowledge. The multimodal encoder, networking heads, and the introduced low-rank matrices are the only tunable components, allowing for task-specific optimization with minimal overhead.
4.1. Multimodal Encoder
The primary goal of the multimodal encoder is to bridge the significant modality gap between diverse networking inputs and the LLM's native text-based input format. It projects multimodal input data into the same feature space as language tokens, making them understandable and usable by the LLM.
The architecture of the multimodal encoder consists of two main blocks (Figure 6 from the original paper):
4.1.1. Feature Encoder
Instead of designing new encoders from scratch, NetLLM reuses existing, well-designed feature encoders tailored for specific data modalities. This pragmatic approach minimizes model engineering costs.
- Images: For image data (e.g., video content information in Viewport Prediction), NetLLM employs a Vision Transformer (ViT) [22]. ViT is effective at extracting high-level features from images. By default, the pre-trained weights of ViT are frozen to leverage its general visual understanding.
- Time-series and Sequence Data: For data exhibiting time-varying patterns, such as historical throughputs, delays in ABR, or historical viewports in VP, a 1D Convolutional Neural Network (1D-CNN) [62] is used. 1D-CNNs are well-suited for capturing temporal dependencies and features in sequential data.
- Scalar Data: Simple numerical values, like the current buffer length in ABR, are processed using a fully connected layer.
- Graph Information: For complex relational data like Directed Acyclic Graphs (DAGs) describing job dependencies and resource demands in Cluster Job Scheduling (CJS), a Graph Neural Network (GNN) [63, 101] is utilized. GNNs are designed to operate on graph structures, learning representations of nodes and edges.
4.1.2. Linear Projection
The features extracted by these modality-specific encoders often have different dimensions (e.g., ViT features might be 768-dimensional, while Llama2 expects 4096-dimensional input embeddings). To reconcile these differences and align the features with the LLM's token space, NetLLM uses a set of trainable linear layers.
- Function: These linear layers learn an efficient mapping from the encoder's feature space to the LLM's token embedding space.
- Output: They produce a set of "token-like embedding vectors" that can be directly fed into the LLM.
- Normalization: To ensure training stability, the output embeddings from the linear projection layers are further normalized using layer normalization [7].
Example with VP Task: As illustrated in Figure 6 (from the original paper), for the Viewport Prediction (VP) task, image data (video content) is processed by a ViT, and time-series viewport data is processed by a 1D-CNN. The extracted features from both are then passed through separate linear projection layers. Finally, all resulting embeddings are layer-normalized and concatenated before being fed into the LLM. The experimental results (Figure 2, left) show that this multimodal encoder significantly outperforms prompt-learning-based approaches, reducing the Mean Absolute Error (MAE) for VP by 19.7%.
The figure is a schematic of NetLLM's multimodal encoder for encoding multimodal data: input data (e.g., images and viewport traces) pass through feature encoders and linear projections, and the resulting layer-normalized, token-like embeddings are fed into the LLM.
Figure 6 (from the original paper): Illustration of the multimodal encoder of NetLLM to encode multimodal data.
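To make the encoder design concrete, the following is a minimal PyTorch sketch of the idea for the VP case: a 1D-CNN feature encoder over historical viewports, a trainable linear projection into the LLM's token-embedding space, and layer normalization. The layer sizes (e.g., a 4096-dimensional LLM embedding, as for Llama2-7B) and the CNN configuration are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal sketch (assumed shapes, not NetLLM's actual code) of the
# multimodal-encoder idea for the VP task: a modality-specific feature encoder
# (here a 1D-CNN over historical viewports), a trainable linear projection into
# the LLM's token-embedding space, and layer normalization.
import torch
import torch.nn as nn

class ViewportEncoder(nn.Module):
    def __init__(self, in_channels=3, feat_dim=256, llm_dim=4096):
        super().__init__()
        # Feature encoder: 1D-CNN over the time axis of (roll, pitch, yaw) series.
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Linear projection: map encoder features into token-like embeddings.
        self.proj = nn.Linear(feat_dim, llm_dim)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, viewports):
        # viewports: (batch, 3, history_len)
        feats = self.cnn(viewports)            # (batch, feat_dim, history_len)
        feats = feats.transpose(1, 2)          # (batch, history_len, feat_dim)
        tokens = self.norm(self.proj(feats))   # (batch, history_len, llm_dim)
        return tokens                          # fed to the LLM like text tokens

tokens = ViewportEncoder()(torch.randn(2, 3, 10))  # e.g., a short viewport history
print(tokens.shape)                                # torch.Size([2, 10, 4096])
```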
4.2. Networking Head
The networking head module is designed to enable efficient and reliable answer generation for specific networking tasks, overcoming the limitations of the default Language Modeling (LM) head used by LLMs.
4.2.1. Limitations of LM Head
Traditional LLMs use an LM head to predict tokens one by one in an autoregressive manner (Figure 1 from the original paper). This approach has two drawbacks for networking:
- Hallucination: The inherent uncertainty in token prediction can lead to physically invalid answers (e.g., predicting a non-existent bitrate for ABR). Figure 2 (middle) shows that an LM head fails to guarantee 100% valid answers for VP.
- High Latency: Generating a complete answer often requires multiple inference steps (token by token). This can result in high answer generation latency, making it unsuitable for real-time network systems that demand quick responsiveness. Figure 2 (right) shows this can be as high as 3.84s for VP, exceeding a 1-second response deadline.
The figure is a schematic of the token-based answer generation mechanism of LLMs (e.g., GPT-3, Llama2), showing the tokenizer, tokens, the vocabulary and its embeddings, and an input text with the generated answer "6371 km".
Figure 1 (from the original paper): Illustration of the token-based answer generation mechanism of LLMs.
4.2.2. Design of Networking Head
NetLLM replaces the default LM head with task-specific networking heads.
- Structure: Each networking head is a lightweight trainable linear layer.
- Function: It maps the high-level features output by the LLM directly into task-specific answers.
- Direct Answer Generation: Unlike token prediction, the networking head generates the final answer in a single inference step.
- Reliability: The output of the networking head is constrained to the valid range of possible answers for the specific task (e.g., valid viewport coordinates, a set of predefined bitrates). This guarantees that all generated answers are physically valid, enhancing the reliability of the LLM for networking.
- Efficiency: By generating answers in a single inference, the networking head significantly reduces answer generation latency.
Example with ABR Task: Figure 7 (from the original paper) contrasts the LM head with the networking head for an ABR task (a minimal code sketch follows the figures below).
- LM Head: Predicts tokens sequentially, potentially leading to invalid bitrates or requiring multiple inferences to form a complete bitrate decision.
- Networking Head: Directly predicts a probability distribution over a discrete set of candidate bitrates (e.g., {750, 2850, 4300} kbps). The bitrate with the highest probability is then chosen as the action. This ensures a valid bitrate is always selected in a single inference.
The figure is a schematic comparing the LM head and the ABR (networking) head on the adaptive bitrate task, showing video chunks encoded at different bitrates and marking which generated answers are invalid versus valid, successfully downloadable bitrate choices (e.g., 2850 kbps and 4300 kbps).
Figure 7 (from the original paper): Comparison between LM head and networking head with ABR task as an example. For illustration, we assume that video chunks are encoded into three bitrate versions {750, 2850, 4300} kbps.
The figure is a chart with three panels: the left bars show the mean absolute error of prompt learning versus NetLLM (NetLLM clearly lowers the error); the middle bars show the ratio of valid answers for token-based prediction versus NetLLM (NetLLM reaches 100%); the right bars show that NetLLM's answer generation time is far lower than token-based prediction. Overall, NetLLM wins on both performance and timeliness.
Figure 2 (from the original paper): Illustration of the ineffectiveness for some natural alternatives with VP task as the example. Left: Prompt learning [60, 68] that transforms data into textual prompts achieves sub-optimal performance, while NetLLM with a multimodal encoder to encode task input data effectively outperforms baseline. Middle, Right: Token-based prediction with LM head fails to guarantee valid answers and produce stale responses, while NetLLM efficiently addresses these issues with the networking head module.
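As a concrete illustration of the networking-head idea for ABR (referenced above), the sketch below maps an LLM hidden state to a probability distribution over a fixed bitrate ladder with a single linear layer, so the selected answer is always valid and is produced in one inference. The hidden size and bitrate ladder are assumptions taken from the Figure 7 example, not NetLLM's actual implementation.

```python
# A minimal sketch (assumed hidden size and bitrate ladder) of an ABR networking
# head: one linear layer maps the LLM's last hidden state to a distribution over
# candidate bitrates, so every answer is valid and produced in one inference.
import torch
import torch.nn as nn

BITRATES_KBPS = [750, 2850, 4300]   # illustrative bitrate ladder from Figure 7

class ABRHead(nn.Module):
    def __init__(self, llm_dim=4096, num_bitrates=len(BITRATES_KBPS)):
        super().__init__()
        self.fc = nn.Linear(llm_dim, num_bitrates)

    def forward(self, llm_hidden_state):
        # llm_hidden_state: (batch, llm_dim), e.g., the feature of the last token
        logits = self.fc(llm_hidden_state)
        return torch.softmax(logits, dim=-1)   # distribution over valid bitrates

head = ABRHead()
probs = head(torch.randn(1, 4096))
choice = BITRATES_KBPS[int(probs.argmax(dim=-1))]   # always a valid bitrate
```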
4.3. Data-Driven Low-Rank Networking Adaptation (DD-LRNA)
The DD-LRNA scheme is designed to drastically reduce the high costs associated with fine-tuning LLMs, particularly for large models and RL-based tasks. It achieves this through a data-driven adaptation pipeline and a low-rank adaptation approach.
4.3.1. Data-Driven Networking Adaptation
This component provides efficient training pipelines for both prediction (SL) and decision-making (RL) networking tasks.
- Supervised Learning (SL) Tasks (e.g., VP):
- Process: For prediction tasks, standard SL data-driven training is used, with a task-specific dataset of inputs $x$ and labels $y$.
- Flow: The multimodal encoder processes the input $x$, the LLM (with its frozen parameters and low-rank adapters) generates features, and the networking head produces the prediction results $\hat{y}$.
- Loss Function: The training loss is computed based on a predefined loss function $F_{sl}$:
$
\mathcal{L}_{sl} = F_{sl}(y, \hat{y})
$
where $F_{sl}$ can be cross entropy (CE) for classification (e.g., traffic classification) or mean squared error (MSE) for regression (e.g., bandwidth prediction, VP). This loss is then used to update the trainable parameters (multimodal encoder, networking head, and low-rank matrices).
- Reinforcement Learning (RL) Tasks (e.g., ABR, CJS):
- Challenge: Traditional RL training involves time-consuming active interaction between the LLM agent and the environment to collect experiences. For large LLMs, this is prohibitively expensive. Figure 3 (from the original paper) shows that experience collection accounts for 52.27% and 39.25% of total training time for ABR and CJS tasks, respectively, in standard RL.
- Solution: DD-LRNA adopts an efficient data-driven RL technique (also known as offline RL) [79, 106]. Instead of active interaction, it uses an experience pool collected once by existing (non-LLM) networking algorithms. This eliminates the need for real-time environment interaction during LLM fine-tuning, significantly reducing costs (e.g., 51.1%/37.7% training time reduction for ABR/CJS).
- Process:
- Experience Collection: An existing policy (e.g., GENET for ABR, Decima for CJS) interacts with the environment to collect an experience dataset $\mathcal{D}_{rl}$. Each trajectory consists of rewards $r_t$, states $s_t$, and actions $a_t$ over $T$ time steps.
- Return Calculation: Instead of the raw reward, the reward in each sample is replaced by the return $R_t$, which is the sum of future rewards expected from state $s_t$: $ R_t = \sum_{i=t}^T r_i $
- State and Action Discretization: States and actions, which might be composed of multiple pieces of information, are discretized into components: $s_t = \{ s_t^1, \cdots, s_t^n \}$ and $a_t = \{ a_t^1, \cdots, a_t^m \}$.
- Trajectory Representation: The trajectory is then represented as: $ \tau = \{ R_t, s_t^1, \cdots, s_t^n, a_t^1, \cdots, a_t^m \}_{t=1}^T $
- Fine-tuning: The LLM is fine-tuned to learn the distribution of returns. At each training step, a sequence of data is sampled from the dataset within a context window $w$:
$
\boldsymbol{d} = \{ R_i, s_i^1, \cdots, s_i^n, a_i^1, \cdots, a_i^m \}_{i=t-w+1}^t \in \mathcal{D}_{rl}
$
This sequence is fed to the LLM (via the multimodal encoder for the state/action components) to generate actions $\hat{a}_i^j$.
- Loss Function: The training loss is calculated to minimize the difference between the generated actions and the actual actions in the dataset:
$
\mathcal{L}_{rl} = \frac{1}{w} \sum_{i=1}^w \sum_{j=1}^m F_{rl}(a_i^j, \hat{a}_i^j)
$
where $F_{rl}$ measures the difference (e.g., CE for discrete actions, MSE for continuous actions).
- Inference: During inference, a target return (e.g., the maximum possible return) is specified to trigger the LLM to generate optimal actions (a simplified training-loop sketch follows Figure 3 below).
The figure is a bar chart comparing training time on the ABR and CJS tasks between standard RL and NetLLM, broken down into parameter updating and experience collection (e.g., 29.03 hours (99.63%) versus 0.11 hours (0.37%)); standard RL takes much longer overall, highlighting NetLLM's efficiency.
Figure 3 (from the original paper): Using standard RL techniques [86, 93] to adapt the LLM for RL-based decision-making tasks (ABR and CJS) incurs high training time due to the active environment interaction for experience collection. NetLLM eliminates this time-consuming process by designing an efficient data-driven adaptation pipeline in the DD-LRNA scheme.
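The sketch below illustrates the data-driven (offline) RL training loop described above under assumed data structures: returns-to-go are computed from a pre-collected trajectory, a context window of (return, state, action) steps is sampled, and the model is trained with cross entropy to reproduce the recorded discrete actions. The TinyPolicy class merely stands in for the adapted LLM (multimodal encoder, frozen LLM with low-rank matrices, and networking head); it is not the paper's code.

```python
# Simplified offline-RL (data-driven) training sketch under assumed structures.
import random
import torch
import torch.nn.functional as F

class TinyPolicy(torch.nn.Module):
    """Stand-in for the adapted LLM (encoder + frozen LLM + head + low-rank adapters)."""
    def __init__(self, state_dim=4, num_actions=3):
        super().__init__()
        self.fc = torch.nn.Linear(state_dim + 1, num_actions)

    def forward(self, returns, states):
        # returns: (w, 1), states: (w, state_dim) -> action logits of shape (w, num_actions)
        return self.fc(torch.cat([returns, states], dim=-1))

def returns_to_go(rewards):
    # R_t = sum_{i=t}^T r_i, computed with a reverse cumulative sum.
    total, out = 0.0, []
    for r in reversed(rewards):
        total += r
        out.append(total)
    return list(reversed(out))

def train_step(model, optimizer, trajectory, w=10):
    # trajectory: list of dicts {"R": float, "state": tensor, "action": int}
    t = random.randint(w, len(trajectory))          # sample the end of a context window
    window = trajectory[t - w:t]
    returns = torch.tensor([[step["R"]] for step in window])
    states = torch.stack([step["state"] for step in window])
    actions = torch.tensor([step["action"] for step in window])
    logits = model(returns, states)
    loss = F.cross_entropy(logits, actions)         # imitate the recorded (offline) actions
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Toy offline dataset: one trajectory collected once by an existing (non-LLM) policy.
rewards = [1.0, 0.5, 2.0, 1.5, 0.0, 1.0, 0.5, 2.0, 1.5, 0.0, 1.0, 0.5]
traj = [{"R": R, "state": torch.randn(4), "action": i % 3}
        for i, R in enumerate(returns_to_go(rewards))]
model = TinyPolicy()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
print(train_step(model, opt, traj, w=10))
```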
4.3.2. Low-Rank Networking Adaptation
This component tackles the prohibitive computational costs of directly fine-tuning all parameters of a large LLM.
- Problem: Full-parameter fine-tuning of an LLM with initial parameters $\Phi_0$ requires computing an update $\Delta\Phi$ for all parameters, where the dimension of $\Delta\Phi$ equals $|\Phi_0|$. This is extremely resource-intensive (e.g., 65.88GB GPU memory and 7.9h training time for Llama2-7B on VP, as shown in Figure 4 from the original paper). It also risks disrupting the valuable pre-trained knowledge.
- Insight: The paper leverages the insight that parameter changes during adaptation ($\Delta\Phi$) often reside in an intrinsic low-rank subspace [4, 38].
- Mechanism:
- Frozen LLM Parameters: The pre-trained parameters of the core LLM ($\Phi_0$) are frozen to preserve its general knowledge.
- Low-Rank Matrices: For each pre-trained weight matrix $W$ of dimension $d \times k$, two additional low-rank matrices, $A$ (dimension $d \times r$) and $B$ (dimension $r \times k$), are introduced. These matrices approximate the desired parameter update $\Delta W$ such that $\Delta W \approx AB$. Here, $r$ is a low rank, where $r \ll \min(d, k)$ (see the sketch after Figure 4).
- Training: During adaptation, only the parameters of $A$ and $B$ are updated, while $W$ remains fixed.
- Benefits:
- Reduced Costs: This significantly reduces the number of trainable parameters (e.g., to 0.31% of total parameters), leading to substantial reductions in GPU memory (60.9%) and training time (15.1%).
- Knowledge Preservation: Freezing the main LLM parameters ensures that its extensive pre-trained knowledge is retained. This allows the same base LLM to serve as a foundation model across different networking tasks, with each task having its own set of low-rank matrices A, B to learn domain-specific knowledge.
The figure illustrates NetLLM's advantage over full-parameter fine-tuning: NetLLM trains only 0.31% of the parameters (versus 100%), consumes 27.24GB of GPU memory (versus 65.88GB), and takes 6.7 hours of training time (versus 7.9 hours).
Figure 4 (from the original paper): Illustration of the high adaptation costs of full-parameter fine-tuning [16, 92] on the VP task. The DD-LRNA scheme of NetLLM efficiently reduces the costs by introducing a set of small trainable low-rank matrices.
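The following is a minimal sketch of the low-rank adaptation mechanism referenced in the Low-Rank Matrices bullet above: the pre-trained weight W is frozen and only the small factors A and B are trained, so the effective weight becomes W + AB. It mirrors the general LoRA idea rather than NetLLM's actual code; the shapes and the rank r = 32 are illustrative.

```python
# A minimal sketch (not NetLLM's actual code) of low-rank adaptation: freeze W,
# train only A (d_out x r) and B (r x d_in), so the effective weight is W + AB.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # freeze the pre-trained W (and bias)
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, r))          # A: d_out x r
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)    # B: r x d_in

    def forward(self, x):
        delta_w = self.A @ self.B            # approximates the update Delta W = AB
        return self.base(x) + x @ delta_w.T  # effective weight: W + AB

layer = LowRankLinear(nn.Linear(4096, 4096), r=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")   # only A and B are trained
```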
4.3.3. Putting It All Together (DD-LRNA Scheme)
Figure 8 (from the original paper) summarizes the DD-LRNA scheme:
- The LLM's core parameters are frozen.
- Task-specific trainable low-rank matrices are allocated.
- For prediction tasks, standard SL datasets are used. For decision-making tasks, experience datasets are collected by existing algorithms without active environment interaction during LLM fine-tuning.
- During fine-tuning, mini-batches of data are sampled from the respective datasets.
- Data is fed through the multimodal encoder, processed by the LLM (with its frozen parameters and updated low-rank matrices), and outputs are generated by the networking head.
- The loss is computed using either $\mathcal{L}_{sl}$ (Eq. 1) or $\mathcal{L}_{rl}$ (Eq. 4).
- Gradients are propagated to update the low-rank matrices, the multimodal encoder parameters, and the networking head parameters, optimizing performance.
The figure is a schematic of NetLLM's data-driven low-rank networking adaptation scheme, split into two parts: the upper part depicts adaptation for prediction tasks (input data, low-rank matrices, and LLM updates), while the lower part depicts adaptation for decision-making tasks (existing policies, states, and generated actions); both parts update the trainable components by computing a loss.
Figure 8 (from the original paper): Illustration of the data-driven low-rank networking adaptation scheme of NetLLM.
4.4. Implementation
NetLLM is implemented in Python and Bash and designed for integration with existing Supervised Learning (SL) and Reinforcement Learning (RL) codebases.
4.4.1. APIs
As shown in Figure 9 (from the original paper), NetLLM exposes three key APIs for interaction:
- Adapt: This API triggers the fine-tuning process. It uses a provided dataset to adapt the LLM for a target task and returns a snapshot of the adapted LLM.
- Test: Evaluates the performance of the adapted LLM on specified testing environments, generated according to given simulation settings.
- RL_Collect: For RL-based tasks that lack an existing dataset, this API collects an experience dataset by running given RL policies (existing networking algorithms) to interact with the environments. The collected dataset can then be fed into the Adapt API (a hypothetical usage sketch follows Figure 9).
The figure is a schematic showing how NetLLM integrates with an existing SL/RL codebase for LLM adaptation: flow arrows connect adaptation, performance testing, and data collection; the left side depicts the NetLLM adapter, and the right side lists the codebase components required for integration.
Figure 9 (from the original paper): Components and interfaces needed to integrate NetLLM with an existing SL/RL codebase for LLM adaptation.
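The snippet below is not NetLLM's real API; it defines stand-in stubs with assumed names and signatures purely to illustrate how the three interfaces described above (Adapt, Test, RL_Collect) would compose in a typical workflow: collect experiences once with an existing policy, adapt the LLM on the resulting dataset, then test the adapted snapshot.

```python
# Hypothetical stand-in stubs (assumed signatures, NOT NetLLM's real API) that
# only illustrate how the Adapt / Test / RL_Collect workflow would compose.
def rl_collect(policy, env_settings):
    """Run an existing policy in the simulator and return an experience dataset."""
    return [{"state": None, "action": 0, "reward": 0.0}]   # placeholder experiences

def adapt(llm, task, dataset):
    """Fine-tune the (frozen) LLM with DD-LRNA on the dataset; return a snapshot."""
    return {"llm": llm, "task": task, "num_samples": len(dataset)}

def test(snapshot, simulation_settings):
    """Evaluate an adapted-LLM snapshot in environments built from the settings."""
    return {"setting": simulation_settings, "metric": 0.0}

# Decision-making task (ABR): collect experiences once, adapt, then evaluate.
experience = rl_collect(policy="genet", env_settings="abr_default")
snapshot = adapt(llm="llama2-7b", task="abr", dataset=experience)
report = test(snapshot, simulation_settings="abr_unseen_setting1")
```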
4.4.2. Specifics
- Integration: NetLLM has been integrated into existing codebases for VP [99], ABR [103], and CJS [30].
- Multimodal Encoder Details:
- ViT: Used for images, with its pre-trained parameters typically frozen.
- 1D-CNN: For time-series and sequence data.
- Fully connected layer: For scalar data.
- GNN: For graph information (e.g., DAGs in CJS).
- All multimodal encoder components (except the frozen ViT) are trainable.
- Networking Head Customization:
- VP Head: Three neurons output the roll, pitch, and yaw values for viewport coordinates.
- ABR Head: Outputs a probability distribution over candidate bitrates.
- CJS Heads: Two heads, one for determining the next job stage and another for allocating executor resources to that stage.
- DD-LRNA Hyperparameters:
- Context Window (w): For RL return-distribution learning, w is set to 10 for ABR and 20 for CJS.
- Low Rank (r): The rank of the low-rank matrices is set to 32 for VP, and 128 for ABR and CJS. The authors note that NetLLM performs well across a range of values for w and r.
- Experience Collection for RL: GENET [103] and Decima [63] are used to collect experience datasets for ABR and CJS, respectively. The dataset collection is a one-time process.
5. Experimental Setup
5.1. Datasets
The experiments utilize a combination of real-world and synthetic datasets for training and testing across the three networking tasks (VP, ABR, CJS).
5.1.1. Viewport Prediction (VP)
The following reproduces Table 2 of the original paper, which lists the VP training and testing settings:
| Setting | Viewport Dataset | Prediction Setup |
|---|---|---|
| default train | Jin2022 | hw = 2s, pw = 4s |
| default test | Jin2022 | hw = 2s, pw = 4s |
| unseen setting1 | Jin2022 | hw = 4s, pw = 6s |
| unseen setting2 | Wu2017 | hw = 2s, pw = 4s |
| unseen setting3 | Wu2017 | hw = 4s, pw = 6s |
- Jin2022 [43]: A large-scale immersive video viewport dataset recording traces from 84 viewers watching 27 60-second videos.
- Default Setup: 15 videos and 42 viewers for training, 6 videos/21 viewers for validation, 6 videos/21 viewers for testing (882 traces total).
- hw (historical window): 2 seconds; pw (prediction window): 4 seconds.
- Wu2017 [97]: Another dataset containing 9 videos (average length 242 seconds) watched by 48 viewers.
- Unseen Settings: 4 videos and 9 viewers sampled for testing generalization (36 long viewport traces).
- Prediction Setup: hw ranges from 2s to 4s across settings (see Table 2). Increasing pw (e.g., to 6s in unseen setting1) increases prediction difficulty.
- Rationale: These datasets cover diverse viewer behaviors and video content, allowing for robust evaluation of viewport prediction models under various conditions and prediction challenges.
5.1.2. Adaptive Bitrate (ABR) Streaming
The following reproduces Table 3 of the original paper, which lists the ABR training and testing settings:
| Setting | Video Dataset | Bandwidth Traces |
|---|---|---|
| default train | Envivio-Dash3 | FCC |
| default test | Envivio-Dash3 | FCC |
| unseen setting1 | Envivio-Dash3 | SynthTrace |
| unseen setting2 | SynthVideo | FCC |
| unseen setting3 | SynthVideo | SynthTrace |
- Envivio-Dash3: A video from the DASH-264 JavaScript reference client [25], following the format used by GENET [103] and Pensieve [62].
- FCC [18]: Broadband traces used as the default bandwidth dataset.
- Default Setup: 235 traces for training, 150 for validation (as used by GENET), and 100 randomly sampled traces for testing. Over 90 hours of bandwidth traces.
- SynthVideo: A synthetic video, generated following the method in Pensieve [62], with a larger bitrate than Envivio-Dash3. Used for unseen settings.
- SynthTrace: A synthetic bandwidth dataset, generated using the Pensieve method [62], comprising 100 traces with a larger bandwidth range and more dynamic fluctuation patterns than FCC. Used for unseen settings.
- Rationale: The combination of real-world (FCC, Envivio-Dash3) and synthetic (SynthVideo, SynthTrace) datasets allows evaluation under both typical and more challenging/dynamic network and video content conditions, testing generalization capabilities.
5.1.3. Cluster Job Scheduling (CJS)
The following reproduces Table 4 of the original paper, which lists the CJS training and testing settings:
| Setting | Job Requests | Executor Resources (k) |
|---|---|---|
| default train | 200 | 50 |
| default test | 200 | 50 |
| unseen setting1 | 200 | 30 |
| unseen setting2 | 450 | 50 |
| unseen setting3 | 450 | 30 |
- TPC-H [14]: A real-world dataset containing job requests with large data volumes, high executor demands, and high complexity.
- Default Setup: Number of job requests set to 200, executor resources to 50k units (consistent with pre-trained Decima [30]). Job requests in the default testing setting are different from training using different random seeds.
- Unseen Workloads: For generalization, harder workloads are simulated:
- unseen setting1: Reduced executor resources (30k units) for 200 job requests.
- unseen setting2: Increased job requests (450) with 50k executor resources.
- unseen setting3: Increased job requests (450) and reduced executor resources (30k units).
- Rationale: The TPC-H dataset provides realistic job characteristics, and varying job requests and executor resources allow testing the scheduler's performance under different resource constraints and workload intensities, crucial for evaluating robustness and scalability.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a complete explanation is provided below:
5.2.1. Mean Absolute Error (MAE) for Viewport Prediction (VP)
- Conceptual Definition: MAE is a measure of the average magnitude of the errors in a set of predictions, without considering their direction. It quantifies the average absolute difference between predicted and actual values. In the context of viewport prediction, it indicates how far, on average, the predicted viewport coordinates are from the true viewport coordinates. A lower MAE signifies better prediction accuracy.
- Mathematical Formula: $ MAE = \frac{1}{H} \sum_{t=1}^H \frac{| \alpha_t^p - \alpha_t^g | + | \beta_t^p - \beta_t^g | + | \zeta_t^p - \zeta_t^g |}{3} $
- Symbol Explanation:
- MAE: Mean Absolute Error.
- $H$: The prediction horizon, representing the total number of time steps (or samples) over which predictions are made.
- $t$: An index for a specific time step within the prediction horizon.
- $(\alpha_t^p, \beta_t^p, \zeta_t^p)$: The predicted viewport coordinate at time step $t$.
- $(\alpha_t^g, \beta_t^g, \zeta_t^g)$: The ground-truth (actual) viewport coordinate at time step $t$.
- $\alpha$: Represents the roll value of the viewport coordinate.
- $\beta$: Represents the pitch value of the viewport coordinate.
- $\zeta$: Represents the yaw value of the viewport coordinate.
- $|\cdot|$: Denotes the absolute value.
- The division by 3 averages the absolute errors across the three coordinate components (roll, pitch, yaw) for each time step, the outer summation sums these averages over the prediction horizon $H$, and the multiplication by $\frac{1}{H}$ yields the overall mean of these errors (a short computation sketch follows).
5.2.2. Quality of Experience (QoE) for Adaptive Bitrate (ABR) Streaming
- Conceptual Definition: QoE is a composite metric used to evaluate the user's overall satisfaction with a streaming video experience. It typically balances positive factors like high video quality (bitrate) with negative factors like interruptions (rebuffering) and sudden quality changes (bitrate fluctuations). A higher QoE score indicates a better user experience.
- Mathematical Formula: $ QoE = \frac{\sum_{i=1}^C (Bitrate_i - \lambda Rebuf_i - \gamma BitrateChange_i)}{C} $
- Symbol Explanation:
- QoE: Quality of Experience.
- $C$: The total number of video chunks (segments) in the video being streamed.
- $i$: An index for a specific video chunk.
- $Bitrate_i$: The bitrate (in Mbps) of chunk $i$ that was delivered to the user. A higher bitrate generally means higher video quality.
- $Rebuf_i$: The rebuffering time (in seconds) experienced during the download or playback of chunk $i$. Rebuffering is a significant negative factor for QoE.
- $BitrateChange_i$: The absolute change in bitrate (in Mbps) between chunk $i$ and the previously played chunk. Frequent or large bitrate changes can be perceived negatively by users. If $i = 1$, $BitrateChange_1$ is typically 0.
- $\lambda$: A weight parameter that scales the impact of rebuffering time on the QoE; a higher $\lambda$ indicates a stronger penalty for rebuffering. The paper sets it to a fixed constant.
- $\gamma$: A weight parameter that scales the impact of bitrate changes on the QoE; a higher $\gamma$ indicates a stronger penalty for bitrate fluctuations. The paper sets it to a fixed constant.
- The summation accumulates the weighted scores for all chunks, and the division by $C$ computes the average QoE per chunk over the entire video (a short computation sketch follows).
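A small sketch of the QoE computation above for a toy chunk trace. The weights lambda_ and gamma default to 1.0 purely as placeholders; the paper fixes them to constants that are not restated in this summary.

```python
# Illustrative QoE computation (toy trace; lambda_ and gamma are placeholders).
def qoe(bitrates_mbps, rebuf_s, lambda_=1.0, gamma=1.0):
    # bitrates_mbps: per-chunk bitrate; rebuf_s: per-chunk rebuffering time (s)
    total, prev = 0.0, None
    for b, r in zip(bitrates_mbps, rebuf_s):
        change = abs(b - prev) if prev is not None else 0.0   # bitrate fluctuation
        total += b - lambda_ * r - gamma * change
        prev = b
    return total / len(bitrates_mbps)   # average QoE per chunk

print(qoe([0.75, 2.85, 2.85, 4.3], [0.0, 0.5, 0.0, 0.0]))
```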
5.2.3. Job Completion Time (JCT) for Cluster Job Scheduling (CJS)
- Conceptual Definition: JCT is a direct measure of how long it takes for a job to finish its execution within a computing cluster, from its arrival to its completion. It is a critical performance metric in job scheduling, as minimizing JCT often implies efficient resource utilization and faster processing of workloads. A lower JCT indicates better scheduling performance.
- Mathematical Formula: $ JCT = t_e - t_s $
- Symbol Explanation:
- JCT: Job Completion Time.
- $t_e$: The finishing time of a job (the timestamp when the job completes all its stages).
- $t_s$: The arrival time of a job (the timestamp when the job is submitted to the scheduler).
5.3. Baselines
The paper compares NetLLM's adapted LLM against three state-of-the-art learning-based algorithms and representative rule-based (non-DNN) algorithms for each task. The choice of baselines is justified by their open-source availability and representativeness in their respective domains.
5.3.1. Baselines for Viewport Prediction (VP)
- TRACK [85]:
- Type: Learning-based (State-of-the-Art).
- Description: A deep neural network model based on Long Short-Term Memory (LSTM) architecture. It uses both historical viewer viewports and video saliency maps (regions of interest in the video) as input to achieve high prediction performance.
- Implementation: Open-source Keras code was carefully converted to PyTorch for compatibility. Retrained from scratch with original hyperparameters.
- Linear Regression (LR) [80]:
- Type: Rule-based / Statistical.
- Description: Assumes viewport movement follows a linear function over time and uses linear regression to estimate this function for prediction.
- Implementation: Custom implementation based on the paper's ideas.
- Velocity-based Prediction (Velocity) [24]:
- Type: Rule-based / Heuristic.
- Description: Calculates the moving speed of a viewer's historical viewports and extrapolates this velocity to estimate future viewport positions.
- Implementation: Custom implementation based on the paper's ideas.
5.3.2. Baselines for Adaptive Bitrate (ABR) Streaming
- GENET [103]:
- Type: Learning-based (State-of-the-Art RL).
- Description: An RL-based streaming algorithm that improves upon Pensieve [62] by introducing a curriculum learning technique to facilitate RL training and improve convergence performance.
- Implementation: Open-source codebase used. Pre-trained model weights utilized.
- BBA (Buffer-Based Adaptation) [39]:
- Type: Rule-based / Heuristic.
- Description: A classic buffer-based ABR algorithm that primarily uses playback buffer occupancy as a critical signal to make bitrate decisions, aiming to maintain the buffer at a desired level.
- Implementation: Included in the GENET open-source codebase.
- MPC (Model Predictive Control) [107]:
- Type: Rule-based / Optimization.
- Description: A control-theoretic approach that optimizes a given QoE metric over a future chunk horizon. It uses estimates of network throughput and buffer occupancy to make bitrate decisions.
- Implementation: Included in the GENET open-source codebase.
5.3.3. Baselines for Cluster Job Scheduling (CJS)
- Decima [63]:
- Type: Learning-based (State-of-the-Art RL).
- Description: An RL model for job scheduling in distributed computing clusters. It uses a Graph Neural Network (GNN) to process DAG information (job properties, resource demands, dependencies) for efficient scheduling.
- Implementation: PyTorch re-implementation [30] used due to original's outdated nature. Pre-trained model weights utilized.
- First-In-First-Out (FIFO) [87]:
- Type: Rule-based / Deterministic.
- Description: A common scheduling algorithm (used in systems like Apache Spark) that schedules jobs in the strict order of their arrival and allocates requested resources.
- Implementation: Adopted from the same source as Decima [30].
- Fair Scheduling (Fair) [87]:
- Type: Rule-based / Heuristic.
- Description: Another common scheduling algorithm (in Apache Spark) that schedules jobs in a "round robin" fashion, attempting to ensure each job receives a roughly equal share of cluster resources over time.
- Implementation: Adopted from the same source as Decima [30].
5.4. Hardware Settings
Experiments were conducted on a Linux server equipped with:
- CPUs: Eight Intel(R) Xeon(R) Gold 5318Y CPUs.
- GPUs: Two NVIDIA 40GB A100 GPUs.
5.5. Real-world ABR Testbed Setup
A real-world client-server ABR system was used to test NetLLM-adapted Llama2, leveraging the testbed from GENET [103].
- Client Player: dash.js (version 2.4) modified to support BBA, MPC, GENET, and the adapted Llama2. The client runs in a Google Chrome browser (version 87).
- Video Server: Apache version 2.7, running on the same machine as the client.
- Network Emulation: Mahimahi [70] is used to emulate various network environments, including broadband traces [18] and cellular mobile traces [84]. An 80ms Round Trip Time (RTT) was simulated between the client and server. 100 traces from both broadband and cellular datasets were randomly sampled for emulation.
- Rationale: This setup allows for evaluation in realistic and diverse network conditions, verifying the generalization of NetLLM to practical deployment scenarios.
6. Results & Analysis
The evaluation focuses on demonstrating NetLLM's effectiveness in LLM adaptation for networking, showcasing its superior performance and generalization compared to state-of-the-art (SOTA) algorithms.
6.1. Core Results Analysis
6.1.1. General Evaluation (Same Settings as Training Environments)
The following figure (Figure 10 from the original paper) presents the performance of each method for the corresponding tasks in testing environments generated with the same settings as training environments:
The figure is a chart comparing the average performance of NetLLM-adapted Llama2 against baselines on the three networking tasks (VP, ABR, and CJS), with results across different random seeds showing NetLLM outperforming the conventional algorithms on every task.
Figure 10 (from the original paper): Comparing NetLLM-adapted Llama2 for VP, ABR, and CJS, with baselines in testing environments generated with the same settings as training environments.
As shown in Figure 10, the NetLLM-adapted Llama2 consistently outperforms all other methods across all three tasks:
- Viewport Prediction (VP): NetLLM reduces Mean Absolute Error (MAE) by 10.1% to 36.6% compared to baselines. The Cumulative Distribution Function (CDF) for VP shows that a large proportion of NetLLM's predictions have lower MAE.
- Adaptive Bitrate Streaming (ABR): NetLLM improves Quality of Experience (QoE) by 14.5% to 36.6% compared to baselines. The CDF for ABR indicates NetLLM achieves higher QoE scores more frequently.
- Cluster Job Scheduling (CJS): NetLLM reduces Job Completion Time (JCT) by 6.8% to 41.3% compared to baselines. The CDF for CJS demonstrates NetLLM leads to lower JCTs; for instance, the 90th percentile JCT for Llama2 is 97.3 seconds, significantly lower than Decima (109.3s), Fair (135.6s), and FIFO (187.5s).
Analysis: The results strongly validate NetLLM's effectiveness. Learning-based algorithms (TRACK, GENET, Decima) generally outperform traditional rule-based algorithms (LR, Velocity, BBA, MPC, FIFO, Fair), attributable to DNNs' strong function approximation capabilities. However, NetLLM-adapted LLMs demonstrate even greater power due to the LLM's large parameter size and pre-trained knowledge, enabling superior function approximation, pattern mining, and long-term planning abilities. A key advantage of NetLLM is its ability to use a single LLM as a foundation model across diverse networking tasks without architectural modifications, significantly reducing model engineering overhead compared to specialized DNN designs.
6.1.2. Generalization (Unseen Environments)
The following figure (Figure 11 from the original paper) compares the generalization performance of NetLLM-adapted Llama2 for VP, ABR, and CJS, with baselines in testing environments generated with settings different from training environments:
The figure is a chart comparing NetLLM's performance on the three networking tasks (VP, ABR, and CJS) under different unseen settings; each subplot is a box plot showing the distribution for each algorithm, with the triangle marking the average.
Figure 11 (from the original paper): Comparing the generalization performance of NetLLM-adapted Llama2 for VP, ABR, and CJS, with baselines in testing environments generated with settings different from training environments. The shape of box shows the distribution and the triangle in each box denotes average.
Figure 11 illustrates NetLLM's superior generalization performance in unseen environments:
- NetLLM-adapted Llama2 consistently outperforms baselines in terms of average values and distributions across all unseen settings.
- It reduces MAE by 1.7% to 9.1% for VP, improves QoE by 3.9% to 24.8% for ABR, and reduces JCT by 2.5% to 6.8% for CJS on average, compared to learning-based algorithms.
Detailed ABR Generalization Analysis (Figure 12): Figure 12 from the original paper breaks down the QoE scores of all ABR methods for a more detailed analysis:
The figure is a bar chart of average QoE scores under broadband and cellular networks, comparing BBA, MPC, GENET, and NetLLM; NetLLM scores highest in both categories, indicating its superior performance.
Figure 13 (from the original paper): Comparing NetLLM-adapted Llama2 with baselines for ABR on real-world environments with different network connections.
Figure 12 (from the original paper; note that the image embedded above carries the Figure 13 caption) provides a detailed breakdown of QoE scores for ABR across unseen settings, which is crucial for understanding generalization in ABR.
- Unseen Setting 1 (Different Streaming Video): GENET, a learning-based algorithm, is surpassed by the rule-based MPC, achieving 5.2% lower average QoE. This indicates GENET's struggle to optimize bitrates when the video content differs from its training data.
- Unseen Setting 2 (More Dynamic Bandwidth Fluctuations): GENET performs even worse than MPC, with 5.9% lower average QoE. It struggles to adapt to dynamic bandwidth changes, potentially selecting high bitrates inappropriately and leading to high rebuffering times.
- NetLLM's Performance: In contrast, the NetLLM-adapted Llama2 maintains the highest QoE scores across all unseen settings for ABR. It demonstrates a good balance between bitrate, rebuffering, and bitrate changes.
Real-World Tests (Figure 14): The following figure (Figure 14 from the original paper) compares NetLLM-adapted Llama2 with baselines for ABR on real-world environments with different network connections:
The figure is a chart comparing different models on viewport prediction (VP) and adaptive bitrate streaming (ABR): the left side shows average MAE scores for VP and the right side shows average QoE scores for ABR, with each model's performance presented as bars.
Figure 14 (from the original paper): Comparing NetLLM-adapted Llama2 with baselines for ABR on real-world environments with different network connections.
Real-world tests in a client-server ABR system (using Mahimahi to emulate network conditions) show that the NetLLM-adapted Llama2 outperforms baselines on both broadband and cellular networks. This confirms its ability to generalize to practical, real-world scenarios.
Overall Generalization Analysis: These results underscore that conventional DNN models often exhibit generalization issues in unseen environments. NetLLM, by effectively leveraging the LLM's extensive pre-trained knowledge and emergent abilities, provides a robust solution for stronger generalization, which is critical for real-world deployment in dynamic networking environments.
6.2. Data Presentation (Tables)
The following reproduces Table 1 of the original paper, which summarizes the formulation of the three tasks:
| Task | DNN Input | DNN Output | Objective | Learning Paradigm |
|---|---|---|---|---|
| Viewport Prediction (VP) | time-series: historical viewports; image: video content information | future viewports | minimize error between predicted and actual viewports | SL |
| Adaptive Bitrate Streaming (ABR) | time-series: historical throughputs, delay; sequence: chunk sizes at different bitrates; scalar: current buffer length | bitrate selected for the next video chunk | maximize user's Quality of Experience (QoE) | RL |
| Cluster Job Scheduling (CJS) | graph: DAGs describing dependency and resource demands of job execution stages | job stage to run next, number of executors allocated to the stage | minimize job completion time | RL |
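To make the table's heterogeneous inputs more concrete, below is a minimal PyTorch-style sketch of how per-modality features could be projected into a shared, token-like space for a frozen LLM, in the spirit of NetLLM's multimodal encoder. The encoder choice (a 1D CNN for time-series), the hidden size of 4096, and all tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps features from an off-the-shelf modality encoder into the LLM's hidden space."""
    def __init__(self, feat_dim: int, llm_hidden: int = 4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_hidden)  # trainable linear projection

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, feat_dim) -> (batch, num_tokens, llm_hidden)
        return self.proj(feats)

# Example for the ABR inputs in Table 1: time-series throughput/delay plus a scalar buffer length.
ts_encoder = nn.Conv1d(in_channels=2, out_channels=128, kernel_size=3, padding=1)  # stand-in encoder
ts_proj = ModalityProjector(feat_dim=128)
scalar_proj = ModalityProjector(feat_dim=1)

throughput_delay = torch.randn(1, 2, 8)      # last 8 measurements of 2 signals
buffer_len = torch.tensor([[[4.2]]])         # seconds of buffered video, shape (1, 1, 1)

ts_tokens = ts_proj(ts_encoder(throughput_delay).transpose(1, 2))   # (1, 8, 4096)
scalar_token = scalar_proj(buffer_len)                              # (1, 1, 4096)
llm_inputs = torch.cat([ts_tokens, scalar_token], dim=1)            # token-like features for a frozen LLM
print(llm_inputs.shape)                                             # torch.Size([1, 9, 4096])
```

The same pattern would apply to the VP and CJS rows of the table, with an image or graph encoder in place of the 1D CNN.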
6.3. Ablation Studies / Parameter Analysis
6.3.1. Importance of Pre-trained and Domain Knowledge
The following figure (Figure 13 from the original paper) explores the importance of pre-trained and learned domain-specific knowledge of LLM in networking adaptation:
The figure examines the importance of pre-trained knowledge and domain-specific knowledge in networking adaptation. It contains three panels comparing average MAE, average QoE, and average job completion time (JCT) under the no-pre-trained-knowledge, no-domain-knowledge, and full-knowledge conditions; performance with full knowledge is significantly better than under the other two conditions.
Figure 13 (from the original paper): Exploring the importance of pre-trained and learned domain-specific knowledge of LLM in networking adaptation.
To understand why LLMs are effective, the authors investigated the impact of pre-trained and learned domain knowledge using Llama2-7B:
- Absence of Pre-trained Knowledge: When Llama2's weights were randomly initialized (disabling pre-trained knowledge) and trained from scratch, a dramatic performance drop was observed across all tasks (the "No Pre-trained" bars). This indicates that the LLM's pre-trained knowledge, including emergent abilities such as planning and pattern mining, is crucial and transferable to networking tasks (e.g., pattern mining for viewport prediction).
- Absence of Domain Knowledge: When the low-rank matrices (which hold the learned domain knowledge) were disabled while preserving pre-trained knowledge, performance degraded across tasks (the "No Domain-specific" bars). This highlights the necessity of NetLLM's mechanism for acquiring domain-specific knowledge to reach optimal performance.
Analysis: Both the LLM's extensive pre-trained knowledge and the task-specific domain knowledge learned through NetLLM are essential for its success in networking adaptation; a minimal sketch of the low-rank mechanism behind these two ablations follows.
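The sketch below illustrates, in LoRA-style form (an assumption about the mechanism's shape; the paper's DD-LRNA implementation may differ in detail), what the two ablations correspond to: disabling the low-rank term removes the learned domain knowledge, while randomly re-initializing the base weights removes the pre-trained knowledge.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """A frozen pre-trained linear layer augmented with trainable low-rank factors B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # pre-trained knowledge stays frozen
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # trainable low-rank factor
        self.B = nn.Parameter(torch.zeros(out_f, rank))        # trainable low-rank factor
        self.use_domain_knowledge = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.base(x)                                       # contribution of pre-trained weights
        if self.use_domain_knowledge:
            y = y + x @ (self.B @ self.A).T                    # learned domain-specific update
        return y

layer = LowRankLinear(nn.Linear(4096, 4096))
layer.use_domain_knowledge = False       # emulates the "No Domain-specific" ablation
print(layer(torch.randn(1, 4096)).shape) # torch.Size([1, 4096])
```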
6.3.2. Impacts of Different Types of LLMs
The following figure (Figure 15 from the original paper) compares the performance of different LLMs adapted by NetLLM for VP and ABR with learning-based algorithms:
Figure 15 (from the original paper): Comparing the performance of different LLMs adapted by NetLLM for VP and ABR with learning-based algorithms.
NetLLM's compatibility was tested by adapting three other LLMs (OPT, Mistral, LLaVa), all at 7B parameters, for VP and ABR tasks:
- All adapted LLMs (OPT, Mistral, LLaVa) outperformed state-of-the-art learning-based algorithms on both VP and ABR. This confirms NetLLM's generic framework and compatibility across different LLM architectures.
- Interestingly, the multimodal LLaVa performed worse than the single-modal Llama2. This suggests that the multimodal fusion knowledge acquired by LLaVa during its pre-training (on image and text corpora) might not be directly beneficial or optimally aligned for the specific multimodal data patterns encountered in networking.
6.3.3. Impacts of Different Sizes of LLMs
The following figure (Figure 16 from the original paper) explores the impacts of LLM sizes in networking adaptation, with OPT [108] as the foundation model:
The figure shows the impact of LLM size on networking adaptation, with viewport prediction (VP) on the left and adaptive bitrate streaming (ABR) on the right; the x-axis is the LLM parameter size (billions), and the y-axes show average MAE and average QoE, with the baselines (LR, Velocity, TRACK, BBA, MPC, GENET) plotted for comparison.
Figure 16 (from the original paper): Exploring the impacts of LLM sizes in networking adaptation, with OPT [108] as the foundation model.
The investigation into the impact of LLM size used OPT [108] with varying parameter counts:
- When the parameter size exceeded 1 billion (1B), the adapted OPT models (e.g., OPT-1.3B, OPT-2.7B, OPT-7B) achieved superior or comparable performance to advanced learning-based algorithms.
- However, OPT-0.35B performed significantly worse than all baselines on the ABR task, indicating that very small LLMs may lack sufficient common knowledge to generalize effectively across tasks.
Analysis: This suggests that LLMs with parameter sizes greater than 1B are generally suitable for networking adaptation, while those smaller than 1B might not be optimal, implying a threshold for effective knowledge transfer.
6.3.4. Computation Overhead
The overhead of deploying NetLLM-adapted LLMs was profiled:
- Llama2-7B: Requires 29GB of memory and takes approximately 0.1s to 0.3s to generate one answer.
- OPT-1.3B: Requires only 7GB of memory and takes about 0.04s to generate one answer, which is acceptable for many networking tasks and can be accommodated by commercial GPUs (e.g., a 10GB NVIDIA 3080).
Analysis: While LLMs can be resource-intensive, smaller LLMs (like OPT-1.3B) can provide a good trade-off between performance and overhead, making them practical for many networking applications. Further reduction in overhead is possible through model compression techniques.
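As a sanity check on these numbers, a back-of-envelope estimate assuming fp32 weights (an assumption; the deployment precision is not stated here) lands close to the reported footprints once activations and runtime overhead are added.

```python
# Back-of-envelope weight-memory estimate, assuming fp32 parameters (4 bytes each).
# The reported figures also include activations and framework overhead.
def weight_memory_gb(num_params: float, bytes_per_param: int = 4) -> float:
    return num_params * bytes_per_param / 1e9

print(f"Llama2-7B fp32 weights: ~{weight_memory_gb(7e9):.0f} GB")    # ~28 GB vs. the reported 29 GB
print(f"OPT-1.3B fp32 weights:  ~{weight_memory_gb(1.3e9):.1f} GB")  # ~5.2 GB vs. the reported 7 GB
```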
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper presents NetLLM, a pioneering framework for adapting Large Language Models (LLMs) to solve complex networking problems. The core motivation stems from the limitations of existing deep learning (DL) solutions in networking, namely high model engineering costs and poor generalization. NetLLM addresses this by envisioning LLMs as foundation models capable of "one model for all tasks."
The framework overcomes three key challenges:
- Large Modality Gap: Handled by a multimodal encoder that projects diverse networking data (time-series, images, graphs, scalars) into the LLM's token-like feature space using existing modality-specific encoders and linear projections.
- Inefficiency of Token-based Answer Generation: Solved by a networking head that replaces the traditional LM head, enabling direct, single-inference generation of valid, task-specific answers, thus preventing hallucination and reducing latency (see the sketch after this list).
- High Adaptation Costs: Mitigated by Data-Driven Low-Rank Networking Adaptation (DD-LRNA). This scheme employs an offline data-driven pipeline for both supervised learning (SL) and reinforcement learning (RL) tasks (eliminating costly environment interaction in RL) and uses low-rank matrices for parameter-efficient fine-tuning, significantly reducing computational resources while preserving the LLM's pre-trained knowledge.

Evaluations across three representative networking tasks—viewport prediction, adaptive bitrate streaming, and cluster job scheduling—demonstrate that NetLLM-adapted LLMs significantly outperform state-of-the-art algorithms in terms of both performance and generalization to unseen environments. The study also confirms the importance of both pre-trained and domain-specific knowledge for LLM success in networking and explores the compatibility with different LLM types and sizes.
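As referenced in the list above, the following is a minimal sketch of the networking-head idea for ABR (the output dimension, readout position, and hidden size are illustrative assumptions): a small linear layer over the LLM's final hidden state yields exactly one valid, task-specific answer per forward pass, rather than decoding free-form tokens.

```python
import torch
import torch.nn as nn

class BitrateHead(nn.Module):
    """Task-specific head: maps the LLM's final hidden state to one valid ABR action."""
    def __init__(self, llm_hidden: int = 4096, num_bitrates: int = 6):
        super().__init__()
        self.head = nn.Linear(llm_hidden, num_bitrates)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, llm_hidden) produced by the frozen LLM backbone
        logits = self.head(hidden_states[:, -1, :])   # read out the last position in one pass
        return logits.argmax(dim=-1)                  # index of the selected bitrate level

head = BitrateHead()
fake_hidden = torch.randn(1, 9, 4096)   # stands in for the LLM's output hidden states
print(head(fake_hidden))                # e.g., tensor([3]) -> the 4th bitrate level
```

Because the output space is fixed to the valid action set, the model cannot emit malformed or out-of-range answers, and only a single forward pass is needed per decision.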
NetLLM thus serves as a significant stepping stone towards a more sustainable design philosophy for future networking algorithms, leveraging the powerful capabilities of LLMs with low adaptation efforts.
7.2. Limitations & Future Work
The authors discuss several considerations and avenues for future research:
- Customization Efforts: While NetLLM reduces ad-hoc design costs, adapting it to new networking tasks still requires creating a new networking head (a simple linear layer) and selecting/integrating modality-specific feature encoders for new data types. However, the authors note that ongoing advancements in multimodal LLMs may eventually provide more generic, built-in encoders, further simplifying this step.
- Comparison with Retrieval-Augmented Generation (RAG): The paper argues that RAG (which uses an external knowledge base of explicit textual information) is challenging to apply to networking, because networking knowledge is often abstract and implicit (e.g., ABR policies) and difficult to represent as plain text. NetLLM's DD-LRNA offers an alternative by enabling LLMs to learn this implicit domain knowledge efficiently.
- Reducing LLM Computation Overhead: The authors acknowledge that LLMs, even with NetLLM's optimizations, still incur computational overhead during inference. They suggest integrating existing model compression techniques (e.g., pruning [26], quantization [27], knowledge distillation [69]) into NetLLM to further reduce memory and latency. This trade-off between performance and resource consumption is left for future exploration.
- LLM Explainability in Networking: A crucial future direction is to investigate the internal working mechanisms of LLMs in networking to improve their explainability. Understanding why LLMs succeed in these tasks (beyond general pattern mining or planning abilities) is vital for building more reliable and trustworthy LLM-based networking systems.
7.3. Personal Insights & Critique
- Innovation and Vision: NetLLM is a highly innovative and timely work. It effectively translates the "foundation model" paradigm from NLP to networking, which is a significant conceptual leap. The explicit articulation of the three main challenges (modality gap, generation inefficiency, adaptation costs) and the concrete solution for each is a strong point. The vision of "one model for all tasks" with LLMs is compelling for reducing engineering overhead in a domain traditionally plagued by task-specific solutions.
- Pragmatic Design Choices: The framework makes pragmatic choices, such as reusing existing SOTA feature encoders for different modalities. This avoids reinventing the wheel and focuses the innovation on integrating these components effectively with the LLM. The networking head is a clever and simple solution to the hallucination and latency problems of standard LLM outputs in a critical, real-time domain.
- Effectiveness of DD-LRNA: The DD-LRNA scheme, combining offline RL with low-rank adaptation, is a powerful contribution. It addresses the practical barriers of fine-tuning massive LLMs, making the adaptation feasible. The demonstrated reduction in GPU memory and training time is substantial and critical for adoption. While not explicitly named LoRA, the low-rank adaptation mechanism is conceptually aligned with prominent PEFT methods, indicating a solid technical foundation.
- Critique on "One Model for All Tasks": While the paper champions the "one model for all tasks" vision, in practice NetLLM still requires task-specific adapters (multimodal encoder, networking head, low-rank matrices) for each task, while the core LLM remains frozen. It is more accurate to say "one core LLM with task-specific lightweight adapters for all tasks." This is still a vast improvement over "one full DNN model per task," but the phrasing could be slightly nuanced. The multimodal LLaVa performing worse than Llama2 is an interesting finding, suggesting that multimodal pre-training benefits are not universally transferable and that specialized adapters may still be key for domain-specific multimodal data.
- Inference Overhead: Although DD-LRNA significantly reduces training costs, the inference overhead of LLMs (even optimized with networking heads) remains a concern for extremely low-latency, high-throughput networking control planes. While the authors mention compression techniques as future work, this is a practical bottleneck for widespread real-time deployment. The 0.04s for OPT-1.3B might be acceptable for some tasks (such as CJS or ABR decisions at chunk boundaries), but not for others (e.g., sub-millisecond congestion control).
- Explainability: The call for future work on LLM explainability in networking is crucial. Network operators need to understand why an LLM makes a certain decision (e.g., reducing bitrate, scheduling a job) to trust and debug it, especially when system performance is critical. This is a general LLM challenge but particularly salient in networking.
- Potential for Cross-Domain Transfer: The claim of transferability to other domains (medicine, finance) is plausible given the generic nature of prediction and decision-making tasks. This highlights the broad impact potential of the NetLLM framework beyond networking.
Overall, NetLLM makes a robust and well-validated case for LLM adaptation in networking. It effectively bridges the gap between powerful foundation models and the specific demands of network management, paving the way for more intelligent, adaptive, and maintainable network systems.