
Large Language Model Offloading using Active Inference in 6G Symbiotic IoT


TL;DR Summary

This paper presents an active inference-based offloading method for large language models in 6G symbiotic IoT, optimizing resource scheduling and computation through cloud-edge collaboration for enhanced system efficiency and intelligent inference services.

Abstract

The increasing demand for Large Language Model (LLM) applications in mobile computing poses a challenge for devices with limited resources, as they struggle to efficiently handle complex inference tasks. Deep Reinforcement Learning (DRL), despite its traditional use for offloading tasks to remote servers, exhibits notable limitations, such as data inefficiency, latency insensitivity, and poor adaptability to variable workloads, thereby adversely impacting the performance of LLMs. We present an approach based on active inference for LLM task offloading and cloud-edge computing resource scheduling, especially relevant to emerging 6G networks. These networks are designed to provide enhanced connectivity, reduced latency, and increased data rates. Our approach capitalizes on these strengths to optimize task distribution and maximize resource utilization, fostering a symbiotic relationship between devices and networks. Simulations demonstrate that our method outperforms standard DRL by enhancing data efficiency and better adapting to varying loads, aligning with 6G's emphasis on flexible and responsive networks. By integrating active inference into cloud-edge systems, we develop a more robust and adaptable LLM strategy that is well-suited for the 6G era, promoting a Symbiotic Internet-of-Things (IoT) where devices and networks dynamically collaborate and share resources to fulfill the requirements of advanced applications.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Large Language Model Offloading using Active Inference in 6G Symbiotic IoT

1.2. Authors

  • Xiaoming He, Member, IEEE (College of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China)
  • Yunzhe Jiang (College of Communication and Information Engineering, University of Electronic Science and Technology of China, Chengdu, China)
  • Xiaoming Xu (Beijing KOAL Guoxin Technology Co., Ltd.)
  • Huajun Cui (Digital Intelligence Research Institute, PowerChina, Beijing Engineering Corporation Limited, Beijing, China)
  • Yinqiu Liu, Member, IEEE (College of Computing and Data Science, Nanyang Technological University, Singapore)
  • Mingkai Chen, Member, IEEE (Key Laboratory of Broadband Wireless Communication and Sensor Network Technology, Nanjing University of Posts and Telecommunications, China)
  • Yan Hong, Member, IEEE (College of Textile and Clothing Engineering, Soochow University, Suzhou 215021, China; Corresponding Author)
  • Jie Zhang, Member, IEEE (Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China)

1.3. Journal/Conference

The specific journal or conference for this paper is not explicitly stated in the provided excerpt. However, based on the authors' affiliations and the referencing style (e.g., IEEE member status, IEEE conference/journal citations in the references), it is highly probable that this work was submitted to or published in an IEEE journal or conference, particularly given that reference [12] appeared in "IEEE Transactions on Mobile Computing" and reference [21] at the "IEEE Vehicular Technology Conference (VTC)".

1.4. Publication Year

The publication year is not explicitly stated in the provided header or footer. However, the references largely consist of recent works from 2023 and 2024; for instance, reference [12] is from December 2024 and reference [21] from October 2023, both with very similar titles and some shared authors. This suggests the paper is very recent, likely written or submitted in late 2024 or 2025.

1.5. Abstract

The increasing demand for Large Language Model (LLM) applications in mobile computing poses a challenge for devices with limited resources, as they struggle to efficiently handle complex inference tasks. Deep Reinforcement Learning (DRL), despite its traditional use for offloading tasks to remote servers, exhibits notable limitations, such as data inefficiency, latency insensitivity, and poor adaptability to variable workloads, thereby adversely impacting the performance of LLMs. We present an approach based on active inference for LLM task offloading and cloud-edge computing resource scheduling, especially relevant to emerging 6G networks. These networks are designed to provide enhanced connectivity, reduced latency, and increased data rates. Our approach capitalizes on these strengths to optimize task distribution and maximize resource utilization, fostering a symbiotic relationship between devices and networks. Simulations demonstrate that our method outperforms standard DRL by enhancing data efficiency and better adapting to varying loads, aligning with 6G's emphasis on flexible and responsive networks. By integrating active inference into cloud-edge systems, we develop a more robust and adaptable LLM strategy that is well-suited for the 6G era, promoting a Symbiotic Internet-of-Things (IoT) where devices and networks dynamically collaborate and share resources to fulfill the requirements of advanced applications.


2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the efficient and robust deployment of Large Language Model (LLM) applications in resource-constrained mobile computing environments, particularly within the context of emerging 6G Symbiotic IoT networks.

This problem is crucial because LLMs, despite their powerful capabilities, demand significant computational and memory resources, making their direct execution on mobile and IoT devices challenging. Traditional Deep Reinforcement Learning (DRL) methods, often used for task offloading to remote servers, suffer from several limitations:

  • Data inefficiency: They require extensive data for training.
  • Latency insensitivity: They may not adequately optimize for real-time responsiveness.
  • Poor adaptability to variable workloads: Their performance degrades in dynamic environments with fluctuating demands.

These limitations directly impact the performance of LLMs in mobile settings, creating a gap in effective offloading and resource scheduling strategies.

The paper's entry point is to leverage active inference as an alternative to traditional DRL for LLM task offloading and cloud-edge computing resource scheduling. This is especially pertinent to 6G networks, which promise enhanced connectivity, reduced latency, and higher data rates, offering an opportunity to optimize LLM deployment by fostering a symbiotic relationship between devices and network infrastructure. The innovative idea is to use a rewardless guidance mechanism within active inference to overcome the shortcomings of DRL, promoting more adaptive and efficient LLM operations.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Comprehensive System Model and Mathematical Formulation: The study presents a detailed system model and mathematical formulation specifically tailored to the GPT-J-6B LLM. This model is grounded in empirical data obtained from a server cluster environment and covers both the training and inference stages of the LLM lifecycle, providing a solid analytical foundation.
  • Innovative Active Inference Framework: Capitalizing on recent advancements in active inference approaches, the paper introduces a novel framework designed to address the complexities of inference task delegation and resource distribution for LLMs. This approach is claimed to surpass conventional DRL techniques in terms of convergence and generalization capabilities.
  • Enhanced Performance in Simulations: Through rigorous simulation analysis, the authors demonstrate that their proposed framework yields a strategy with enhanced convergence characteristics. It also outperforms mainstream DRL algorithms in the context of LLM inference tasks, showing superior data efficiency and better adaptation to varying loads. These findings align with the 6G emphasis on flexible and responsive networks.
  • Robust and Adaptable LLM Strategy for 6G Symbiotic IoT: By integrating active inference into cloud-edge systems, the research develops a more robust and adaptable LLM strategy. This strategy is specifically suited for the 6G era, promoting a Symbiotic Internet-of-Things (IoT) where devices and networks dynamically collaborate and share resources to fulfill the requirements of advanced applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following key concepts:

  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data to understand, generate, and process human language. They typically have billions of parameters, allowing them to perform complex tasks like translation, summarization, question answering, and content creation (e.g., GPT series like ChatGPT). Their large size, however, makes them computationally intensive, especially during inference (the process of using a trained model to make predictions or generate outputs).
  • 6G Networks: The sixth generation of wireless communication technology, 6G is envisioned to succeed 5G. It aims to provide even higher data rates, ultra-low latency, massive connectivity, and enhanced intelligence, enabling new applications like holographic communication, omnipresent AI, and Symbiotic IoT.
  • Symbiotic Internet-of-Things (IoT): In IoT, physical devices, vehicles, home appliances, and other items are embedded with sensors, software, and other technologies to connect and exchange data over the internet. A Symbiotic IoT extends this by envisioning a highly collaborative ecosystem where IoT devices and network infrastructure dynamically interact, share resources, and adapt to each other's needs to achieve collective intelligence and efficiency.
  • Cloud-Edge Computing: This is a distributed computing paradigm that combines cloud computing (centralized data centers with vast resources) with edge computing (computation performed closer to the data source, like IoT devices or edge servers). The goal is to reduce latency, save bandwidth, and improve responsiveness by processing data locally at the network edge rather than sending everything to a distant cloud.
    • Cloud Server (CS): A powerful, centralized server located in a data center, offering extensive computational resources.
    • Multi-Access Edge Computing (MEC) Server: A server located closer to end-users (e.g., at a cellular base station), providing localized computation and storage to reduce latency.
  • Deep Reinforcement Learning (DRL): A subfield of machine learning that combines reinforcement learning (where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward) with deep learning (using neural networks to learn representations from data). DRL agents learn optimal policies through trial and error, often in complex environments.
  • Active Inference: A theoretical framework from neuroscience and machine learning that posits that intelligent agents minimize free energy (a measure of surprise or prediction error) to maintain their internal model of the world and act purposefully. Unlike DRL which relies on external reward signals, active inference agents intrinsically seek to improve their predictive models and reduce uncertainty about their environment, leading to goal-oriented behavior. In this paper, it's used with rewardless guidance, meaning it doesn't need explicit, hand-crafted reward functions.
  • Partially Observable Markov Decision Process (POMDP): A mathematical framework for modeling decision-making in situations where the agent's actions are assumed to be a part of a Markov process (where future states depend only on the current state, not the sequence of events that preceded it), but the agent cannot directly observe the underlying state of the environment. Instead, it receives observations that are probabilistically related to the state. This is more realistic for many real-world scenarios than a fully observable Markov Decision Process (MDP).
  • Free Energy Principle: In active inference, this principle states that any self-organizing system that is at equilibrium with its environment must minimize its variational free energy. This is achieved by updating its internal model to predict sensory inputs more accurately and by acting on the environment to make its sensory inputs more consistent with its predictions. It essentially frames all cognitive and biological processes as attempts to minimize long-term surprise.
  • Kullback-Leibler (KL) Divergence ($D_{KL}$): A measure from information theory that quantifies how one probability distribution differs from a second, reference probability distribution. A KL divergence of zero means the two distributions are identical. It is often used in machine learning to measure the difference between a model's predicted distribution and the true distribution (a small numerical example follows this list).
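
To make the definition above concrete, the following minimal NumPy snippet computes the KL divergence between two toy discrete distributions; the numbers are arbitrary illustrations, not values from the paper.

```python
import numpy as np

# Toy discrete distributions (illustrative only).
p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution

kl_pq = np.sum(p * np.log(p / q))   # D_KL(p || q), in nats
kl_qp = np.sum(q * np.log(q / p))   # D_KL(q || p); note KL is asymmetric

print(f"D_KL(p||q) = {kl_pq:.4f}, D_KL(q||p) = {kl_qp:.4f}")
```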

3.2. Previous Works

The paper contextualizes its work by referencing previous research in Large Language Models (LLMs) and Deep Reinforcement Learning (DRL).

3.2.1. Large Language Models (LLMs)

LLMs are characterized by their extensive parameter sets and computational demands. Prior work has focused on making LLM inference more efficient, especially on resource-constrained devices:

  • Device-side inference engines: Xu et al. [22] proposed LLMCad for efficient execution of privacy-sensitive generative tasks on mobile applications. Yi et al. [23] introduced EdgeMoE, an on-device inference engine for hybrid expert-based LLMs, tackling issues of large parameter scale and high runtime cost on edge devices.
  • Security in IoT with LLMs: Ferrag et al. [24] proposed SecurityBERT, a BERT-based architecture to identify cyber threats in IoT networks, showcasing LLM applications beyond natural language processing.
  • LLM Inference Offloading: He et al. [12] (one of the authors of the current paper, or a very closely related work) and Fang et al. [21] (also closely related) explored LLM inference offloading and resource allocation in cloud-edge computing using an active inference approach, directly motivating the current study.

3.2.2. Decision Making with Deep Reinforcement Learning (DRL)

DRL algorithms combine deep neural networks and reinforcement learning to address complex decision and control problems by learning optimal strategies through environmental interactions, without relying on prior knowledge.

  • Resource Management in IoT: Liu et al. [31] integrated wireless power transfer to address limited battery capacity and low computing power in IoT nodes, offloading computational tasks to edge computing servers.
  • Vehicular Networks: Zhang et al. [32] proposed an urban vehicle-mounted cloud-assisted MEC network for computing offloading in dynamic traffic environments.
  • Resource Allocation: Wang et al. [33] introduced a DRL+FL (Deep Reinforcement Learning combined with Federated Learning) intelligent resource allocation model to alleviate communication congestion and degradation of quality of experience (QoE).
  • General DRL Applications: The paper also references broader DRL applications in gaming, robotics, and general resource management [13-15], acknowledging its success but also highlighting its limitations regarding reward functions [16-17] and adaptability in dynamic IoT environments [18-20].

3.3. Technological Evolution

The evolution of LLM deployment in IoT and 6G contexts can be traced through:

  1. Early LLM Development: Initially, LLMs were large, monolithic models requiring significant data center resources for both training and inference (e.g., GPT family mentioned).
  2. Edge/Mobile Optimization: The challenge of deploying LLMs on resource-constrained mobile and edge devices led to specialized inference engines and optimization techniques (e.g., LLMCad, EdgeMoE).
  3. Reinforcement Learning for Offloading: DRL emerged as a promising approach for dynamic task offloading and resource allocation in MEC and IoT to improve efficiency.
  4. Limitations of Traditional DRL: Despite successes, DRL faced issues like data inefficiency and rigid reward functions, particularly in highly dynamic IoT environments.
  5. Emergence of 6G: The promise of 6G (enhanced connectivity, lower latency, higher data rates) created new opportunities and challenges for LLM integration, demanding more responsive and adaptable offloading strategies.
  6. Active Inference for Enhanced Adaptability: This paper fits into the timeline by proposing active inference with rewardless guidance as the next step, aiming to overcome DRL limitations and fully leverage 6G capabilities for Symbiotic IoT. It represents an evolution towards more biologically plausible and adaptable decision-making for resource management.

3.4. Differentiation Analysis

The core innovation of this paper, compared to the main methods in related work, lies in its use of active inference with a rewardless guidance mechanism, distinguishing it from traditional Deep Reinforcement Learning (DRL) approaches.

  • DRL's Limitations: Traditional DRL (as seen in Rainbow DQN, PPO, SAC baselines) relies heavily on explicitly defined reward functions. These functions guide the agent's learning process by providing numerical feedback for actions. However, designing effective reward functions is challenging, often leading to suboptimal generalizability and data inefficiency, especially in dynamic and unpredictable environments like Symbiotic IoT. DRL can also be latency-insensitive and poorly adaptable to variable workloads.
  • Active Inference with Rewardless Guidance: The proposed method replaces these traditional reward models. Instead of learning to maximize an external reward, the agent using active inference with rewardless guidance intrinsically seeks to improve its internal predictive model of the environment and minimize its variational free energy (or "surprise"). This means:
    • Intrinsic Motivation: The agent's learning is driven by an internal desire to understand its surroundings and anticipate future states, rather than chasing external rewards.

    • Enhanced Generalization: By developing a more sophisticated understanding of the environment without rigid reward functions, the agent can generalize better to unseen or dynamic conditions.

    • Adaptability: The rewardless guidance mechanism $rg(s_t, a_t)$ directly incorporates task completion, latency, and prediction accuracy, allowing the agent to dynamically prioritize these factors without needing a predefined reward landscape. This leads to more context-aware decisions.

    • Convergence and Efficiency: The paper claims this approach offers superior convergence and data efficiency compared to DRL, as it leverages the fundamental principles of active inference for more robust learning.

      In essence, while DRL is about learning what to do to get rewards, this active inference approach is about learning how the world works and then acting to make sensory inputs consistent with its predictions, which implicitly achieves desired outcomes like minimal latency and high accuracy.

4. Methodology

4.1. Principles

The core idea behind the proposed method is to leverage active inference for LLM task offloading and resource allocation within cloud-edge networks, particularly for 6G Symbiotic IoT environments. The theoretical basis is that intelligent agents can optimize their behavior by minimizing variational free energy, which acts as an intrinsic drive to improve their internal models of the world and reduce prediction errors.

Instead of relying on explicit reward functions—a common challenge in Deep Reinforcement Learning (DRL)—this approach uses a rewardless guidance mechanism. This mechanism guides the agent towards states that naturally minimize latency and maximize task success and prediction accuracy, aligning with the operational goals of LLM services in IoT. By dynamically adapting to the environment and its own predictions, the agent aims to achieve robust and efficient offloading and resource scheduling in highly dynamic 6G settings.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. System Configuration

The system operates within a multi-LLM user environment that interacts with both a cloud computing hub and edge computing servers.

  • Cloud Server (CS): Equipped with substantial and robust computing resources, capable of accommodating the computational demands of LLM inference tasks.

  • Multi-Access Edge Computing (MEC) Centers: Located close to users, designed to efficiently offload LLM inference tasks by offering reduced latency due to proximity.

  • Terminal Devices (Dev): A diverse array of devices initiating LLM inference task requests.

    • Mobile terminals ($D_{mobi}$, $i = 1, 2, \ldots$): Devices with dynamic locations, such as smartphones, drones, and connected vehicles.
    • Fixed terminals ($D_{unmo}$, $i = 1, 2, \ldots$): Devices with static locations, such as personal computers and workstations.
    • Collectively, these are represented as $Dev \in \{D_{mobi}, D_{unmo}\}, i = 1, 2, \ldots, N_{Dev}$, where $N_{Dev}$ is the total number of devices.
  • Servers (Ser): The MEC and CS are collectively denoted as $Ser \in \{CS, MEC\}, j = 1, 2, \ldots, N_{ser}$, where $N_{ser}$ is the total number of servers.

  • Task Request: At any given time $t$, a terminal $Dev_i$ generates a random offload request for an LLM task $T_t$. A decision algorithm then delegates task $T_t$ to a server $Ser_j$ and allocates the necessary network, computational, and graphics memory resources.

    The following figure (Figure 1 from the original paper) shows the system architecture:

    Fig. 1. Cloud-Edge Framework for Large Language Model Offloading. (The figure is a schematic of the cloud-edge framework for LLM offloading, depicting the Symbiotic IoT system under a collaborative architecture and the workflow of environment-state acquisition, offloading decisions, and policy optimization.)

4.2.2. Task Model Formulation

This section details the inference process for LLM tasks, using the GPT-J-6B model as a representative case. This model has 6 billion parameters arranged in a deep, layered Transformer structure.

The execution of inference within the GPT-J-6B model is broken down into sequential stages:

  1. Input Encoding: The input text is converted into a suitable format by a tokenizer. The resulting sequence is denoted as $\mathbf{x} = [x_1, \ldots, x_n]$, where $x_i$ is the $i$-th token or vocabulary element.

  2. Vector Embedding: The sequence $\mathbf{x}$ is mapped into a vector space by an embedding layer, yielding the vector sequence $[e_1, \ldots, e_n]$, where $e_i$ represents the $i$-th vector in embedding space.

  3. Positional Embedding: A positional encoding is applied to the embedding vectors to incorporate the order of sequence elements, resulting in the positionally encoded sequence $[pe_1, pe_2, \ldots, pe_n]$, where $P$ is the positional encoding matrix.

    Following these initial steps, the detailed mechanics of the attention and multi-head self-attention mechanisms are elucidated.

  • Query, Key, and Value Matrices: The query matrix $Q$, key matrix $K$, and value matrix $V$ are defined as: $Q = PE \times W^q$, $K = PE \times W^k$, $V = PE \times W^v$, where $W^q$, $W^k$, and $W^v$ are parameter matrices (learnable weights), and $PE$ is the positionally encoded sequence.

  • Attention Layer: The Attention_layer function computes the attention scores and applies them to the values (see the code sketch after this list): $$\mathrm{Attention\_layer}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ where:

    • $Q$, $K$, $V$: Query, Key, and Value matrices.
    • $QK^T$: Dot product between Query and Key matrices, measuring similarity.
    • $\sqrt{d_k}$: Scaling factor, where $d_k$ is the dimension of the key vectors. This prevents the dot products from growing too large, pushing the softmax into regions with tiny gradients.
    • $\mathrm{softmax}(\cdot)$: Normalization function that converts raw scores into probabilities, ensuring they sum to 1.
    • $V$: Value matrix, whose rows are weighted and summed based on the attention scores.
  • Multi-Head Self-Attention Layer: The MultiHead_layer performs multiple attention computations in parallel (heads) and then concatenates their results: $$\mathrm{MultiHead\_layer}(Q, K, V) = \mathrm{Concatenation}(\mathrm{head}_1, \ldots)\, W^O,$$ $$\mathrm{head}_i = \mathrm{Attention\_layer}(QW_i^Q, KW_i^K, VW_i^V),$$ where:

    • $\mathrm{Concatenation}(\mathrm{head}_1, \ldots)$: The concatenation of the outputs from the different attention heads.

    • $W^O$: A linear projection matrix that combines the concatenated outputs of the attention heads.

    • $\mathrm{head}_i$: The output of the $i$-th attention head.

    • $W_i^Q \in \mathbb{R}^{d_{model} \times d_q}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$: Parameter matrices specific to the $i$-th head, projecting the input $Q$, $K$, $V$ into different subspaces. $d_{model}$ is the model dimension; $d_q$, $d_k$, $d_v$ are the dimensions of the query, key, and value vectors for a single head.

      The output of the self-attention layer, denoted as $Z$, is derived from the multi-head self-attention computation: $Z = \mathrm{MultiHead\_layer}(Q, K, V)$. This procedure is iterated across multiple layers, corresponding to the model's depth.

  • Feedforward Neural Network: The subsequent stage involves a feedforward neural network. The attention mechanism's output $Z$ serves as input to a multilayer perceptron (MLP) to determine the subsequent output: $Y = \mathrm{MLP}(Z)$, where $\mathrm{MLP}(\cdot)$ encapsulates linear mappings and non-linear activation functions.

  • Task Request and Response: The inference occurs on server $Ser_j$. A task request $T_t$ from $Dev_i$ is dispatched as a $PS_x$-sized packet to $Ser_j$. Upon processing, $Ser_j$ returns a $PS_y$-sized packet containing the predicted text to $Dev_i$.
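
To make the attention equations above concrete, here is a minimal NumPy sketch of scaled dot-product attention, multi-head self-attention, and a toy feedforward step. It is an illustration of the standard formulas, not the paper's implementation; the dimensions are deliberately tiny compared with GPT-J-6B, and the random weights stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) token-to-token similarities
    return softmax(scores, axis=-1) @ V      # attention-weighted sum of values

def multihead_layer(PE, Wq, Wk, Wv, Wo, n_heads):
    """Run attention per head on projected inputs, concatenate, then project with W^O."""
    Q, K, V = PE @ Wq, PE @ Wk, PE @ Wv      # project the positionally encoded sequence
    heads = [attention_layer(Qh, Kh, Vh)
             for Qh, Kh, Vh in zip(np.split(Q, n_heads, axis=-1),
                                   np.split(K, n_heads, axis=-1),
                                   np.split(V, n_heads, axis=-1))]
    return np.concatenate(heads, axis=-1) @ Wo   # Z = MultiHead_layer(Q, K, V)

# Toy sizes (GPT-J-6B itself uses d_model = 4096 and 16 heads).
n_tokens, d_model, n_heads = 5, 32, 4
PE = rng.normal(size=(n_tokens, d_model))        # stand-in for the encoded sequence
Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))

Z = multihead_layer(PE, Wq, Wk, Wv, Wo, n_heads)
Y = np.maximum(0.0, Z @ (0.1 * rng.normal(size=(d_model, d_model))))  # toy MLP(Z)
print(Z.shape, Y.shape)   # (5, 32) (5, 32)
```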

4.2.3. Terminal Mobility Model

The system model categorizes nodes into stationary and mobile.

  • Stationary Nodes: MEC servers, CS, and fixed terminals ($D_{unmo}$).

  • Mobile Nodes: Mobile terminals ($D_i^{mobi}$) such as internet-connected vehicles, smartphones, and unmanned aerial vehicles (UAVs). These move at a predetermined velocity, intermittently generating LLM inference task requests.

    To determine the proximity of mobile terminal $D_i^{mobi}$ to computational endpoints (MEC and CS), the Euclidean distance metric is used. Given coordinates $(x_1, y_1, z_1)$ for $D_i^{mobi}$ and $(x_2, y_2, z_2)$ for the endpoint, the distance $d$ is calculated using the standard Euclidean distance formula.

The paper provides a detailed characterization of communication channels:

  • Ground-to-Ground (G2G) Channels: Both transmitter and receiver are terrestrial devices. The path loss $PL_{G2G}$ is: $$PL_{G2G} = 128.1 + 37.6 \log(d),$$ where $d$ is the distance between the transmitter and receiver in kilometers.
  • Ground-to-Air (G2A) Channels: One end is a ground device, and the other is an aerial device. The path loss $PL_{G2A}$ is: $$PL_{G2A} = 10\alpha \log(d) + C,$$ where:
    • $\alpha$: Path loss exponent, affected by environmental factors (building density, type, height, vegetation).
    • $d$: Euclidean distance between sender and receiver.
    • $C$: A constant depending on operating frequency and antenna gain.
  • Air-to-Air (A2A) Channels: Facilitate relay communication between UAVs. The path loss $PL_{A2A}$ is defined as: $$PL_{A2A} = 10\alpha \log(d),$$ where:
    • $\alpha$: Path loss exponent, which can be relatively small for high-altitude UAVs operating in line-of-sight conditions.
    • $d$: Aerial distance between UAVs (a short code sketch of these path-loss models follows).
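
The three path-loss expressions above translate directly into code. The sketch below assumes base-10 logarithms (the usual convention for dB path-loss models) and uses illustrative parameter values, since the paper does not list concrete numbers for the exponents, the constant C, or the link distances here.

```python
import math

def path_loss_g2g(d_km: float) -> float:
    """Ground-to-ground path loss in dB: PL = 128.1 + 37.6 log10(d), d in km."""
    return 128.1 + 37.6 * math.log10(d_km)

def path_loss_g2a(d: float, alpha: float, C: float) -> float:
    """Ground-to-air path loss in dB: PL = 10 * alpha * log10(d) + C."""
    return 10.0 * alpha * math.log10(d) + C

def path_loss_a2a(d: float, alpha: float) -> float:
    """Air-to-air (UAV relay) path loss in dB: PL = 10 * alpha * log10(d)."""
    return 10.0 * alpha * math.log10(d)

# Illustrative links (distances and exponents are assumptions, not paper values).
print(f"G2G, 0.5 km:       {path_loss_g2g(0.5):.1f} dB")
print(f"G2A, 500 m, a=2.5: {path_loss_g2a(500, alpha=2.5, C=40.0):.1f} dB")
print(f"A2A, 800 m, a=2.1: {path_loss_a2a(800, alpha=2.1):.1f} dB")
```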

4.2.4. Communication Model

In the system model, terminal $Dev_i$ initiates task $T_t$ at time $t$. The decision algorithm offloads this request to server $Ser_j$, which processes it and returns the result $P_y$ to $Dev_i$. Both offloading and result transmission depend on the wireless communication channel's quality. The distance $d$ is assumed constant during an effective communication session.

Drawing from information theory, the data transmission rate $R$ is given by the Shannon-Hartley theorem-like formula: $$R = W \log_2\left(1 + \frac{\mathrm{Power} \cdot G}{N}\right),$$ where:

  • $W$: Communication channel bandwidth (in Hz).

  • $\mathrm{Power}$: The device's transmission power (in Watts).

  • $G$: The channel gain (unitless).

  • $N$: The noise power due to thermal fluctuations within the channel (in Watts).

    The channel gain $G$ is further defined based on antenna gain, path loss, and shadow fading: $$G = g - PL - X_\sigma,$$ where:

  • $g$: Antenna gain (in dB), unique to the receiving antenna.

  • PL: Path loss (in dB), a characteristic of the channel (e.g., $PL_{G2G}$, $PL_{G2A}$, $PL_{A2A}$).

  • $X_\sigma$: Shadow fading factor (in dB), commonly modeled as a zero-mean Gaussian random variable, $X_\sigma \sim \mathcal{N}(0, \sigma^2)$. $g$ and $\sigma$ are typically treated as constants (a rate-computation sketch follows).
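
Combining the rate formula and the gain model above, the following sketch computes an achievable rate. Because $G = g - PL - X_\sigma$ is expressed in dB while the Shannon term needs a linear ratio, the code converts dB to linear before forming the SNR; that conversion, and all numeric values, are assumptions for illustration rather than settings from the paper.

```python
import math, random

def db_to_linear(x_db: float) -> float:
    return 10.0 ** (x_db / 10.0)

def channel_gain_db(g_db: float, pl_db: float, sigma_db: float) -> float:
    """G = g - PL - X_sigma (all in dB), with zero-mean Gaussian shadow fading."""
    x_sigma = random.gauss(0.0, sigma_db)
    return g_db - pl_db - x_sigma

def data_rate_bps(bandwidth_hz: float, tx_power_w: float, gain_db: float,
                  noise_w: float) -> float:
    """R = W * log2(1 + Power * G / N), with the dB gain converted to linear."""
    snr = tx_power_w * db_to_linear(gain_db) / noise_w
    return bandwidth_hz * math.log2(1.0 + snr)

# Illustrative link: ~116.8 dB G2G loss (0.5 km), 10 dBi antenna gain, 4 dB
# shadowing, 20 MHz bandwidth, 0.2 W transmit power, ~thermal noise floor.
G = channel_gain_db(g_db=10.0, pl_db=116.8, sigma_db=4.0)
R = data_rate_bps(bandwidth_hz=20e6, tx_power_w=0.2, gain_db=G, noise_w=8e-14)
print(f"G = {G:.1f} dB, R = {R / 1e6:.1f} Mbit/s")
```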

4.2.5. Data Transfer Model

Data transfer is divided into two phases: task offloading and result return.

  • Task Offloading Phase: The decision algorithm dispatches a task packet $PS_x$ associated with task $T_t$ from terminal $Dev_i$ to server $Ser_j$. This phase encompasses four stages:

    1. Transmission: Time taken to send the packet.
    2. Propagation: Time taken for the signal to travel across the distance.
    3. Queuing: Time spent waiting in the server's task queue.
    4. Computation: Time spent by the server processing the task. The transmission latency is $\frac{PS_x}{R}$, where $PS_x$ is the task packet size and $R$ is the data transmission rate. The propagation latency is $\frac{d_1}{c}$, where $d_1$ is the distance between sender and receiver, and $c$ is the speed of light. Queuing time ($L_q$) depends on the server's queue status and the remaining processing time of tasks ahead. Task processing time ($L_c$) depends on server computational capabilities and inference frameworks.
  • Result Return Phase: The server returns a result packet $PS_y$ to $Dev_i$. This phase involves only transmission and propagation. The transmission latency is $\frac{PS_y}{R}$, and the propagation latency is $\frac{d_2}{c}$, where $d_2$ is the distance during result return (which may differ from $d_1$).

  • Total Time Delay: The total time delay $L_{T_t}$ for a successfully offloaded task $T_t$ is computed as: $$L_{T_t} = \frac{PS_x + PS_y}{R} + \frac{d_1 + d_2}{c} + L_q + L_c,$$ where:

    • $PS_x$: Size of the task packet.
    • $PS_y$: Size of the result packet.
    • $R$: Data transmission rate.
    • $d_1$: Distance during task offloading.
    • $d_2$: Distance during result return.
    • $c$: Speed of light.
    • $L_q$: Queuing latency.
    • $L_c$: Computation latency.
  • Delay Constraint: The maximum acceptable delay for all task requests $T_t$ is constrained by $t_{max}$. If $L_{T_t} > t_{max}$, the task $T_t$ is considered infeasible and abandoned (a minimal delay-budget computation is sketched after the figure below).

    The following figure (Figure 2 from the original paper) shows the correlation between CPU allocation and time cost:

    Fig. 2. Correlation of CPU Allocation with Time Cost. (The chart plots time cost against the CPU allocation ratio, comparing edge computing (MEC) and cloud computing: time cost decreases for both as CPU allocation increases, with the cloud's overall time cost lower than the MEC's.)
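
As noted above, the end-to-end delay $L_{T_t}$ is simply a sum of transmission, propagation, queuing, and computation terms. The sketch below evaluates it and checks the $t_{max}$ constraint; packet sizes, queuing time, and compute time are illustrative assumptions, not the paper's simulation values.

```python
C_LIGHT = 3e8  # speed of light in m/s

def total_delay_s(ps_x_bits: float, ps_y_bits: float, rate_bps: float,
                  d1_m: float, d2_m: float, l_q_s: float, l_c_s: float) -> float:
    """L_Tt = (PS_x + PS_y)/R + (d_1 + d_2)/c + L_q + L_c."""
    transmission = (ps_x_bits + ps_y_bits) / rate_bps
    propagation = (d1_m + d2_m) / C_LIGHT
    return transmission + propagation + l_q_s + l_c_s

def is_feasible(l_tt_s: float, t_max_s: float) -> bool:
    """A task is abandoned when its total delay exceeds t_max."""
    return l_tt_s <= t_max_s

# Illustrative task: 10 kB request, 50 kB response, 50 Mbit/s link, 300 m each way,
# 0.5 s of queuing, and 6 s of LLM inference on the chosen server.
L = total_delay_s(8e4, 4e5, 50e6, 300, 300, l_q_s=0.5, l_c_s=6.0)
print(f"L_Tt = {L:.2f} s, feasible within t_max = 15 s: {is_feasible(L, 15.0)}")
```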

4.2.6. Problem Formulation

The primary objective is to find the most efficient strategy for delegating computationally demanding LLM inference operations to either edge or cloud computing infrastructures, considering resource limitations at the end-user level. The goal is to reduce the mean response time for all LLM inference jobs while enhancing the precision of the model's predictive outcomes.

  • System Utility Function: The total system utility $U$ is encapsulated in the following function: $$U(L_{T_\cdot}, P_{T_\cdot}) = \frac{1}{\mathrm{avg}\left(\sum_{T_t} L_{T_t}\right)} + \mathrm{avg}\left(\sum_{T_t} P_{T_t}\right),$$ where:

    • $\mathrm{avg}(\cdot)$: Represents the average function.
    • $L_{T_\cdot}$: Represents the set of latencies for all tasks $T_t$. The term $\frac{1}{\mathrm{avg}(\sum_{T_t} L_{T_t})}$ maximizes utility by minimizing average latency.
    • $P_{T_\cdot}$: Represents the set of prediction accuracies for all tasks $T_t$. The term $\mathrm{avg}(\sum_{T_t} P_{T_t})$ maximizes utility by maximizing average prediction accuracy.
  • Optimization Problem: The ultimate objective is to maximize this overall system utility $U$: $$\begin{array}{rl} \text{maximize} & U(L_{T_\cdot}, P_{T_\cdot}) \\ \text{subject to} & \begin{cases} W_{\mathrm{rest}}, C_{\mathrm{rest}}, M_{\mathrm{rest}} \geq 0, \\ L_{T_t} \leq t_{\max}, \quad \forall t \end{cases} \end{array}$$ where:

    • $W_{\mathrm{rest}}$, $C_{\mathrm{rest}}$, $M_{\mathrm{rest}}$: Represent the residual bandwidth, computational, and graphics memory resources available at each MEC and CS at time $t$. These must be non-negative.
    • $L_{T_t} \leq t_{max}$: Each task's total delay must not exceed the maximum acceptable delay.
  • Enhanced Problem Formulation (with Economic and Risk Factors): The formulation can be extended to include economic considerations and risk management, which are crucial in practical cloud-edge computing systems: $$\begin{array}{rl} \text{maximize} & U(L_T, P_T, \mathrm{Price}, \mathrm{Risk}) \\ \text{subject to} & \begin{cases} W_{\mathrm{rest}}, C_{\mathrm{rest}}, M_{\mathrm{rest}} \geq 0, \\ L_{T_t} \leq t_{\max}, \quad \forall t, \\ \mathrm{Price} \leq \mathrm{Budget}, \\ \mathrm{Risk} \leq \mathrm{Threshold} \end{cases} \end{array}$$ where:

    • $\mathrm{Price}$: The cost associated with offloading tasks.
    • $\mathrm{Risk}$: The risk associated with offloading tasks (e.g., security, reliability).
    • $\mathrm{Budget}$: The maximum allowable cost.
    • $\mathrm{Threshold}$: The maximum allowable risk. This integration allows for a more comprehensive approach to managing resources by considering economic and reliability aspects in offloading decisions (a toy computation of the basic utility follows).
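
The basic utility $U$ can be evaluated directly once per-task latencies and accuracies are known. A toy sketch follows, assuming the intended reading of $\mathrm{avg}(\sum \cdot)$ is a plain mean, i.e. "one over the mean latency plus the mean accuracy"; that interpretation, and the example numbers, are assumptions.

```python
def system_utility(latencies_s, accuracies):
    """U = 1/avg(latencies) + avg(accuracies); reading avg(sum(.)) as a plain
    mean is an assumption based on the symbol explanations above."""
    mean_latency = sum(latencies_s) / len(latencies_s)
    mean_accuracy = sum(accuracies) / len(accuracies)
    return 1.0 / mean_latency + mean_accuracy

def constraints_ok(w_rest, c_rest, m_rest, latencies_s, t_max_s):
    """Residual bandwidth/compute/GPU-memory must stay non-negative and every
    task must finish within t_max."""
    return min(w_rest, c_rest, m_rest) >= 0 and all(l <= t_max_s for l in latencies_s)

# Two completed tasks with illustrative latencies (s) and pass@100-style accuracies.
print(system_utility([8.0, 10.0], [0.18, 0.15]))          # ~0.28
print(constraints_ok(5.0, 2.0, 1.0, [8.0, 10.0], 15.0))   # True
```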

4.2.7. Active Inference Based Offloading Strategy

4.2.7.1. State and Action Representations

The agent (the decision-making entity) needs to understand the environment state and define its possible actions.

  • Environment State ($S_t$): The state of server $Ser_j$ is represented as $s_j' = [C_j, W_j, M_j]^T$.
    • $C_j$: Remaining computational resources of server $j$.
    • $W_j$: Remaining bandwidth of server $j$.
    • $M_j$: Remaining graphics memory resources of server $j$. The overall state $S_t$ is formed by integrating the distance matrix $D$ from terminal $Dev_i$ to $Ser_j$ with the status of all servers: $[D; s_1'; \ldots; s_{N_{ser}}']$.
  • Agent Action ($a_t$): The agent's actions involve delegating the LLM inference task $T_t$ to a specific server and distributing resources. The vector of actions executed by the agent at time $t$ is: $a_t = [j, c_j, w_j, m_j]$, where:
    • $j$: The unique index of the server ($Ser_j$) to which the task is delegated.
    • $c_j$: The amount of computational resources allocated for $T_t$ by $Ser_j$.
    • $w_j$: The amount of channel bandwidth allocated.
    • $m_j$: The amount of graphics memory allocated. An action is invalid if the allocated resources exceed the server's remaining resources, meaning $T_t$ cannot be executed at that time (see the sketch after this list).
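
A minimal sketch of how these state and action vectors might be assembled and validated in code; the array layout, units, and server numbers are assumptions for illustration, not structures defined by the paper.

```python
import numpy as np

def build_state(distances, servers):
    """S_t = [D; s_1'; ...; s_Nser'], where each s_j' = [C_j, W_j, M_j]."""
    return np.concatenate([np.asarray(distances, dtype=float).ravel(),
                           np.asarray(servers, dtype=float).ravel()])

def action_is_valid(action, servers):
    """a_t = [j, c_j, w_j, m_j]; invalid if the requested allocation exceeds
    what server j still has available."""
    j, c_j, w_j, m_j = action
    C_j, W_j, M_j = servers[int(j)]
    return c_j <= C_j and w_j <= W_j and m_j <= M_j

# Remaining [compute, bandwidth, GPU memory] per server (illustrative units).
servers = [(16.0, 40.0, 24.0),    # an MEC node
           (64.0, 100.0, 80.0)]   # the cloud server
distances_m = [120.0, 3500.0]     # distance from Dev_i to each server

s_t = build_state(distances_m, servers)
a_t = (1, 8.0, 20.0, 12.0)        # offload to server 1 with these allocations
print(s_t.shape, action_is_valid(a_t, servers))   # (8,) True
```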

4.2.7.2. Rewardless Guidance in Active Inference

The proposed algorithm uses a rewardless guidance mechanism, moving away from conventional DRL's reliance on environmental reward signals. The focus of offloading decisions is to ensure tasks are executed with minimal latency and maximal success rate.

  • Rewardless Guidance Function (rg): This concept is formalized as: $$rg(s_t, a_t) = tc\left(\frac{1}{L_{T_t}} + P_{T_t}\right),$$ where:
    • tc: A binary variable. It is 1 if task $T_t$ is completed successfully, and 0 otherwise. This ensures that only successfully completed tasks contribute to the guidance.
    • $L_{T_t}$: The total time delay for task $T_t$. The term $\frac{1}{L_{T_t}}$ indicates that rewardless guidance increases as latency decreases (i.e., faster completion is better).
    • $P_{T_t}$: The prediction accuracy for task $T_t$. Rewardless guidance increases as prediction accuracy increases. A higher $rg(s_t, a_t)$ value indicates that action $a_t$ is more aligned with state $s_t$ under rewardless guidance, increasing the likelihood of selecting $a_t$. This effectively guides the agent towards actions that achieve both low latency and high accuracy without an explicit external reward signal (a direct implementation follows this list).
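
The guidance term is simple enough to state directly in code; the example values below are illustrative.

```python
def rewardless_guidance(task_completed: bool, latency_s: float, accuracy: float) -> float:
    """rg(s_t, a_t) = tc * (1/L_Tt + P_Tt): zero for failed tasks, larger for
    completed tasks that finish quickly and predict accurately."""
    tc = 1.0 if task_completed else 0.0
    return tc * (1.0 / latency_s + accuracy)

print(rewardless_guidance(True, 6.0, 0.18))    # ~0.35: fast, fairly accurate
print(rewardless_guidance(True, 12.0, 0.18))   # ~0.26: same accuracy, slower
print(rewardless_guidance(False, 4.0, 0.20))   # 0.0: failed tasks contribute nothing
```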

4.2.7.3. Framework of Active Inference Decision

The active inference-based decision-making approach enables the agent to make internal decisions and interact with its environment. The agent-environment interaction is defined as a Partially Observable Markov Decision Process (POMDP) [35].

  • POMDP Description:

    • At time $t-1$, the agent is in state $s_{t-1}$ and selects action $a_{t-1}$ with probability $P$.
    • It then transitions to state $s_t$ at time $t$. This transition, $P(s_t \mid s_{t-1}, a_{t-1})$, is probabilistic.
    • In a POMDP, the agent cannot always observe the true environmental state. Instead, it continuously receives observations $o_t$ based on the probability $P(o_t \mid s_t)$.
  • Generative Model: The agent uses $p(o, s \mid \theta)$ to predict external environmental conditions, where $\theta$ denotes the learnable parameters of its internal model.

  • Free Energy Minimization: Adhering to the free energy principle, the agent reduces free energy through two processes:

    1. Internal Model Building: It engages in the POMDP, gathers observations $o_t$, and builds an internal model $p(o, s \mid \theta)$ to represent the environment. This model acts as the agent's internal representation of the external world.
    2. Action Planning: During action planning, the agent uses this model to identify and execute actions that reduce free energy. This active inference mechanism improves the agent's ability to understand its surroundings, anticipate future states, and perform goal-oriented actions.
  • Free Energy ($F$): The goal of active inference optimization is to enhance the evidence of the agent's generative model, thereby reducing free energy. By setting expected preferences, $p(o, s, \theta)$ can be directed toward achieving this goal state. The agent seeks to minimize the free energy, denoted as: $$F = D_{KL}\big(q(s, \theta) \,\Vert\, p(o, s, \theta)\big),$$ where:

    • $D_{KL}(\cdot \Vert \cdot)$: Kullback-Leibler (KL) divergence.
    • $q(s, \theta)$: Represents the agent's belief about future variables (its approximate posterior distribution over states and parameters).
    • $p(o, s, \theta)$: The agent's generative model (its prior distribution over observations, states, and parameters).
    • $F$: Closely related to the (negative) evidence lower bound (ELBO) [36]; it guides the agent's strategy selection. The agent selects the strategy $\pi$ that minimizes $F$.
  • Anticipated Future Free Energy ($\tilde{F}$): The agent's objective extends beyond a single time point $t$. It seeks to minimize the anticipated future free energy, formulated as: $$\tilde{F} = D_{KL}\big(q(o_{0:T}, s_{0:T}, \theta, \pi) \,\Vert\, p(o_{0:T}, s_{0:T}, \theta)\big),$$ where:

    • $o_{0:T}$: The sequence of observations made by the agent from time 0 to $T$.
    • $s_{0:T}$: The sequence of states experienced by the agent over the same interval.
    • $q(o_{0:T}, s_{0:T}, \theta, \pi)$: The agent's subjective probability distribution over future variables, conditioned on a policy $\pi$.
    • $p(o_{0:T}, s_{0:T} \mid \theta)$: The agent's generative model, with $\theta$ being the parameters of the underlying neural network. The optimal policy $\pi^*$ is derived by minimizing $\tilde{F}$. This minimization is achieved by aligning the generative model's output distribution $p(o_{0:T}, s_{0:T}, \theta)$ with the true state distribution $q(o_{0:T}, s_{0:T}, \theta, \pi)$, such that: $$D_{KL}\big(q(o_{0:T}, s_{0:T}, \theta, \pi) \,\Vert\, p(o_{0:T}, s_{0:T}, \theta)\big) = 0 \Rightarrow \tilde{F} = 0.$$ This implies that the agent's beliefs about future observations and states, given its policy, become identical to its generative model's predictions, minimizing surprise and uncertainty (a toy KL illustration follows this list).
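
Since both $F$ and $\tilde{F}$ reduce to KL divergences between the agent's beliefs and its generative model, a toy discrete example shows what minimizing them means in practice. The three-state setup below is purely illustrative; the real quantities also range over observations and the parameters $\theta$.

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """D_KL(q || p) for discrete probability vectors."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

prior_p   = np.array([0.5, 0.3, 0.2])   # generative-model prediction over 3 states
belief_q1 = np.array([0.2, 0.3, 0.5])   # mismatched belief -> high free energy
belief_q2 = np.array([0.5, 0.3, 0.2])   # matched belief    -> free energy ~ 0

print(kl_divergence(belief_q1, prior_p))  # > 0: the agent is "surprised"
print(kl_divergence(belief_q2, prior_p))  # ~ 0: the minimum the agent seeks
```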

4.2.7.4. Algorithm 1: Guided Active Inference Offloading without Rewards

The full process of the proposed active inference algorithm is detailed as follows:

Algorithm 1: Guided Active Inference Offloading without Rewards

  1. Require:
    • Transition probabilities $P(s_t \mid s_{t-1}, a_{t-1})$.
    • Initial (empty) policy $\pi$.
    • Optimization involves $I$ iterations.
    • Considers $J$ potential policies, from which the top $k$ are selected.
    • Process executed over $n_{episodes}$ episodes, each with $n_{steps}$ steps.
    • Ensemble model parameterized by $\theta$.
  2. Goal: Refine the initial policy $\pi$ into an optimized strategy $\pi^*$.
  3. Ensure: Optimized strategy $\pi^*$.
  4. for every episode do
  5.   $t \gets 0$ (initialize the time step).
  6.   Reset $s_t$ (reset the environment state for a new episode).
  7.   for every step do
  8.     $Dev_i$ generates task $T_t$.
  9.     for every iteration $i$ do
  10.       Derive a set of $J$ potential policies from $q(\pi)$.
  11.       for each candidate policy $j$ do
  12.         Get $\pi_j \sim q(\pi)$ (sample a candidate policy from the policy distribution).
  13.         Compute $r_1$ by minimizing $\tilde{F}$ (the guidance component based on the anticipated future free energy).
  14.         Compute $r_2$ (the rewardless-guidance component).
  15.       end for
  16.       Rank policies $\pi_j$ based on $r = r_1 + r_2$ and select the top $k$ (policies are scored with the combined guidance $r$, and the best $k$ are kept).
  17.     end for
  18.     Adjust $\pi$ based on the top $k$ policies (the current policy $\pi$ is updated using information from the selected candidates).
  19.     Choose action $a_t$ based on $\pi$ (the agent selects an action $a_t$ according to the updated policy).
  20.     Obtain $s_{t+1}$, $L_{T_t}$, $P_{T_t}$, and check completion by applying $a_t$ (the environment transitions to $s_{t+1}$; the task's latency, prediction accuracy, and completion status are observed).
  21.     Store $(s_t, a_t, L_{T_t}, P_{T_t}, s_{t+1})$ (the experience tuple is stored for later learning).
  22.     Update $s_{t+1}$ (the current state is advanced for the next step).
  23.   end for
  24.   Train ensemble model $\theta$ (the parameters of the internal generative model are updated using the collected experience).
  25. end for
  26. return $\pi^* = \pi$ (the optimized policy $\pi^*$ is returned).
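
The listing above can be read as a sampling-based policy search. Below is a loose Python sketch of the same loop under stated assumptions: `env`, `score_policy`, and `train_ensemble` are hypothetical interfaces (not APIs from the paper), `score_policy` is assumed to return the combined guidance $r = r_1 + r_2$ (negative anticipated free energy plus rewardless guidance) for a candidate in the current state, and the "adjust $\pi$" step is realized as a cross-entropy-style refit toward the top-$k$ candidates, which is one simple concrete choice.

```python
import numpy as np

def guided_active_inference_offloading(env, score_policy, train_ensemble,
                                       action_dim=4, n_episodes=10, n_steps=20,
                                       n_iterations=3, n_candidates=16, top_k=4,
                                       rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1 with a cross-entropy-style policy refit (an assumption)."""
    mu = np.zeros(action_dim)                  # mean of the candidate distribution q(pi)
    replay = []                                # stored (s, a, L_Tt, P_Tt, s') tuples
    for _ in range(n_episodes):
        s_t = env.reset()                      # step 6: reset the environment state
        for _ in range(n_steps):               # step 8: a new task T_t each step
            for _ in range(n_iterations):      # steps 9-17: candidate search
                candidates = mu + rng.normal(scale=0.5, size=(n_candidates, action_dim))
                scores = np.array([score_policy(c, s_t) for c in candidates])
                top = candidates[np.argsort(scores)[-top_k:]]
                mu = top.mean(axis=0)          # step 18: adjust pi toward the top-k
            a_t = mu                           # step 19: act with the refined policy
            s_next, latency, accuracy, done = env.step(a_t)          # step 20
            replay.append((s_t, a_t, latency, accuracy, s_next))     # step 21
            s_t = s_next                       # step 22
            if done:
                break
        train_ensemble(replay)                 # step 24: update the ensemble model theta
    return mu                                  # the optimized strategy pi*
```

In this reading, the returned mean plays the role of $\pi^*$; a full implementation would additionally maintain the generative (ensemble) model used inside `score_policy` to estimate $\tilde{F}$.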

5. Experimental Setup

5.1. Datasets

The experimental validation utilized the HumanEval dataset [38], published by OpenAI.

  • Source: OpenAI.
  • Scale and Characteristics: The dataset comprises 164 programming problems. Each problem includes:
    • Function signatures.
    • String annotations (docstrings).
    • Code bodies (the solution).
    • Test units (to verify correctness).
  • Domain: Programming tasks, specifically code generation and completion.
  • Creation Method: The problems were created manually (handwritten creation) to ensure accuracy and non-repeatability.
  • Language: Problems are articulated in Python, with descriptive sections (like comments) in English.
  • Purpose: This dataset is suitable for evaluating LLMs on their ability to generate functional and correct code, making it relevant for tasks that involve LLM inference where the output quality (and thus the accuracy $P_{T_t}$) is critical.

5.2. Evaluation Metrics

The paper uses several standard metrics to evaluate the performance of its proposed method and the baseline DRL algorithms. While the paper does not provide explicit mathematical formulas for these metrics, their conceptual definitions are standard in reinforcement learning and system performance evaluation.

  1. Total Reward (or Sum Reward):

    • Conceptual Definition: In traditional DRL, total reward is the cumulative sum of rewards an agent receives over an episode or a series of interactions with the environment. It is the primary objective function for DRL agents, indicating how well the agent is achieving its goals. In this paper, for the proposed method, total reward is used solely for performance assessment, decoupled from the action selection mechanism, which uses rewardless guidance instead. For baselines, it's the direct optimization target.
    • Mathematical Formula: For an episode of length TT: $ \text{Total Reward} = \sum_{t=0}^{T} R_t $
    • Symbol Explanation:
      • $R_t$: The reward received by the agent at time step $t$.
      • $T$: The total number of time steps (or horizon) in an episode.
  2. Task Completion Rate:

    • Conceptual Definition: This metric quantifies the percentage of LLM inference tasks that are successfully completed within their respective maximum acceptable delay ($t_{max}$) and without invalid resource allocations. It measures the reliability and effectiveness of the offloading strategy in handling the workload.
    • Mathematical Formula: $ \text{Task Completion Rate} = \frac{\text{Number of Successfully Completed Tasks}}{\text{Total Number of Tasks Requested}} \times 100\% $
    • Symbol Explanation:
      • Number of Successfully Completed Tasks: Count of tasks that meet all constraints (latency, resources) and produce a result.
      • Total Number of Tasks Requested: Total number of LLM inference tasks initiated by terminals.
  3. Mean Latency:

    • Conceptual Definition: Mean latency refers to the average total time delay ($L_{T_t}$) for all LLM inference tasks from initiation to result return. Minimizing latency is crucial for responsive mobile computing and real-time IoT applications.
    • Mathematical Formula: $ \text{Mean Latency} = \frac{1}{N_{completed}} \sum_{k=1}^{N_{completed}} L_{T_k} $
    • Symbol Explanation:
      • $N_{completed}$: The number of successfully completed tasks.
      • $L_{T_k}$: The total time delay for the $k$-th successfully completed task, as defined in Section 4.2.5.
  4. Mean Pass@100:

    • Conceptual Definition: This metric, specifically mentioned in the context of the HumanEval dataset, likely refers to the average prediction accuracy across tasks. Pass@k is a common metric for code generation models (such as GPT-J-6B on HumanEval), indicating the proportion of problems for which at least one of $k$ generated solutions passes the unit tests. Here, pass@100 estimates, for each problem, the probability that at least one of 100 generated solutions passes, and the mean pass@100 averages these scores across all problems (the standard estimator is sketched below). In the context of the paper's utility function, it directly relates to $P_{T_t}$, the prediction accuracy of task $T_t$. A higher value indicates better quality or accuracy of the LLM's output.
    • Mathematical Formula: If $P_{T_t}$ is the accuracy for task $T_t$ (e.g., its pass@100 score), then: $ \text{Mean Pass@100} = \frac{1}{N_{completed}} \sum_{k=1}^{N_{completed}} P_{T_k} $
    • Symbol Explanation:
      • $N_{completed}$: The number of successfully completed tasks.
      • $P_{T_k}$: The prediction accuracy (e.g., pass@100 score) for the $k$-th successfully completed task.
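
The paper does not spell out how pass@k is computed; the snippet below uses the standard unbiased estimator from the HumanEval paper [38] (with $n$ generations per problem, $c$ of them correct), which is the usual way such scores are obtained.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn from
    n generations (c of which are correct) passes the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With n = 100 generations and c = 3 correct solutions for one problem:
print(pass_at_k(100, 3, 1))     # ~0.03
print(pass_at_k(100, 3, 10))    # ~0.27
print(pass_at_k(100, 3, 100))   # 1.0 (at least one of the 100 samples is correct)

# Mean pass@100 over several problems (correct counts here are illustrative).
counts = [0, 3, 10]
print(sum(pass_at_k(100, c, 100) for c in counts) / len(counts))
```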

5.3. Baselines

The paper compared its proposed method against three prominent Deep Reinforcement Learning (DRL) algorithms:

  1. Rainbow DQN [39]:

    • Description: Rainbow DQN is an advanced DRL algorithm that combines several key improvements to the original Deep Q-Network (DQN). These improvements include Double DQN (to reduce overestimation bias), Prioritized Experience Replay (to sample more important experiences more frequently), Dueling Networks (to separate state value and advantage functions), Multi-step Learning (to use returns from multiple steps), NoisyNets (for exploration), and Distributional RL (to learn a distribution over returns rather than just the expected return). It's known for its strong performance in environments with discrete action spaces.
    • Representativeness: It represents a state-of-the-art value-based DRL method, known for its stability and performance in various RL benchmarks.
  2. PPO (Proximal Policy Optimization) [40]:

    • Description: PPO is a policy gradient DRL algorithm that strikes a balance between ease of implementation, sample efficiency, and performance. It works by optimizing a surrogate objective function with a clipping mechanism to constrain policy updates, preventing them from becoming too large and destabilizing training. PPO is widely used and performs well in both discrete and continuous action spaces.
    • Representativeness: It is one of the most popular and robust policy gradient methods, often considered a strong baseline for many DRL tasks.
  3. SAC (Soft Actor-Critic) [41]:

    • Description: SAC is an off-policy actor-critic DRL algorithm that optimizes a stochastic policy in an entropy-regularized reinforcement learning framework. The entropy regularization encourages exploration and helps prevent the policy from collapsing to a single action. SAC is known for its stability, sample efficiency, and effectiveness in continuous control tasks.

    • Representativeness: It represents a state-of-the-art actor-critic method, particularly effective in continuous control settings relevant to resource allocation.

      These algorithms were chosen as benchmarks because they are state-of-the-art (SOTA) and effective in both discrete and continuous domains, providing a solid comparative baseline for the proposed active inference method.

5.4. Environmental Configuration and Resource Constraints

The experimental environment was configured to align with the maximum demands identified during the training phase, with the algorithm's efficacy assessed under fluctuating workloads during testing.

  • LLM Model: GPT-J-6B was used, consisting of 28 layers, with a model dimension ($d_{model}$) of 4096 and a feedforward dimension ($d_{forward}$) of 16384. It included $n_{heads} = 16$ attention heads, each with a dimension ($d_{head}$) of 256. Rotary Positional Embedding (RoPE) was utilized for $d_{RoPE} = 64$ dimensions per head. The model was trained using a tokenization vocabulary of $n_{vocab} = 50257$, employing the same Byte Pair Encoding (BPE) scheme as GPT-2 and GPT-3 [37].
  • Edge Server: Featured an NVIDIA 3090 GPU. For LLM inference, it used the GPT-J-6B model devoid of any acceleration methods.
  • Cloud Server: Leveraged Triton server to enhance the inference performance of the GPT-J-6B model.
  • Resource Disparity: Consequently, the computation times ($L_c$) and the number of passes (pass@100) required to offload the GPT-J-6B task differed significantly between the cloud and edge due to their disparate hardware capabilities and optimization frameworks.
  • Simulation Parameters:
    • Maximum time steps ($t_{max}$): Ranged from 1 to 15 seconds (in 1-second intervals) for the latency variation analysis. For training, it was set to 15 seconds.
    • Number of tasks ($n_{tasks}$): Set to 100 for both training and the latency variation analysis.
    • Cloud-to-Edge Server Resource Ratio: 1:4 (This ratio significantly impacts offloading decisions, especially for latency-sensitive tasks).

6. Results & Analysis

6.1. Core Results Analysis

The simulation results are presented through benchmarking against prominent DRL algorithms (Rainbow DQN, PPO, and SAC) across two main phases: training phase performance and latency variation analysis.

6.1.1. Training Phase Performance

This section compares the performance of the proposed method with existing ones during the training phase, with $t_{max} = 15$ seconds and $n_{tasks} = 100$.

The following figure (Figure 3 from the original paper) shows training phase performance metrics:

Fig. 3. Benchmarking Prominent DRLs During Training. (The chart compares the training-phase performance of the methods across four metrics: cumulative reward, task completion rate, mean latency, and mean pass rate; the methods shown are AI w/RewardlessGuidance, Rainbow DQN, SAC, and PPO.)

  • Convergence Speed and Total Reward (Figure 3a):

    • The proposed active inference method with rewardless guidance (AI w/RewardlessGuidance) demonstrates superior convergence speed, achieving stable performance around episode 200.
    • Initially (first 50 episodes), it underperforms due to environmental complexity, but quickly surpasses Rainbow DQN, which converges rapidly but then plateaus.
    • SAC lags significantly, while PPO shows slow initial convergence but eventually matches Rainbow DQN's performance, though still outperformed by the proposed method.
    • This indicates that the active inference approach learns to effectively manage LLM offloading faster and reaches a higher performance level compared to traditional DRL techniques.
  • Task Completion Rate (Figure 3b):

    • The proposed method achieves a remarkably high task completion rate of approximately 99%.
    • This significantly exceeds Rainbow DQN's 85%, PPO's 90%, and SAC's 80%.
    • This highlights the method's ability to reliably execute a large proportion of tasks without compromising individual task requirements, which is critical for LLM services.
  • Mean Latency (Figure 3c):

    • The proposed method exhibits an average task completion latency of about 8 seconds, outperforming SAC (around 10 seconds).
    • It is slightly lower than both Rainbow DQN and PPO.
    • The shaded regions in the graph represent the variability (standard deviation) of average latency during training. These regions are larger for mainstream DRL strategies, reflecting greater instability and fluctuation in their latency performance. The AI w/RewardlessGuidance shows much tighter variability, underscoring its stability and importance for low-latency IoT applications in dynamic environments.
  • Mean Pass@100 (Figure 3d):

    • The proposed method achieves an average pass@100 of approximately 0.175, surpassing Rainbow DQN and PPO's 0.15, and SAC's 0.14.
    • This suggests that the active inference method tends to offload tasks more frequently to high-accuracy edge nodes or makes better decisions leading to higher quality LLM outputs.
    • While mainstream DRLs might prioritize low latency, the proposed method effectively balances both latency and accuracy goals.
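
As a reminder of how this metric is typically computed, the sketch below shows the standard unbiased pass@k estimator popularized by the HumanEval benchmark. The paper's exact sampling protocol, in particular the number of completions generated per task, is not stated in this excerpt, so the function name and the counts used in the example are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: expected probability that at least one of k samples
    drawn from n generated completions (c of which are correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task correct counts out of n = 200 generated completions.
correct_counts = [0, 0, 5, 0, 1, 0, 0, 12, 0, 0]
mean_pass_at_100 = sum(pass_at_k(200, c, 100) for c in correct_counts) / len(correct_counts)
print(f"mean pass@100 = {mean_pass_at_100:.3f}")
```

Averaged over the offloaded tasks, this is presumably what the "mean pass@100" curves in Figures 3d and 4d report.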

6.1.2. Latency Variation Analysis

This section evaluates algorithm performance under varying maximum latency thresholds ($t_{max}$), ranging from 1 to 15 seconds (in 1-second intervals), with $n_{tasks} = 100$.

The following figure (Figure 4 from the original paper) shows a benchmark comparison of several prominent deep reinforcement learning algorithms under different maximum latency thresholds:

Fig. 4. Benchmarking Prominent DRLs Under Different Maximum Latency Thresholds. The figure benchmarks the evaluated algorithms under different maximum latency thresholds across four metrics: total reward, task completion rate, mean latency, and mean pass@100.

  • Total Reward (Figure 4a):

    • The Cloud and MEC (presumably separate baselines or configurations, though not explicitly detailed in the text accompanying the figure) yield the highest reward when $t_{max} \geq 10$ seconds. This suggests that with sufficient time, centralized or closer-to-edge resources can maximize rewards.
    • The proposed AI w/RewardlessGuidance consistently shows higher rewards than the DRL baselines across various $t_{max}$ values, especially as $t_{max}$ increases.
  • Task Completion Rate (Figure 4b):

    • No algorithm can complete tasks when $t_{max} \le 2$ seconds. This is attributed to the combined inference and transmission times unavoidably exceeding this threshold, highlighting a fundamental physical limitation of the system (see the feasibility sketch after this list).
    • Within the range $3 \leq t_{max} \leq 9$ seconds, the proposed method and Rainbow DQN maintain a task completion rate above 20%. In contrast, PPO and SAC drop below this rate, indicating poorer performance in moderately constrained latency scenarios.
    • The text notes that the minimum edge-side inference time exceeds 9 seconds once edge hardware constraints and wireless transmission delay are accounted for, so within this range tasks can only be offloaded to the cloud (consistent with the 1:4 cloud-to-edge resource ratio).
    • When $t_{max} \geq 10$ seconds, the proposed method achieves near-100% task completion, demonstrating its robustness under less stringent latency constraints.
  • Mean Latency (Figure 4c):

    • The figure shows a declining curve for mean latency as $t_{max}$ increases, particularly noticeable as cloud offloading becomes possible. This implies that greater tolerance for delay allows tasks to be processed by more powerful (but potentially more distant) cloud resources, thereby reducing the average processing time.
    • The proposed method consistently maintains lower mean latency across different $t_{max}$ values compared to the DRL baselines.
  • Mean Pass@100 (Figure 4d):

    • When $t_{max} \le 2$ seconds, the average pass@100 remains near zero for all algorithms, corresponding to the inability to complete tasks.

    • For $3 \leq t_{max} \leq 9$ seconds, the average pass@100 stays below 0.05 for all algorithms, indicating that even completed tasks during this constrained period have low accuracy.

    • However, when $t_{max} \geq 10$ seconds, AI w/RewardlessGuidance achieves an average pass@100 of approximately 0.15, significantly outperforming Rainbow DQN and PPO (around 0.12) and SAC (about 0.1). This suggests that given sufficient latency tolerance, the proposed method optimizes for higher-quality LLM outputs.

      Overall, the active inference method with rewardless guidance consistently demonstrates superior performance across all evaluated metrics and varying latency thresholds. This indicates its strong adaptability to diverse latency demands, which is a key requirement for efficient task offloading and execution in IoT environments.
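
To make the latency-threshold behavior concrete, the sketch below implements the simple feasibility check implied by this analysis: a processing site can serve a task only if its inference time plus the transmission delay needed to reach it fits within $t_{max}$. The numerical values are hypothetical and chosen only to match the qualitative pattern described above (nothing is feasible below roughly 3 seconds, only the cloud is feasible up to about 9 seconds, and the edge becomes usable beyond that).

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    inference_time_s: float     # time to run GPT-J-6B inference at this site
    transmission_time_s: float  # delay to ship the request and result over the network

def feasible_sites(sites: list[Site], t_max: float) -> list[str]:
    """Sites that can return a result within the latency budget t_max."""
    return [s.name for s in sites if s.inference_time_s + s.transmission_time_s <= t_max]

# Hypothetical timings consistent with the qualitative description above.
sites = [
    Site("cloud", inference_time_s=2.0, transmission_time_s=1.0),
    Site("edge",  inference_time_s=9.5, transmission_time_s=0.3),
]
for t_max in (2, 5, 9, 12):
    print(t_max, feasible_sites(sites, t_max))
# -> 2: [], 5: ['cloud'], 9: ['cloud'], 12: ['cloud', 'edge']
```

In the actual system the offloading agent must additionally account for server-side queueing and the 1:4 cloud-to-edge resource ratio, which this sketch deliberately omits.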

6.2. Data Presentation (Tables)

The provided text excerpt does not include any tables of experimental results. All results are presented graphically in Figure 3 and Figure 4.

6.3. Ablation Studies / Parameter Analysis

The paper presents an analysis of performance under varying maximum latency thresholds ($t_{max}$), which can be considered a form of parameter analysis or sensitivity analysis rather than a formal ablation study. This analysis (Section V.C) investigates how the algorithms perform as $t_{max}$ changes, revealing their robustness and adaptability to different latency requirements. The results discussed in Section 6.1.2 detail how the algorithms' total reward, task completion rate, mean latency, and mean pass@100 are affected by this crucial parameter, showing the proposed method's superior performance across a wide range of $t_{max}$ values. No explicit ablation studies (where specific components of the proposed method are removed to assess their individual contribution) are described in the provided text.

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully proposed an active inference approach utilizing rewardless guidance to address the critical challenge of resource scarcity for Large Language Models (LLMs) during inference in IoT cloud-edge computing environments. The authors developed a robust cloud-edge network system designed for efficient handling of LLM inference tasks, from request initiation at terminals to result delivery. Extensive simulations empirically validated the effectiveness of the proposed method. It was shown to outperform conventional Deep Reinforcement Learning (DRL) techniques in several key aspects: training convergence speed, tolerance to maximum latency thresholds during testing, and overall task load management. The research highlights the potential of active inference to create more adaptable and robust LLM strategies suitable for the demands of the 6G era and the Symbiotic Internet-of-Things (IoT).

7.2. Limitations & Future Work

The authors identified several directions for future work:

  • Complex Scenarios: Extending the approach to more complex scenarios, such as dynamic network topologies (where the network structure changes over time) and multi-agent environments (where multiple intelligent agents interact).

  • Broader Terminal Devices: Broadening the range of terminal devices considered, implying greater diversity in their capabilities and mobility patterns.

  • Distributed Computing: Exploring distributed computing scenarios where computation is spread across many interconnected devices.

  • Advanced Network Systems: Leveraging advanced network systems for resource scheduling, specifically mentioning space-air-ground integrated networks (SAGIN), which involve satellites, aerial platforms (like UAVs), and ground stations.

  • Algorithmic Performance Enhancement: Focusing on further enhancing the algorithmic performance of their proposed active inference method.

    While not explicitly stated as limitations, the need for these future directions implies current limitations in handling such complexities, in the range of device types considered, and in integration with advanced heterogeneous networks.

7.3. Personal Insights & Critique

This paper presents a compelling argument for moving beyond traditional DRL in the context of LLM offloading within 6G Symbiotic IoT, leveraging active inference with rewardless guidance.

Inspirations:

  • Bridging Neuroscience and AI: The use of active inference, a framework rooted in neuroscience, for a practical engineering problem like LLM offloading is highly inspiring. It suggests that biologically plausible models of intelligence can offer novel solutions to complex AI and networking challenges, especially in dynamic, uncertain environments.
  • Robustness in Dynamic Environments: The concept of rewardless guidance is particularly insightful. By moving away from brittle, hand-crafted reward functions, the agent inherently seeks better predictive models of its environment, which naturally leads to robust and adaptive behavior. This could be highly beneficial for IoT scenarios where environmental conditions (e.g., network load, device mobility) are constantly changing and unpredictable.
  • 6G Potential: The paper effectively highlights how 6G's promised capabilities (low latency, high data rates) can be synergistically combined with intelligent offloading strategies. This could pave the way for truly omnipresent and responsive AI services, even on resource-constrained devices.

Potential Issues & Areas for Improvement:

  • Complexity of Active Inference: While rewardless guidance simplifies the reward engineering problem, active inference itself can be computationally complex due to its probabilistic nature and the need to maintain an internal generative model. The paper briefly mentions learnable parameters ($\theta$) for the generative model but doesn't elaborate on the specific architecture or training overhead. A more detailed discussion on the computational cost and complexity of the active inference agent itself, especially for real-time deployment on actual edge devices or within 6G infrastructure, would strengthen the argument.

  • Real-world Deployment Challenges: The simulations are thorough, but translating these results to real-world 6G Symbiotic IoT deployments will involve significant challenges. Factors like channel fading, intermittent connectivity, device heterogeneity (beyond mobile/fixed categories), and security vulnerabilities are crucial. While the problem formulation includes price and risk, their integration into the active inference mechanism itself needs further exploration beyond just constraints.

  • Interpretability of Rewardless Guidance: The rewardless guidance function $rg(s_t, a_t) = tc \left( \frac{1}{L_{T_t}} + P_{T_t} \right)$ is a clever way to encode desired outcomes. However, the balance between minimizing latency and maximizing accuracy (the two terms in the sum) is set implicitly. Future work could explore how this balance can be dynamically adjusted or learned based on application-specific priorities, or whether a more nuanced utility function could be incorporated directly into the free energy minimization objective without becoming an explicit reward function (a small sketch of this idea follows this list).

  • Comparison to Other AI Paradigms: While DRL is a strong baseline, comparing active inference against other AI paradigms for resource management, such as federated learning or multi-agent systems that are not strictly DRL (beyond DRL+FL in references), could provide a broader perspective on its advantages and limitations.

    Overall, this paper provides a valuable contribution to the field of LLM deployment in future networks, offering a fresh perspective based on active inference. Its rigorous simulation results demonstrate promising performance, setting a strong foundation for future research in this critical area.
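
To make the balance concern above concrete, the sketch below re-expresses the quoted guidance signal alongside a hypothetical weighted variant. The variable `tc` stands for the coefficient appearing in the quoted formula (interpreted here as a task-completion indicator), and `alpha` is an illustrative trade-off knob that is not part of the original paper.

```python
def rewardless_guidance(tc: float, latency: float, pass_rate: float) -> float:
    """Guidance signal rg(s_t, a_t) = tc * (1 / L_{T_t} + P_{T_t}) as quoted above."""
    return tc * (1.0 / latency + pass_rate)

def weighted_guidance(tc: float, latency: float, pass_rate: float, alpha: float = 0.5) -> float:
    """Hypothetical variant with an explicit latency/accuracy weight alpha,
    illustrating the dynamic-balancing suggestion; not from the original paper."""
    return tc * (alpha / latency + (1.0 - alpha) * pass_rate)

# Example: a completed task (tc = 1) finishing in 8 s with pass rate 0.175.
print(rewardless_guidance(1.0, 8.0, 0.175))      # 0.300
print(weighted_guidance(1.0, 8.0, 0.175, 0.8))   # 0.135
```

Because `alpha` enters only through the guidance signal, such a knob could in principle be scheduled per application class without reintroducing a hand-crafted scalar reward, which is the direction the critique above suggests.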
