Large Language Model Offloading using Active Inference in 6G Symbiotic IoT
TL;DR Summary
This paper presents an active inference-based offloading method for large language models in 6G symbiotic IoT, optimizing resource scheduling and computation through cloud-edge collaboration for enhanced system efficiency and intelligent inference services.
Abstract
The increasing demand for Large Language Model (LLM) applications in mobile computing poses a challenge for devices with limited resources, as they struggle to efficiently handle complex inference tasks. Deep Reinforcement Learning (DRL), traditionally used to offload such tasks to remote servers, exhibits notable limitations, such as data inefficiency, latency insensitivity, and poor adaptability to variable workloads, thereby adversely impacting the performance of LLMs. We present an approach based on active inference for LLM task offloading and cloud-edge computing resource scheduling, especially relevant to emerging 6G networks. These networks are designed to provide enhanced connectivity, reduced latency, and increased data rates. Our approach capitalizes on these strengths to optimize task distribution and maximize resource utilization, fostering a symbiotic relationship between devices and networks. Simulations demonstrate that our method outperforms standard DRL by enhancing data efficiency and better adapting to varying loads, aligning with 6G's emphasis on flexible and responsive networks. By integrating active inference into cloud-edge systems, we develop a more robust and adaptable LLM strategy that is well-suited for the 6G era, promoting a Symbiotic Internet-of-Things (IoT) where devices and networks dynamically collaborate and share resources to fulfill the requirements of advanced applications.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Large Language Model Offloading using Active Inference in 6G Symbiotic IoT
1.2. Authors
- Xiaoming He, Member, IEEE (College of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China)
- Yunzhe Jiang (College of Communication and Information Engineering, University of Electronic Science and Technology of China, Chengdu, China)
- Xiaoming Xu (Beijing KOAL Guoxin Technology Co., Ltd.)
- Huajun Cui (Digital Intelligence Research Institute, PowerChina, Beijing Engineering Corporation Limited, Beijing, China)
- Yinqiu Liu, Member, IEEE (College of Computing and Data Science, Nanyang Technological University, Singapore)
- Mingkai Chen, Member, IEEE (Key Laboratory of Broadband Wireless Communication and Sensor Network Technology, Nanjing University of Posts and Telecommunications, China)
- Yan Hong, Member, IEEE (College of Textile and Clothing Engineering, Soochow University, Suzhou 215021, China; Corresponding Author)
- Jie Zhang, Member, IEEE (Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China)
1.3. Journal/Conference
The specific journal or conference for this paper is not explicitly stated in the provided excerpt. However, based on the authors' affiliations and the referencing style (e.g., IEEE member status, IEEE conference/journal citations in the references), it is highly probable that this work is submitted to or published in an IEEE-affiliated journal or conference, particularly given that reference [12] is "IEEE Transactions on Mobile Computing" and reference [21] is "IEEE Vehicular Technology Conference (VTC)".
1.4. Publication Year
The publication year is not explicitly stated in the provided header or footer. However, the references largely consist of recent works from 2023 and 2024. For instance, reference [12] is from December 2024 and reference [21] is from October 2023, both with very similar titles and some shared authors. This suggests the paper is very recent, likely published or submitted in late 2023 or 2024.
1.5. Abstract
The increasing demand for Large Language Model (LLM) applications in mobile computing poses a challenge for devices with limited resources, as they struggle to efficiently handle complex inference tasks. Deep Reinforcement Learning (DRL), traditionally used to offload such tasks to remote servers, exhibits notable limitations, such as data inefficiency, latency insensitivity, and poor adaptability to variable workloads, thereby adversely impacting the performance of LLMs. We present an approach based on active inference for LLM task offloading and cloud-edge computing resource scheduling, especially relevant to emerging 6G networks. These networks are designed to provide enhanced connectivity, reduced latency, and increased data rates. Our approach capitalizes on these strengths to optimize task distribution and maximize resource utilization, fostering a symbiotic relationship between devices and networks. Simulations demonstrate that our method outperforms standard DRL by enhancing data efficiency and better adapting to varying loads, aligning with 6G's emphasis on flexible and responsive networks. By integrating active inference into cloud-edge systems, we develop a more robust and adaptable LLM strategy that is well-suited for the 6G era, promoting a Symbiotic Internet-of-Things (IoT) where devices and networks dynamically collaborate and share resources to fulfill the requirements of advanced applications.
1.6. Original Source Link
/files/papers/69007e63ed47de95d44a3483/paper.pdf (This is a PDF link to the paper, likely a preprint or an internal file reference.)
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the efficient and robust deployment of Large Language Model (LLM) applications in resource-constrained mobile computing environments, particularly within the context of emerging 6G Symbiotic IoT networks.
This problem is crucial because LLMs, despite their powerful capabilities, demand significant computational and memory resources, making their direct execution on mobile and IoT devices challenging. Traditional Deep Reinforcement Learning (DRL) methods, often used for task offloading to remote servers, suffer from several limitations:
- Data inefficiency: They require extensive data for training.
- Latency insensitivity: They may not adequately optimize for real-time responsiveness.
- Poor adaptability to variable workloads: Their performance degrades in dynamic environments with fluctuating demands.
These limitations directly impact the performance of LLMs in mobile settings, creating a gap in effective offloading and resource scheduling strategies.
The paper's entry point is to leverage active inference as an alternative to traditional DRL for LLM task offloading and cloud-edge computing resource scheduling. This is especially pertinent to 6G networks, which promise enhanced connectivity, reduced latency, and higher data rates, offering an opportunity to optimize LLM deployment by fostering a symbiotic relationship between devices and network infrastructure. The innovative idea is to use a rewardless guidance mechanism within active inference to overcome the shortcomings of DRL, promoting more adaptive and efficient LLM operations.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Comprehensive System Model and Mathematical Formulation: The study presents a detailed system model and mathematical formulation specifically tailored to the GPT-J-6B LLM. The model is grounded in empirical data obtained from a server cluster environment and covers both the training and inference stages of the LLM lifecycle, providing a solid analytical foundation.
- Innovative Active Inference Framework: Capitalizing on recent advancements in active inference approaches, the paper introduces a novel framework designed to address the complexities of inference task delegation and resource distribution for LLMs. This approach is claimed to surpass conventional DRL techniques in terms of convergence and generalization capabilities.
- Enhanced Performance in Simulations: Through rigorous simulation analysis, the authors demonstrate that the proposed framework yields a strategy with enhanced convergence characteristics. It also outperforms mainstream DRL algorithms on LLM inference tasks, showing superior data efficiency and better adaptation to varying loads. These findings align with the 6G emphasis on flexible and responsive networks.
- Robust and Adaptable LLM Strategy for 6G Symbiotic IoT: By integrating active inference into cloud-edge systems, the research develops a more robust and adaptable LLM strategy. This strategy is specifically suited to the 6G era, promoting a Symbiotic Internet-of-Things (IoT) where devices and networks dynamically collaborate and share resources to fulfill the requirements of advanced applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following key concepts:
- Large Language Models (LLMs): Advanced artificial intelligence models trained on vast amounts of text data to understand, generate, and process human language. They typically have billions of parameters, allowing them to perform complex tasks like translation, summarization, question answering, and content creation (e.g., the GPT series, including ChatGPT). Their large size, however, makes them computationally intensive, especially during inference (the process of using a trained model to make predictions or generate outputs).
- 6G Networks: The sixth generation of wireless communication technology, envisioned to succeed 5G. It aims to provide even higher data rates, ultra-low latency, massive connectivity, and enhanced intelligence, enabling new applications like holographic communication, omnipresent AI, and Symbiotic IoT.
- Symbiotic Internet-of-Things (IoT): In IoT, physical devices, vehicles, home appliances, and other items are embedded with sensors, software, and other technologies to connect and exchange data over the internet. A Symbiotic IoT extends this by envisioning a highly collaborative ecosystem where IoT devices and network infrastructure dynamically interact, share resources, and adapt to each other's needs to achieve collective intelligence and efficiency.
- Cloud-Edge Computing: A distributed computing paradigm that combines cloud computing (centralized data centers with vast resources) with edge computing (computation performed closer to the data source, such as IoT devices or edge servers). The goal is to reduce latency, save bandwidth, and improve responsiveness by processing data locally at the network edge rather than sending everything to a distant cloud.
  - Cloud Server (CS): A powerful, centralized server located in a data center, offering extensive computational resources.
  - Multi-Access Edge Computing (MEC) Server: A server located closer to end-users (e.g., at a cellular base station), providing localized computation and storage to reduce latency.
- Deep Reinforcement Learning (DRL): A subfield of machine learning that combines reinforcement learning (where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward) with deep learning (using neural networks to learn representations from data). DRL agents learn optimal policies through trial and error, often in complex environments.
- Active Inference: A theoretical framework from neuroscience and machine learning which posits that intelligent agents minimize free energy (a measure of surprise or prediction error) to maintain their internal model of the world and act purposefully. Unlike DRL, which relies on external reward signals, active inference agents intrinsically seek to improve their predictive models and reduce uncertainty about their environment, leading to goal-oriented behavior. In this paper, it is used with rewardless guidance, meaning it does not need explicit, hand-crafted reward functions.
- Partially Observable Markov Decision Process (POMDP): A mathematical framework for modeling decision-making when the dynamics follow a Markov process (future states depend only on the current state, not on the sequence of events that preceded it) but the agent cannot directly observe the underlying state of the environment. Instead, it receives observations that are probabilistically related to the state. This is more realistic for many real-world scenarios than a fully observable Markov Decision Process (MDP).
- Free Energy Principle: In active inference, this principle states that any self-organizing system at equilibrium with its environment must minimize its variational free energy. This is achieved by updating the internal model to predict sensory inputs more accurately and by acting on the environment to make sensory inputs more consistent with the model's predictions. It essentially frames cognitive and biological processes as attempts to minimize long-term surprise.
- Kullback-Leibler (KL) Divergence ($D_{KL}$): A measure from information theory that quantifies how one probability distribution differs from a second, reference distribution. A KL divergence of zero means the two distributions are identical. It is often used in machine learning to measure the difference between a model's predicted distribution and the true distribution.
3.2. Previous Works
The paper contextualizes its work by referencing previous research in Large Language Models (LLMs) and Deep Reinforcement Learning (DRL).
3.2.1. Large Language Models (LLMs)
LLMs are characterized by their extensive parameter sets and computational demands. Prior work has focused on making LLM inference more efficient, especially on resource-constrained devices:
- Device-side inference engines: Xu et al. [22] proposed LLMCad for efficient execution of privacy-sensitive generative tasks in mobile applications. Yi et al. [23] introduced EdgeMoE, an on-device inference engine for hybrid expert-based LLMs, tackling the issues of large parameter scale and high runtime cost on edge devices.
- Security in IoT with LLMs: Ferrag et al. [24] proposed SecurityBERT, a BERT-based architecture to identify cyber threats in IoT networks, showcasing LLM applications beyond natural language processing.
- LLM Inference Offloading: He et al. [12] (one of the authors of the current paper, or a very closely related work) and Fang et al. [21] (also closely related) explored LLM inference offloading and resource allocation in cloud-edge computing using an active inference approach, directly motivating the current study.
3.2.2. Decision Making with Deep Reinforcement Learning (DRL)
DRL algorithms combine deep neural networks and reinforcement learning to address complex decision and control problems by learning optimal strategies through environmental interactions, without relying on prior knowledge.
- Resource Management in IoT: Liu et al. [31] integrated wireless power transfer to address the limited battery capacity and low computing power of IoT nodes, offloading computational tasks to edge computing servers.
- Vehicular Networks: Zhang et al. [32] proposed an urban vehicle-mounted, cloud-assisted MEC network for computing offloading in dynamic traffic environments.
- Resource Allocation: Wang et al. [33] introduced a Federated Learning-based intelligent resource allocation model to address communication congestion and degradation of the quality of user experience (QoE).
- General DRL Applications: The paper also references broader DRL applications in gaming, robotics, and general resource management [13-15], acknowledging its success but also highlighting its limitations regarding reward functions [16-17] and adaptability in dynamic IoT environments [18-20].
3.3. Technological Evolution
The evolution of LLM deployment in IoT and 6G contexts can be traced through:
- Early LLM Development: Initially, LLMs were large, monolithic models requiring significant data center resources for both training and inference (e.g., the GPT family mentioned earlier).
- Edge/Mobile Optimization: The challenge of deploying LLMs on resource-constrained mobile and edge devices led to specialized inference engines and optimization techniques (e.g., LLMCad, EdgeMoE).
- Reinforcement Learning for Offloading: DRL emerged as a promising approach for dynamic task offloading and resource allocation in MEC and IoT to improve efficiency.
- Limitations of Traditional DRL: Despite successes, DRL faced issues like data inefficiency and rigid reward functions, particularly in highly dynamic IoT environments.
- Emergence of 6G: The promise of 6G (enhanced connectivity, lower latency, higher data rates) created new opportunities and challenges for LLM integration, demanding more responsive and adaptable offloading strategies.
- Active Inference for Enhanced Adaptability: This paper fits into the timeline by proposing active inference with rewardless guidance as the next step, aiming to overcome DRL limitations and fully leverage 6G capabilities for Symbiotic IoT. It represents an evolution towards more biologically plausible and adaptable decision-making for resource management.
3.4. Differentiation Analysis
The core innovation of this paper, compared to the main methods in related work, lies in its use of active inference with a rewardless guidance mechanism, distinguishing it from traditional Deep Reinforcement Learning (DRL) approaches.
- DRL's Limitations: Traditional DRL (as seen in the Rainbow DQN, PPO, and SAC baselines) relies heavily on explicitly defined reward functions. These functions guide the agent's learning process by providing numerical feedback for actions. However, designing effective reward functions is challenging, often leading to suboptimal generalizability and data inefficiency, especially in dynamic and unpredictable environments like Symbiotic IoT. DRL can also be latency-insensitive and poorly adaptable to variable workloads.
- Active Inference with Rewardless Guidance: The proposed method replaces these traditional reward models. Instead of learning to maximize an external reward, an agent using active inference with rewardless guidance intrinsically seeks to improve its internal predictive model of the environment and minimize its variational free energy (or "surprise"). This means:
  - Intrinsic Motivation: The agent's learning is driven by an internal drive to understand its surroundings and anticipate future states, rather than chasing external rewards.
  - Enhanced Generalization: By developing a more sophisticated understanding of the environment without rigid reward functions, the agent can generalize better to unseen or dynamic conditions.
  - Adaptability: The rewardless guidance mechanism directly incorporates task completion, latency, and prediction accuracy, allowing the agent to dynamically prioritize these factors without needing a predefined reward landscape. This leads to more context-aware decisions.
  - Convergence and Efficiency: The paper claims this approach offers superior convergence and data efficiency compared to DRL, as it leverages the fundamental principles of active inference for more robust learning.

In essence, while DRL is about learning what to do to get rewards, this active inference approach is about learning how the world works and then acting to make sensory inputs consistent with its predictions, which implicitly achieves desired outcomes like minimal latency and high accuracy.
4. Methodology
4.1. Principles
The core idea behind the proposed method is to leverage active inference for LLM task offloading and resource allocation within cloud-edge networks, particularly for 6G Symbiotic IoT environments. The theoretical basis is that intelligent agents can optimize their behavior by minimizing variational free energy, which acts as an intrinsic drive to improve their internal models of the world and reduce prediction errors.
Instead of relying on explicit reward functions—a common challenge in Deep Reinforcement Learning (DRL)—this approach uses a rewardless guidance mechanism. This mechanism guides the agent towards states that naturally minimize latency and maximize task success and prediction accuracy, aligning with the operational goals of LLM services in IoT. By dynamically adapting to the environment and its own predictions, the agent aims to achieve robust and efficient offloading and resource scheduling in highly dynamic 6G settings.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. System Configuration
The system operates within a multi-LLM user environment that interacts with both a cloud computing hub and edge computing servers.
- Cloud Server (CS): Equipped with substantial and robust computing resources, capable of accommodating the computational demands of LLM inference tasks.
- Multi-Access Edge Computing (MEC) Centers: Located close to users and designed to efficiently offload LLM inference tasks by offering reduced latency due to proximity.
- Terminal Devices (Dev): A diverse array of devices initiating LLM inference task requests.
  - Mobile terminals: devices with dynamic locations, such as smartphones, drones, and connected vehicles.
  - Fixed terminals: devices with static locations, such as personal computers and workstations.
  - Collectively, these are represented as the set $Dev = \{dev_1, dev_2, \ldots, dev_N\}$, where $N$ is the total number of devices.
- Servers (Ser): The MEC servers and the CS are collectively denoted as $Ser = \{ser_1, ser_2, \ldots, ser_M\}$, where $M$ is the total number of servers.
- Task Request: At any given time $t$, a terminal generates a random offload request for an LLM task. A decision algorithm then delegates the task to a server and allocates the necessary network, computational, and graphics memory resources.

The following figure (Figure 1 from the original paper) shows the system architecture:

Figure 1: The cloud-edge framework for LLM offloading, illustrating the symbiotic IoT system under a collaborative architecture and the workflow of environment-state acquisition, offloading decisions, and policy optimization.
4.2.2. Task Model Formulation
This section details the inference process for LLM tasks, using the GPT-J-6B model as a representative case. This model has 6 billion parameters and a layered encoder-decoder structure.
The execution of inference within the GPT-J-6B model is broken down into sequential stages:
- Input Encoding: The input text is converted into a suitable format by a tokenizer. The resulting sequence is denoted as $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ is the $i$-th token or vocabulary element.
- Vector Embedding: The sequence $X$ is mapped into a vector space by an embedding layer, yielding the vector sequence $E = (e_1, e_2, \ldots, e_n)$, where $e_i$ represents the $i$-th vector in embedding space.
- Positional Embedding: A positional encoding is applied to the embedding vectors to incorporate the order of sequence elements, resulting in the positionally encoded sequence $PE$, obtained from $E$ and the position encoding matrix $P$.

Following these initial steps, the detailed mechanics of the attention and multi-head self-attention mechanisms are elucidated.
- Query, Key, and Value Matrices: The query matrix $Q$, key matrix $K$, and value matrix $V$ are defined as $Q = PE \cdot W^{Q}$, $K = PE \cdot W^{K}$, and $V = PE \cdot W^{V}$, where $W^{Q}$, $W^{K}$, and $W^{V}$ are parameter matrices (learnable weights) and $PE$ is the positionally encoded sequence.
- Attention Layer: The attention layer computes the attention scores and applies them to the values:
  $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V $
  where:
  - $Q$, $K$, $V$: Query, Key, and Value matrices.
  - $QK^{T}$: Dot product between the Query and Key matrices, measuring similarity.
  - $\sqrt{d_k}$: Scaling factor, where $d_k$ is the dimension of the key vectors. This prevents the dot products from growing too large, which would push the softmax into regions with tiny gradients.
  - $\mathrm{softmax}$: Normalization function that converts raw scores into probabilities, ensuring they sum to 1.
  - $V$: Value matrix, whose rows are weighted and summed based on the attention scores.
- Multi-Head Self-Attention Layer: The multi-head layer performs multiple attention computations in parallel (heads) and then concatenates their results:
  $ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) $
  where:
  - $\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)$: The concatenation of the outputs of the different attention heads.
  - $W^{O}$: A linear projection matrix that combines the concatenated outputs of the attention heads.
  - $\mathrm{head}_i$: The output of the $i$-th attention head.
  - $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$: Parameter matrices specific to the $i$-th head, projecting the inputs $Q$, $K$, $V$ into different subspaces. $d_{model}$ is the model dimension, and $d_q$, $d_k$, $d_v$ are the dimensions of the query, key, and value vectors for a single head.

  The output of the self-attention layer, denoted as $Z$, is derived from the multi-head self-attention computation. This procedure is iterated across multiple layers, corresponding to the model's depth.
- Feedforward Neural Network: The subsequent stage involves a feedforward neural network. The attention mechanism's output $Z$ serves as input to a multilayer perceptron (MLP) to determine the subsequent output, where the MLP encapsulates linear mappings and non-linear activation functions.
- Task Request and Response: The inference occurs on a server $ser_n$. A task request from a terminal is dispatched as a packet of size $S_{task}$ to the server. Upon processing, the server returns a packet of size $S_{res}$ containing the predicted text to the terminal.
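To make the attention pipeline above concrete, here is a minimal NumPy sketch of scaled dot-product attention and one multi-head layer. It is an illustrative reconstruction of the standard Transformer operations described in this section, not the authors' GPT-J-6B implementation; the dimensions in the toy example are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(PE, W_q, W_k, W_v, W_o, num_heads):
    # PE: positionally encoded sequence, shape (seq_len, d_model).
    # W_q/W_k/W_v/W_o: (d_model, d_model) projection matrices.
    seq_len, d_model = PE.shape
    d_head = d_model // num_heads
    Q, K, V = PE @ W_q, PE @ W_k, PE @ W_v
    # Split each projection into heads: (num_heads, seq_len, d_head).
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    heads = attention(split(Q), split(K), split(V))
    # Concatenate heads back to (seq_len, d_model) and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy usage (GPT-J-6B itself uses d_model = 4096 and 16 heads).
rng = np.random.default_rng(0)
PE = rng.normal(size=(8, 64))                      # 8 tokens, d_model = 64
W = [rng.normal(size=(64, 64)) * 0.1 for _ in range(4)]
Z = multi_head_attention(PE, *W, num_heads=4)
print(Z.shape)                                     # (8, 64)
```

In the full model, the output `Z` would then pass through the MLP block, and the whole attention-plus-MLP stack repeats once per layer.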
4.2.3. Terminal Mobility Model
The system model categorizes nodes into stationary and mobile.
- Stationary Nodes: MEC servers, the CS, and fixed terminals.
- Mobile Nodes: Mobile terminals such as internet-connected vehicles, smartphones, and unmanned aerial vehicles (UAVs). These move at a predetermined velocity, intermittently generating LLM inference task requests.

To determine the proximity of a mobile terminal to the computational endpoints (MEC and CS), the Euclidean distance metric is used. Given coordinates $(x_m, y_m)$ for the mobile terminal and $(x_e, y_e)$ for the endpoint, the distance is calculated with the standard Euclidean distance formula $d = \sqrt{(x_m - x_e)^2 + (y_m - y_e)^2}$.
The paper provides a detailed characterization of communication channels:
- Ground-to-Ground (G2G) Channels: Both transmitter and receiver are terrestrial devices. The path loss is modeled as a function of the distance between transmitter and receiver in kilometers.
- Ground-to-Air (G2A) Channels: One end is a ground device and the other is an aerial device. The path loss is parameterized by:
  - a path loss exponent, affected by environmental factors (building density, type, height, vegetation);
  - the Euclidean distance between sender and receiver; and
  - a constant determined by the operating frequency and antenna gain.
- Air-to-Air (A2A) Channels: Facilitate relay communication between UAVs. The path loss is parameterized by:
  - a path loss exponent, which can be relatively small for high-altitude UAVs operating in line-of-sight conditions; and
  - the aerial distance between the UAVs.
4.2.4. Communication Model
In the system model, a terminal initiates a task at time $t$. The decision algorithm offloads this request to a server, which processes it and returns the result to the terminal. Both offloading and result transmission depend on the quality of the wireless communication channel. The distance is assumed constant during an effective communication session.
Drawing from information theory, the data transmission rate $r$ is given by a Shannon-Hartley-type formula:
$ r = B \log_2\!\left(1 + \frac{p\,g}{N_0}\right) $
where:
- $B$: Communication channel bandwidth (in Hz).
- $p$: The device's transmission power (in Watts).
- $g$: The channel gain (unitless).
- $N_0$: The noise power due to thermal fluctuations within the channel (in Watts).

The channel gain is further defined based on the antenna gain, the path loss, and shadow fading, where:
- $G_a$: Antenna gain (in dB), unique to the receiving antenna.
- $PL$: Path loss (in dB), a characteristic of the channel (e.g., the G2G, G2A, or A2A models defined above).
- $\chi$: Shadow fading factor (in dB), commonly represented as a zero-mean Gaussian random variable; converting the dB quantities to the linear domain guarantees that the resulting gain is always positive. $G_a$ and $PL$ are typically treated as constants.
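A minimal sketch of how these pieces combine into a transmission rate. The Shannon-Hartley form follows the description above, but the log-distance path-loss shape and every numeric constant here (exponent, reference loss, shadowing variance) are illustrative assumptions, since the paper's exact channel coefficients are not reproduced in this excerpt.

```python
import math
import random

def path_loss_db(distance_m: float, exponent: float = 2.7, ref_loss_db: float = 40.0) -> float:
    # Assumed log-distance shape: PL = PL0 + 10 * eta * log10(d).
    # The paper defines separate G2G/G2A/A2A variants; only the exponent and
    # the constant would differ between them.
    return ref_loss_db + 10.0 * exponent * math.log10(max(distance_m, 1.0))

def channel_gain(antenna_gain_db: float, pl_db: float, shadow_sigma_db: float = 4.0) -> float:
    # Combine antenna gain, path loss, and zero-mean Gaussian shadow fading
    # (all in dB), then convert to a positive linear gain.
    shadow_db = random.gauss(0.0, shadow_sigma_db)
    gain_db = antenna_gain_db - pl_db - shadow_db
    return 10.0 ** (gain_db / 10.0)

def data_rate_bps(bandwidth_hz: float, tx_power_w: float, gain: float, noise_w: float) -> float:
    # Shannon-Hartley: r = B * log2(1 + p * g / N0).
    return bandwidth_hz * math.log2(1.0 + tx_power_w * gain / noise_w)

# Example: 20 MHz channel, 0.5 W transmitter, 500 m ground-to-ground link.
pl = path_loss_db(500.0)
g = channel_gain(antenna_gain_db=10.0, pl_db=pl)
print(f"rate = {data_rate_bps(20e6, 0.5, g, 1e-13) / 1e6:.1f} Mbit/s")
```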
4.2.5. Data Transfer Model
Data transfer is divided into two phases: task offloading and result return.
- Task Offloading Phase: The decision algorithm dispatches a task packet associated with the task from the terminal to the selected server. This phase encompasses four stages:
  - Transmission: Time taken to send the packet.
  - Propagation: Time taken for the signal to travel across the distance.
  - Queuing: Time spent waiting in the server's task queue.
  - Computation: Time spent by the server processing the task.

  The transmission latency is $S_{task}/r$, where $S_{task}$ is the task packet size and $r$ is the data transmission rate. The propagation latency is $d_1/c$, where $d_1$ is the distance between sender and receiver and $c$ is the speed of light. The queuing time ($t_{que}$) depends on the server's queue status and the remaining processing time of the tasks ahead. The task processing time ($t_{comp}$) depends on the server's computational capabilities and inference framework.
- Result Return Phase: The server returns a result packet to the terminal. This phase involves only transmission and propagation. The transmission latency is $S_{res}/r$ and the propagation latency is $d_2/c$, where $S_{res}$ is the result packet size and $d_2$ is the distance during result return (which may differ from $d_1$).
- Total Time Delay: The total time delay for a successfully offloaded task is computed as
  $ L_T = \frac{S_{task} + S_{res}}{r} + \frac{d_1 + d_2}{c} + t_{que} + t_{comp} $
  where:
  - $S_{task}$: Size of the task packet.
  - $S_{res}$: Size of the result packet.
  - $r$: Data transmission rate.
  - $d_1$: Distance during task offloading.
  - $d_2$: Distance during result return.
  - $c$: Speed of light.
  - $t_{que}$: Queuing latency.
  - $t_{comp}$: Computation latency.
- Delay Constraint: The maximum acceptable delay for all task requests is constrained by $t_{max}$. If $L_T > t_{max}$, the task is considered infeasible and abandoned.

The following figure (Figure 2 from the original paper) shows the correlation between CPU allocation and time cost:

Figure 2: Relationship between the CPU allocation ratio and time cost, comparing MEC and cloud execution. Time cost decreases for both as the CPU allocation increases, with the cloud showing a lower overall time cost than MEC.
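The total-delay bookkeeping above reduces to a short function. The symbol names (task/result packet sizes, offload and return distances) follow the notation used in this summary rather than notation confirmed from the paper.

```python
SPEED_OF_LIGHT = 3.0e8  # m/s

def total_delay(task_bits: float, result_bits: float, rate_bps: float,
                d_offload_m: float, d_return_m: float,
                t_queue_s: float, t_compute_s: float) -> float:
    # Transmission and propagation in both directions, plus queuing and computation.
    transmission = (task_bits + result_bits) / rate_bps
    propagation = (d_offload_m + d_return_m) / SPEED_OF_LIGHT
    return transmission + propagation + t_queue_s + t_compute_s

def is_feasible(delay_s: float, t_max_s: float) -> bool:
    # A task whose end-to-end delay exceeds t_max is abandoned.
    return delay_s <= t_max_s

# Example: 1 Mbit request and 0.2 Mbit response over a 50 Mbit/s link to an MEC server.
L_T = total_delay(1e6, 2e5, 50e6, 800.0, 800.0, t_queue_s=0.5, t_compute_s=6.0)
print(L_T, is_feasible(L_T, t_max_s=15.0))
```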
4.2.6. Problem Formulation
The primary objective is to find the most efficient strategy for delegating computationally demanding LLM inference operations to either edge or cloud computing infrastructures, considering resource limitations at the end-user level. The goal is to reduce the mean response time for all LLM inference jobs while enhancing the precision of the model's predictive outcomes.
- System Utility Function: The total system utility $U$ combines the latencies and prediction accuracies of all tasks: a term based on the average of the task latencies $\{L_T\}$, which raises the utility as average latency decreases, and a term based on the average of the prediction accuracies $\{P_T\}$, which raises the utility as average prediction accuracy increases.
- Optimization Problem: The ultimate objective is to maximize the overall system utility $U$, subject to the following constraints:
  - The residual bandwidth, computational, and graphics memory resources available at each MEC server and the CS at time $t$ must be non-negative.
  - Each task's total delay $L_T$ must not exceed the maximum acceptable delay $t_{max}$.
- Enhanced Problem Formulation (with Economic and Risk Factors): The formulation can be extended to include economic considerations and risk management, which are crucial in practical cloud-edge computing systems:
  - $C$: The cost associated with offloading tasks, bounded by a maximum allowable cost $C_{max}$ ($C \le C_{max}$).
  - $R$: The risk associated with offloading tasks (e.g., security, reliability), bounded by a maximum allowable risk $R_{max}$ ($R \le R_{max}$).

  This integration allows for a more comprehensive approach to managing resources by considering economic and reliability aspects in offloading decisions.
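A compact sketch of this utility and feasibility bookkeeping, under the assumption that the utility simply adds a negative mean-latency term to a mean-accuracy term and that cost and risk enter as hard constraints; the paper's exact weighting is not reproduced in this excerpt.

```python
from statistics import mean

def system_utility(latencies, accuracies):
    # Utility improves as mean latency falls and mean prediction accuracy rises
    # (assumed additive form; the paper may weight the two terms differently).
    return -mean(latencies) + mean(accuracies)

def feasible(resources_left, delay, t_max, cost, c_max, risk, r_max):
    # Constraints of the (enhanced) formulation: non-negative residual
    # bandwidth/compute/memory, delay bound, cost budget, and risk budget.
    return (all(r >= 0 for r in resources_left)
            and delay <= t_max and cost <= c_max and risk <= r_max)

print(system_utility([8.0, 7.5, 9.0], [0.17, 0.15, 0.18]))
print(feasible([10, 4, 2], delay=8.0, t_max=15.0, cost=3.0, c_max=5.0, risk=0.1, r_max=0.2))
```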
4.2.7. Active Inference Based Offloading Strategy
4.2.7.1. State and Action Representations
The agent (the decision-making entity) needs to understand the environment state and define its possible actions.
- Environment State: The state of a server $ser_n$ comprises:
  - its remaining computational resources,
  - its remaining bandwidth, and
  - its remaining graphics memory resources.

  The overall state is formed by integrating the distance matrix from the terminals to the servers with the status of all servers.
- Agent Action: The agent's actions involve delegating the LLM inference task to a specific server and distributing resources. The action vector executed by the agent at time $t$ specifies:
  - the unique index of the server ($ser_n$) to which the task is delegated,
  - the amount of computational resources allocated to the task by that server,
  - the amount of channel bandwidth allocated, and
  - the amount of graphics memory allocated.

  An action is invalid if the allocated resources exceed the server's remaining resources, meaning it cannot be executed at that time.
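The state and action vectors described above map naturally onto small records. This is an illustrative encoding (the field names are ours, not the paper's), including the validity check that rejects over-allocation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ServerState:
    compute_left: float    # remaining computational resources
    bandwidth_left: float  # remaining channel bandwidth
    gpu_mem_left: float    # remaining graphics memory

@dataclass
class Action:
    server_idx: int        # which MEC/CS server the task is delegated to
    compute: float         # computational resources allocated to the task
    bandwidth: float       # channel bandwidth allocated
    gpu_mem: float         # graphics memory allocated

def is_valid(action: Action, servers: List[ServerState]) -> bool:
    # An action is invalid if any requested resource exceeds what the chosen
    # server still has available.
    s = servers[action.server_idx]
    return (action.compute <= s.compute_left
            and action.bandwidth <= s.bandwidth_left
            and action.gpu_mem <= s.gpu_mem_left)

servers = [ServerState(16.0, 100.0, 24.0), ServerState(64.0, 400.0, 96.0)]
print(is_valid(Action(0, compute=8.0, bandwidth=20.0, gpu_mem=12.0), servers))  # True
```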
4.2.7.2. Rewardless Guidance in Active Inference
The proposed algorithm uses a rewardless guidance mechanism, moving away from conventional DRL's reliance on environmental reward signals. The focus of offloading decisions is to ensure that tasks are executed with minimal latency and a maximal success rate.
- Rewardless Guidance Function (rg): The guidance is formalized as a function of three quantities:
  - $tc$: A binary variable that is 1 if the task is completed successfully and 0 otherwise, ensuring that only successfully completed tasks contribute to the guidance.
  - $L_T$: The total time delay for the task; the corresponding term indicates that the rewardless guidance increases as latency decreases (i.e., faster completion is better).
  - $P_T$: The prediction accuracy for the task; the rewardless guidance increases as prediction accuracy increases.

  A higher value indicates that an action is more aligned with the current state under rewardless guidance, increasing the likelihood of selecting that action. This effectively guides the agent towards actions that achieve both low latency and high accuracy without an explicit external reward signal.
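A minimal sketch of the rewardless guidance signal: zero for failed tasks, and otherwise growing as latency shrinks and accuracy rises. The specific combination used here (an unweighted sum of inverse latency and accuracy) is an assumption; the excerpt only states the monotonic behaviour of each term.

```python
def rewardless_guidance(completed: bool, latency_s: float, accuracy: float) -> float:
    # tc gates the signal: unfinished or abandoned tasks contribute nothing.
    if not completed or latency_s <= 0.0:
        return 0.0
    # Lower latency and higher prediction accuracy both raise the guidance value
    # (assumed additive combination; the paper may weight the terms differently).
    return 1.0 / latency_s + accuracy

print(rewardless_guidance(True, latency_s=8.0, accuracy=0.17))   # 0.295
print(rewardless_guidance(False, latency_s=8.0, accuracy=0.17))  # 0.0
```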
4.2.7.3. Framework of Active Inference Decision
The active inference-based decision-making approach enables the agent to make internal decisions and interact with its environment. The agent-environment interaction is defined as a Partially Observable Markov Decision Process (POMDP) [35].
- POMDP Description:
  - At time $t-1$, the agent is in state $s_{t-1}$ and selects an action $a_{t-1}$ with probability $P(a_{t-1} \mid s_{t-1})$.
  - It then transitions to state $s_t$ at time $t$. This transition, $P(s_t \mid s_{t-1}, a_{t-1})$, is probabilistic.
  - In a POMDP, the agent cannot always observe the true environmental state. Instead, it continuously receives observations $o_t$ according to the probability $P(o_t \mid s_t)$.
- Generative Model: The agent uses a generative model $P_\theta$ to predict external environmental conditions, where $\theta$ denotes the learnable parameters of its internal model.
- Free Energy Minimization: Adhering to the free energy principle, the agent reduces free energy through two processes:
  - Internal Model Building: It engages in the POMDP, gathers observations $o_t$, and builds an internal model to represent the environment. This model acts as the agent's internal representation of the external world.
  - Action Planning: During action planning, the agent uses this model to identify and execute actions that reduce free energy. This active inference mechanism improves the agent's ability to understand its surroundings, anticipate future states, and perform goal-oriented actions.
- Free Energy ($F$): The goal of active inference optimization is to enhance the evidence of the agent's generative model, thereby reducing free energy. By setting expected preferences, the agent can be directed toward the goal state. The agent seeks to minimize the free energy
  $ F = D_{KL}\big[\,Q(s, \theta)\,\big\|\,P(o, s, \theta)\,\big] $
  where:
  - $D_{KL}$: the Kullback-Leibler (KL) divergence.
  - $Q(s, \theta)$: the agent's belief about future variables (its approximate posterior distribution over states and parameters).
  - $P(o, s, \theta)$: the agent's generative model (its distribution over observations, states, and parameters).
  - The negative of $F$ corresponds to the evidence lower bound (ELBO) [36], and $F$ guides the agent's strategy selection: the agent selects the strategy that minimizes $F$.
- Anticipated Future Free Energy ($G$): The agent's objective extends beyond a single time point $t$. It seeks to minimize the anticipated future free energy
  $ G(\pi) = D_{KL}\big[\,Q(o_{0:t}, s_{0:t}, \theta \mid \pi)\,\big\|\,P_\theta(o_{0:t}, s_{0:t})\,\big] $
  where:
  - $o_{0:t}$: the sequence of observations made by the agent from time 0 to $t$.
  - $s_{0:t}$: the sequence of states experienced by the agent over the same interval.
  - $Q(o_{0:t}, s_{0:t}, \theta \mid \pi)$: the agent's subjective probability distribution over future variables, conditioned on a policy $\pi$.
  - $P_\theta$: the agent's generative model, with $\theta$ being the parameters of the underlying neural network.

  The optimal policy is derived by minimizing $G$. This minimization is achieved by aligning the generative model's output distribution with the true state distribution, such that $Q(o_{0:t}, s_{0:t}, \theta \mid \pi) = P_\theta(o_{0:t}, s_{0:t})$. This implies that the agent's beliefs about future observations and states, given its policy, become identical to its generative model's predictions, minimizing surprise and uncertainty.
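For intuition, here is a discrete-state toy illustration of the variational free energy as a KL divergence between the agent's belief Q and its generative model P, the quantity the agent drives toward zero; the numbers are made up, not values from the paper.

```python
import math

def kl_divergence(q, p):
    # D_KL(Q || P) = sum_i q_i * log(q_i / p_i); zero iff the two distributions match.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

# Toy belief over three hidden environment states vs. the generative model's prior.
Q_belief = [0.70, 0.20, 0.10]
P_model  = [0.50, 0.30, 0.20]
print(kl_divergence(Q_belief, P_model))   # > 0: the agent is still "surprised"
print(kl_divergence(P_model, P_model))    # 0.0: beliefs match the model, F is minimized
```

The anticipated future free energy G plays the same role over sequences of future observations and states conditioned on a candidate policy, which is what the planner in the next subsection minimizes.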
4.2.7.4. Algorithm 1: Guided Active Inference Offloading without Rewards
The full process of the proposed active inference algorithm is detailed as follows:
Algorithm 1: Guided Active Inference Offloading without Rewards
- Require:
  - Transition probabilities.
  - An initial (empty) policy $\pi$.
  - The number of optimization iterations $I$.
  - The number of candidate policies $J$ considered per iteration, from which the top $K$ are selected.
  - Execution over $E$ episodes, each with $T$ steps.
  - An ensemble model parameterized by $\theta$.
- Goal: Refine the initial policy into an optimized strategy $\pi^{*}$.
- Ensure: Optimized strategy $\pi^{*}$.
- for every episode do
  - Initialize the time step to 0.
  - Reset the environment state for a new episode.
  - for every step do
    - The terminal generates a task.
    - for every iteration do
      - Derive a set of $J$ potential policies from $\pi$.
      - for each candidate policy do
        - Sample the candidate policy from the distribution of policies.
        - Compute the free-energy component of the guidance by minimizing the anticipated future free energy $G$.
        - Compute the rewardless guidance $rg$.
      - end for
      - Rank the candidate policies based on $rg$ and select the top $K$.
    - end for
    - Adjust $\pi$ based on the top $K$ policies.
    - Choose an action according to $\pi$.
    - Obtain the next state, the latency, and the prediction accuracy, and check completion by applying $rg$.
    - Store the experience tuple.
    - Update the current state.
  - end for
  - Train the ensemble model $\theta$ on the collected experience.
- end for
- return the optimized policy $\pi^{*}$.
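The control flow of Algorithm 1 amounts to a sampling-and-ranking planner: propose J candidate policies, score each with the rewardless guidance (informed by the model-based free-energy term), keep the top K, and adjust the policy. The sketch below captures that loop in simplified form; the environment, policy representation, and scoring functions are stand-ins, not the authors' implementation.

```python
import random

def plan_step(state, sample_policy, score_policy, J=32, K=4, iterations=3):
    """One decision step of a simplified sample-and-rank planner.

    sample_policy(state)      -> a candidate policy (e.g., an action vector)
    score_policy(state, pol)  -> rewardless guidance estimated with the ensemble model
    """
    elites = []
    for _ in range(iterations):
        candidates = [sample_policy(state) for _ in range(J)]        # J potential policies
        ranked = sorted(candidates, key=lambda p: score_policy(state, p), reverse=True)
        elites = ranked[:K]                                          # keep the top K
        # The full algorithm would adjust the policy distribution toward the elites
        # here; this sketch keeps the sampler fixed for brevity.
    return elites[0]                                                 # action chosen from the best policy

# Toy usage: policies are scalar "offload aggressiveness" knobs with a fake score.
best = plan_step(
    state=None,
    sample_policy=lambda s: random.uniform(0.0, 1.0),
    score_policy=lambda s, p: -(p - 0.7) ** 2,   # pretend 0.7 maximizes the guidance
)
print(round(best, 2))
```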
5. Experimental Setup
5.1. Datasets
The experimental validation utilized the HumanEval dataset [38], published by OpenAI.
- Source: OpenAI.
- Scale and Characteristics: The dataset comprises 164 programming problems. Each problem includes:
- Function signatures.
- String annotations (docstrings).
- Code bodies (the solution).
- Test units (to verify correctness).
- Domain: Programming tasks, specifically code generation and completion.
- Creation Method: The problems were created manually (handwritten) to ensure accuracy and non-repeatability.
- Language: Problems are articulated in Python, with descriptive sections (like comments) in English.
- Purpose: This dataset is suitable for evaluating LLMs on their ability to generate functional and correct code, making it relevant for tasks that involve LLM inference where the output quality (and thus the prediction accuracy $P_T$) is critical.
5.2. Evaluation Metrics
The paper uses several standard metrics to evaluate the performance of its proposed method and the baseline DRL algorithms. While the paper does not provide explicit mathematical formulas for these metrics, their conceptual definitions are standard in reinforcement learning and system performance evaluation.
- Total Reward (or Sum Reward):
  - Conceptual Definition: In traditional DRL, the total reward is the cumulative sum of rewards an agent receives over an episode or a series of interactions with the environment. It is the primary objective function for DRL agents, indicating how well the agent is achieving its goals. For the proposed method, total reward is used solely for performance assessment and is decoupled from the action selection mechanism, which uses rewardless guidance instead. For the baselines, it is the direct optimization target.
  - Mathematical Formula: For an episode of length $T$: $ \text{Total Reward} = \sum_{t=0}^{T} R_t $
  - Symbol Explanation:
    - $R_t$: The reward received by the agent at time step $t$.
    - $T$: The total number of time steps (or horizon) in an episode.
- Task Completion Rate:
  - Conceptual Definition: This metric quantifies the percentage of LLM inference tasks that are successfully completed within the maximum acceptable delay ($t_{max}$) and without invalid resource allocations. It measures the reliability and effectiveness of the offloading strategy in handling the workload.
  - Mathematical Formula: $ \text{Task Completion Rate} = \frac{\text{Number of Successfully Completed Tasks}}{\text{Total Number of Tasks Requested}} \times 100\% $
  - Symbol Explanation:
    - Number of Successfully Completed Tasks: Count of tasks that meet all constraints (latency, resources) and produce a result.
    - Total Number of Tasks Requested: Total number of LLM inference tasks initiated by terminals.
- Mean Latency:
  - Conceptual Definition: Mean latency refers to the average total time delay ($L_T$) for all LLM inference tasks from initiation to result return. Minimizing latency is crucial for responsive mobile computing and real-time IoT applications.
  - Mathematical Formula: $ \text{Mean Latency} = \frac{1}{N_{completed}} \sum_{k=1}^{N_{completed}} L_{T_k} $
  - Symbol Explanation:
    - $N_{completed}$: The number of successfully completed tasks.
    - $L_{T_k}$: The total time delay for the $k$-th successfully completed task, as defined in Section 4.2.5.
- Mean Pass@100:
  - Conceptual Definition: This metric, specifically mentioned in the context of the HumanEval dataset, refers to the average prediction accuracy across tasks. Pass@k is a common metric for code generation models (like GPT-J-6B on HumanEval), indicating the proportion of problems for which at least one of $k$ generated solutions passes the unit tests; pass@100 corresponds to $k = 100$ generated solutions per problem. The mean pass@100 is the average of these scores across all problems. In the context of the paper's utility function, it directly relates to $P_T$, the prediction accuracy of a task. A higher value indicates better quality or accuracy of the LLM's output.
  - Mathematical Formula: If $P_{T_k}$ is the accuracy for the $k$-th task (e.g., its pass@100 score), then: $ \text{Mean Pass@100} = \frac{1}{N_{completed}} \sum_{k=1}^{N_{completed}} P_{T_k} $
  - Symbol Explanation:
    - $N_{completed}$: The number of successfully completed tasks.
    - $P_{T_k}$: The prediction accuracy (e.g., pass@100 score) for the $k$-th successfully completed task.
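For reference, the standard unbiased pass@k estimator introduced with HumanEval computes, from n generated samples of which c pass the unit tests, the probability that at least one of k drawn samples is correct; the mean pass@100 then averages this per-problem value over all problems. The snippet below implements that standard estimator; whether the paper applies exactly this formula is not confirmed in the excerpt.

```python
from math import comb
from statistics import mean

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k),
    # i.e., the probability that at least one of k drawn samples is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: three HumanEval problems, 200 generated solutions each.
per_problem = [pass_at_k(200, c, k=100) for c in (3, 0, 12)]
print(round(mean(per_problem), 3))   # mean pass@100 across the problems
```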
5.3. Baselines
The paper compared its proposed method against three prominent Deep Reinforcement Learning (DRL) algorithms:
- Rainbow DQN [39]:
  - Description: Rainbow DQN is an advanced DRL algorithm that combines several key improvements to the original Deep Q-Network (DQN): Double DQN (to reduce overestimation bias), Prioritized Experience Replay (to sample more important experiences more frequently), Dueling Networks (to separate state value and advantage functions), Multi-step Learning (to use returns from multiple steps), NoisyNets (for exploration), and Distributional RL (to learn a distribution over returns rather than just the expected return). It is known for its strong performance in environments with discrete action spaces.
  - Representativeness: It represents a state-of-the-art value-based DRL method, known for its stability and performance in various RL benchmarks.
- PPO (Proximal Policy Optimization) [40]:
  - Description: PPO is a policy gradient DRL algorithm that strikes a balance between ease of implementation, sample efficiency, and performance. It works by optimizing a surrogate objective function with a clipping mechanism that constrains policy updates, preventing them from becoming too large and destabilizing training. PPO is widely used and performs well in both discrete and continuous action spaces.
  - Representativeness: It is one of the most popular and robust policy gradient methods, often considered a strong baseline for many DRL tasks.
- SAC (Soft Actor-Critic) [41]:
  - Description: SAC is an off-policy actor-critic DRL algorithm that optimizes a stochastic policy in an entropy-regularized reinforcement learning framework. The entropy regularization encourages exploration and helps prevent the policy from collapsing to a single action. SAC is known for its stability, sample efficiency, and effectiveness in continuous control tasks.
  - Representativeness: It represents a state-of-the-art actor-critic method, particularly effective in continuous control settings relevant to resource allocation.

These algorithms were chosen as benchmarks because they are state-of-the-art (SOTA) and effective in both discrete and continuous domains, providing a solid comparative baseline for the proposed active inference method.
5.4. Environmental Configuration and Resource Constraints
The experimental environment was configured to align with the maximum demands identified during the training phase, with the algorithm's efficacy assessed under fluctuating workloads during testing.
- LLM Model: GPT-J-6B was used, consisting of 28 layers, with a model dimension of 4096 and a feedforward dimension of 16384. It includes 16 attention heads, each with a dimension of 256, and Rotary Positional Embedding (RoPE) applied to 64 dimensions per head. The model was trained with a tokenization vocabulary of 50257, employing the same Byte Pair Encoding (BPE) scheme as GPT-2 and GPT-3 [37].
- Edge Server: Featured an NVIDIA 3090 GPU. For LLM inference, it ran the GPT-J-6B model without any acceleration methods.
- Cloud Server: Leveraged the Triton inference server to enhance the inference performance of the GPT-J-6B model.
- Resource Disparity: Consequently, the computation times ($t_{comp}$) and the pass@100 scores obtained when offloading the GPT-J-6B task differed significantly between the cloud and the edge due to their disparate hardware capabilities and optimization frameworks.
- Simulation Parameters:
  - Maximum acceptable delay ($t_{max}$): ranged from 1 to 15 seconds (in 1-second intervals) for the latency variation analysis; for training, it was set to 15 seconds.
  - Number of tasks: set to 100 for both training and the latency variation analysis.
  - Cloud-to-Edge Server Resource Ratio: 1:4 (this ratio significantly impacts offloading decisions, especially for latency-sensitive tasks).
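Collected in one place, the reported experimental knobs look roughly like the configuration below; the field names are ours, and the values are only those stated in this section.

```python
# Illustrative grouping of the simulation settings reported above.
SIM_CONFIG = {
    "llm": {
        "name": "GPT-J-6B",
        "layers": 28,
        "d_model": 4096,
        "d_feedforward": 16384,
        "attention_heads": 16,
        "d_head": 256,
    },
    "edge_server": {"gpu": "NVIDIA 3090", "inference_accel": None},
    "cloud_server": {"inference_accel": "Triton server"},
    "t_max_seconds": {"training": 15, "latency_sweep": list(range(1, 16))},
    "num_tasks": 100,
    "cloud_to_edge_resource_ratio": (1, 4),
}
print(SIM_CONFIG["t_max_seconds"]["latency_sweep"][:5])  # [1, 2, 3, 4, 5]
```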
6. Results & Analysis
6.1. Core Results Analysis
The simulation results are presented through benchmarking against prominent DRL algorithms (Rainbow DQN, PPO, and SAC) across two main phases: training phase performance and latency variation analysis.
6.1.1. Training Phase Performance
This section compares the performance of the proposed method with the existing ones during the training phase, with $t_{max} = 15$ seconds and 100 tasks.
The following figure (Figure 3 from the original paper) shows training phase performance metrics:

Figure 3: Training-phase performance comparison of the evaluated methods (AI w/ Rewardless Guidance, Rainbow DQN, SAC, and PPO) across four metrics: cumulative reward, task completion rate, mean latency, and mean pass rate.

- Convergence Speed and Total Reward (Figure 3a):
  - The proposed active inference method with rewardless guidance (AI w/ Rewardless Guidance) demonstrates superior convergence speed, achieving stable performance around episode 200.
  - Initially (first 50 episodes), it underperforms due to environmental complexity, but quickly surpasses Rainbow DQN, which converges rapidly but then plateaus. SAC lags significantly, while PPO shows slow initial convergence but eventually matches Rainbow DQN's performance, though it is still outperformed by the proposed method.
  - This indicates that the active inference approach learns to manage LLM offloading effectively faster and reaches a higher performance level than traditional DRL techniques.
- Task Completion Rate (Figure 3b):
  - The proposed method achieves a remarkably high task completion rate, significantly exceeding those of Rainbow DQN, PPO, and SAC.
  - This highlights the method's ability to reliably execute a large proportion of tasks without compromising individual task requirements, which is critical for LLM services.
- Mean Latency (Figure 3c):
  - The proposed method exhibits an average task completion latency of about 8 seconds, outperforming SAC (around 10 seconds), and is slightly lower than both Rainbow DQN and PPO.
  - The shaded regions in the graph represent the variability (standard deviation) of average latency during training. These regions are larger for the mainstream DRL strategies, reflecting greater instability and fluctuation in their latency performance. AI w/ Rewardless Guidance shows much tighter variability, underscoring its stability, which matters for low-latency IoT applications in dynamic environments.
- Mean Pass@100 (Figure 3d):
  - The proposed method achieves an average pass@100 of approximately 0.175, surpassing Rainbow DQN and PPO (0.15) and SAC (0.14).
  - This suggests that the active inference method tends to offload tasks more frequently to high-accuracy edge nodes, or makes better decisions leading to higher-quality LLM outputs.
  - While mainstream DRL methods might prioritize low latency, the proposed method effectively balances both latency and accuracy goals.
6.1.2. Latency Variation Analysis
This section evaluates algorithm performance under varying maximum latency thresholds ($t_{max}$), ranging from 1 to 15 seconds in 1-second intervals, with 100 tasks.
The following figure (Figure 4 from the original paper) shows a benchmark comparison of several prominent deep reinforcement learning algorithms under different maximum latency thresholds:

Figure 4: Benchmark comparison of the algorithms under different maximum latency thresholds, across four metrics: total reward, task completion rate, mean latency, and mean pass@100.

- Total Reward (Figure 4a):
  - The Cloud and MEC configurations (presumably separate baselines or configurations, though not explicitly detailed in the text accompanying the figure) yield the highest reward once $t_{max}$ is sufficiently large. This suggests that, with sufficient time, centralized or closer-to-edge resources can maximize rewards.
  - The proposed AI w/ Rewardless Guidance consistently shows higher rewards than the DRL baselines across the $t_{max}$ values, especially as $t_{max}$ increases.
- Task Completion Rate (Figure 4b):
  - No algorithm can complete tasks when $t_{max}$ is very small. This is attributed to the unavoidable combined inference and transmission times exceeding the threshold, highlighting a fundamental physical limitation of the system.
  - Within an intermediate range of $t_{max}$, the proposed method and Rainbow DQN maintain a higher task completion rate, whereas PPO and SAC fall below them, indicating poorer performance in moderately constrained latency scenarios.
  - The text notes that the minimum inference time surpasses 9 seconds given the edge constraints and the wireless transmission delay, so tasks can only be offloaded to the cloud during this period (consistent with the 1:4 cloud-to-edge resource ratio).
  - When $t_{max}$ is large enough, the proposed method achieves near-100% task completion, demonstrating its robustness under less stringent latency constraints.
- Mean Latency (Figure 4c):
  - The figure shows a declining mean-latency curve as $t_{max}$ increases, particularly noticeable once cloud offloading becomes possible. Greater tolerance for delay allows tasks to be processed by more powerful (but potentially more distant) cloud resources, thereby reducing the average processing time.
  - The proposed method consistently maintains lower mean latency across the different $t_{max}$ values compared to the DRL baselines.
- Mean Pass@100 (Figure 4d):
  - When $t_{max}$ is very small, the average pass@100 remains near zero for all algorithms, corresponding to the inability to complete tasks.
  - Over the moderately constrained range of $t_{max}$, the average pass@100 stays below 0.05 for all algorithms, indicating that even the tasks completed during this period have low accuracy.
  - However, once $t_{max}$ is sufficiently large, AI w/ Rewardless Guidance achieves an average pass@100 of approximately 0.15, significantly outperforming Rainbow DQN and PPO (around 0.12) and SAC (about 0.1). This suggests that, given sufficient latency tolerance, the proposed method optimizes for higher-quality LLM outputs.

Overall, the active inference method with rewardless guidance consistently demonstrates superior performance across all evaluated metrics and latency thresholds. This indicates strong adaptability to diverse latency demands, which is a key requirement for efficient task offloading and execution in IoT environments.
6.2. Data Presentation (Tables)
The provided text excerpt does not include any tables of experimental results. All results are presented graphically in Figure 3 and Figure 4.
6.3. Ablation Studies / Parameter Analysis
The paper presents an analysis of performance under varying maximum latency thresholds ($t_{max}$), which can be considered a form of parameter or sensitivity analysis rather than a formal ablation study. This analysis (Section V.C) investigates how the algorithms perform as $t_{max}$ changes, revealing their robustness and adaptability to different latency requirements. The results discussed in Section 6.1.2 detail how the algorithms' total reward, task completion rate, mean latency, and mean pass@100 are affected by this crucial parameter, showing the proposed method's superior performance across a wide range of $t_{max}$ values. No explicit ablation studies (where specific components of the proposed method are removed to assess their individual contribution) are described in the provided text.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully proposed an active inference approach utilizing rewardless guidance to address the critical challenge of resource scarcity for Large Language Models (LLMs) during inference in IoT cloud-edge computing environments. The authors developed a robust cloud-edge network system designed for efficient handling of LLM inference tasks, from request initiation at terminals to result delivery. Extensive simulations empirically validated the effectiveness of the proposed method. It was shown to outperform conventional Deep Reinforcement Learning (DRL) techniques in several key aspects: training convergence speed, tolerance to maximum latency thresholds during testing, and overall task load management. The research highlights the potential of active inference to create more adaptable and robust LLM strategies suitable for the demands of the 6G era and the Symbiotic Internet-of-Things (IoT).
7.2. Limitations & Future Work
The authors identified several directions for future work:
- Complex Scenarios: Extending the approach to more complex scenarios, such as dynamic network topologies (where the network structure changes over time) and multi-agent environments (where multiple intelligent agents interact).
- Broader Terminal Devices: Broadening the range of terminal devices considered, implying greater diversity in their capabilities and mobility patterns.
- Distributed Computing: Exploring distributed computing scenarios where computation is spread across many interconnected devices.
- Advanced Network Systems: Leveraging advanced network systems for resource scheduling, specifically mentioning space-air-ground integrated networks (SAGIN), which involve satellites, aerial platforms (like UAVs), and ground stations.
- Algorithmic Performance Enhancement: Focusing on further enhancing the algorithmic performance of the proposed active inference method.

While not explicitly stated as limitations, the need for these future directions implies current limitations in handling such complexities, in the scope of device types considered, and in the integration with advanced heterogeneous networks.
7.3. Personal Insights & Critique
This paper presents a compelling argument for moving beyond traditional DRL in the context of LLM offloading within 6G Symbiotic IoT, leveraging active inference with rewardless guidance.
Inspirations:
- Bridging Neuroscience and AI: The use of active inference, a framework rooted in neuroscience, for a practical engineering problem like LLM offloading is highly inspiring. It suggests that biologically plausible models of intelligence can offer novel solutions to complex AI and networking challenges, especially in dynamic, uncertain environments.
- Robustness in Dynamic Environments: The concept of rewardless guidance is particularly insightful. By moving away from brittle, hand-crafted reward functions, the agent inherently seeks better predictive models of its environment, which naturally leads to robust and adaptive behavior. This could be highly beneficial for IoT scenarios where environmental conditions (e.g., network load, device mobility) are constantly changing and unpredictable.
- 6G Potential: The paper effectively highlights how 6G's promised capabilities (low latency, high data rates) can be synergistically combined with intelligent offloading strategies. This could pave the way for truly omnipresent and responsive AI services, even on resource-constrained devices.
Potential Issues & Areas for Improvement:
- Complexity of Active Inference: While rewardless guidance simplifies the reward engineering problem, active inference itself can be computationally complex due to its probabilistic nature and the need to maintain an internal generative model. The paper briefly mentions the learnable parameters $\theta$ of the generative model but does not elaborate on the specific architecture or training overhead. A more detailed discussion of the computational cost and complexity of the active inference agent itself, especially for real-time deployment on actual edge devices or within 6G infrastructure, would strengthen the argument.
- Real-world Deployment Challenges: The simulations are thorough, but translating these results into real-world 6G Symbiotic IoT deployments will involve significant challenges. Factors like channel fading, intermittent connectivity, device heterogeneity (beyond the mobile/fixed categories), and security vulnerabilities are crucial. While the problem formulation includes price and risk, their integration into the active inference mechanism itself needs further exploration beyond serving as constraints.
- Interpretability of Rewardless Guidance: The rewardless guidance function is a clever way to encode desired outcomes. However, the balance between minimizing latency and maximizing accuracy (the two terms in the sum) is set implicitly. Future work could explore how this balance can be dynamically adjusted or learned based on application-specific priorities, or whether a more nuanced utility function could be incorporated directly into the free energy minimization objective without becoming an explicit reward function.
- Comparison to Other AI Paradigms: While DRL is a strong baseline, comparing active inference against other AI paradigms for resource management, such as federated learning or multi-agent systems that are not strictly DRL (beyond those cited in the references), could provide a broader perspective on its advantages and limitations.

Overall, this paper provides a valuable contribution to the field of LLM deployment in future networks, offering a fresh perspective based on active inference. Its rigorous simulation results demonstrate promising performance, setting a strong foundation for future research in this critical area.