EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
TL;DR Summary
EdgeShard uses collaborative edge computing to shard LLMs across distributed devices, optimizing latency and throughput via dynamic programming, reducing inference delay by 50% and doubling throughput while addressing cloud dependency challenges.
Abstract
Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs heavily rely on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns. Edge computing is promising to address such concerns by deploying LLMs on edge devices, closer to data sources. Some works try to leverage model quantization to reduce the model size to fit the resource-constraint edge devices, but they lead to accuracy loss. Other works use cloud-edge collaboration, suffering from unstable network connections. In this work, we leverage collaborative edge computing to facilitate the collaboration among edge devices and cloud servers for jointly performing efficient LLM inference. We propose a general framework to partition the LLM model into shards and deploy on distributed devices. To achieve efficient LLM inference, we formulate an adaptive joint device selection and model partition problem and design an efficient dynamic programming algorithm to optimize the inference latency and throughput, respectively. Experiments of Llama2 serial models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement over baseline methods.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
1.2. Authors
Mingjin Zhang, Jiannong Cao, Fellow, IEEE, Xiaoming Shen, Zeyang Cui
1.3. Journal/Conference
The paper is published as a preprint on arXiv, indicated by arXiv preprint arXiv:2405.14371, 2024. As an arXiv preprint, it has not yet undergone formal peer review by a journal or conference, but arXiv is a highly respected open-access repository for preprints in fields like physics, mathematics, computer science, etc., serving as a primary means for rapid dissemination of research findings. Jiannong Cao is listed as a Fellow, IEEE, which indicates a high level of recognition in the Institute of Electrical and Electronics Engineers, suggesting expertise in the field.
1.4. Publication Year
2024 (submitted to arXiv on May 23, 2024).
1.5. Abstract
The paper addresses the challenges of deploying Large Language Models (LLMs), such as prolonged latency, high bandwidth cost, and privacy concerns, which arise from their heavy reliance on cloud computing. While edge computing offers a promising alternative by placing LLMs closer to data sources, existing solutions like model quantization often lead to accuracy loss, and cloud-edge collaboration suffers from unstable network connections. This work introduces EdgeShard, a framework that leverages collaborative edge computing among edge devices and cloud servers for efficient LLM inference. EdgeShard partitions an LLM into shards and deploys them across distributed devices. To optimize performance, it formulates an adaptive joint device selection and model partition problem and proposes an efficient dynamic programming algorithm to optimize for inference latency and throughput. Experimental evaluations using Llama2 models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement compared to baseline methods.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2405.14371v1
PDF Link: https://arxiv.org/pdf/2405.14371v1.pdf
Status: Preprint (Version 1).
2. Executive Summary
2.1. Background & Motivation
The explosion of Large Language Models (LLMs) has brought unprecedented capabilities in natural language processing (NLP) and content generation. However, their immense computational and memory requirements currently necessitate heavy reliance on cloud computing. This reliance presents several critical issues:
- Prolonged Latency: Sending data to and from distant cloud servers introduces delays, making real-time applications (e.g., robotics, navigation) impractical.
- High Bandwidth Cost: Transmitting large volumes of data (text, images, video, IoT sensor data) to the cloud consumes significant network bandwidth and incurs costs.
- Privacy Concerns: Handling sensitive data (e.g., medical records, financial data, personal mobile data) in a centralized cloud raises substantial privacy and security risks.

Edge computing is a promising paradigm to mitigate these issues by moving computation closer to the data sources, on edge devices (e.g., edge servers, mobile phones). However, LLMs are computation-intensive and resource-greedy, often exceeding the memory and computational capacity of single edge devices. Existing approaches to adapt LLMs for edge environments face their own limitations:

- Model Quantization: While it reduces model size, it frequently leads to accuracy loss.
- Cloud-Edge Collaboration: Partitioning models between edge and cloud can be effective but often suffers from high and unstable network latency between edge devices and distant cloud servers.

The paper identifies a gap: previous edge computing research often focuses on vertical collaboration (cloud-edge-end devices) but neglects horizontal edge-to-edge collaborations. With the continuous growth of edge computing power and the deployment of numerous edge servers and edge clouds, there is an opportunity to leverage a broader pool of distributed resources. This motivates the authors to explore collaborative edge computing (CEC), which integrates geo-distributed edge devices and cloud servers, as a more robust solution for LLM inference.
2.2. Main Contributions / Findings
The primary contributions of EdgeShard are threefold:
- General LLM Inference Framework: Introduction of EdgeShard, a novel framework for efficient collaborative LLM inference across heterogeneous edge devices and cloud servers. The framework enables the dynamic selection of participating devices and the layer-wise partitioning of LLMs to optimize performance in a collaborative edge computing environment.
- Adaptive Optimization Problem Formulation and Algorithm: The authors mathematically formulate a joint device selection and model partition problem. The problem aims to optimize either inference latency or throughput by considering heterogeneous computation and networking resources, as well as memory constraints. To solve this complex problem, they design an efficient dynamic programming (DP) algorithm.
- Experimental Validation on a Physical Testbed: Extensive experiments are conducted using Llama2 serial models (7B, 13B, 70B) on a heterogeneous physical prototype. The results demonstrate that EdgeShard significantly outperforms various baseline methods, achieving:
  - Up to 50% latency reduction compared to Edge-Solo and Cloud-Edge-Opt for Llama2-7B.
  - 2x throughput improvement over Edge-Solo and Cloud-Edge-Opt for Llama2-7B.
  - The ability to deploy larger models like Llama2-70B, which are out-of-memory (OOM) for the baseline methods.
  - Adaptability to varying network conditions and device heterogeneity.

These findings highlight that EdgeShard provides a flexible and adaptive method for serving LLMs by efficiently utilizing ubiquitous computing devices, particularly beneficial in scenarios where cloud bandwidth is limited or traditional edge deployments are memory-constrained.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Large Language Models (LLMs)
LLMs are a class of deep learning models, typically decoder-based transformer models, characterized by their massive scale—often billions or even hundreds of billions of parameters. They are trained on vast amounts of text data, enabling them to understand, generate, and process human language with remarkable fluency and coherence.
- Transformer Architecture: The fundamental architecture underpinning most modern LLMs. Introduced in "Attention Is All You Need" (Vaswani et al., 2017), it relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence when processing each element.
- Decoder-based Models: Unlike encoder-based models (e.g., BERT), which are good for understanding existing text, decoder-based models (e.g., the GPT series, Llama) are primarily designed for generative tasks, producing new text sequentially.
- Autoregressive Nature of Inference: LLM inference, especially for generation, is autoregressive. This means that each new token generated depends on all the previously generated tokens and the initial input. The process typically involves two phases:
  - Prompt Processing (Prefill) Phase: The model takes an initial input sequence (the prompt) and processes it to generate the first new token. During this phase, it computes and stores intermediate representations (like Key-Value caches) for all input tokens.
  - Autoregressive Generation Phase: After the prefill phase, the model generates tokens one by one. In each step, it takes the current input (initial prompt + all previously generated tokens) and predicts the next token. This process repeats until an end-of-sequence token is generated or a maximum length is reached.
- Key-Value Caching (KV Caching): A crucial optimization for autoregressive generation. Since each new token calculation depends on all previous tokens, recomputing the Key and Value matrices for the entire sequence at each step would be very inefficient. KV caching stores these matrices from previous computations, allowing new tokens to be generated by only computing the Key and Value for the new token and concatenating them with the cached ones, significantly reducing the computational workload and improving response times. A minimal sketch of this decoding loop is given after Figure 2 below.

The following figure (Figure 2 from the original paper) illustrates the autoregressive nature of LLM inference:
Fig. 2. LLM inference has an autoregressive nature.
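To make the two phases and the role of the KV cache concrete, here is a minimal, framework-agnostic sketch (not from the paper). The function `model_step`, its signature, and the greedy token selection are illustrative assumptions standing in for a real transformer forward pass.

```python
# Toy sketch of prefill + autoregressive decoding with a KV cache.
# `model_step` is a hypothetical stand-in for one transformer forward pass:
# it takes the newly fed token ids plus the cached keys/values and returns
# next-token logits (a plain list here) together with the updated cache.

def generate(prompt_tokens, model_step, max_new_tokens, eos_id):
    # Prefill phase: process the whole prompt once, building the KV cache.
    logits, kv_cache = model_step(prompt_tokens, kv_cache=None)
    next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick

    generated = [next_token]
    # Autoregressive phase: feed back one token at a time; thanks to the cache,
    # only the new token's keys/values are computed at each step.
    for _ in range(max_new_tokens - 1):
        if next_token == eos_id:
            break
        logits, kv_cache = model_step([next_token], kv_cache=kv_cache)
        next_token = max(range(len(logits)), key=logits.__getitem__)
        generated.append(next_token)
    return generated
```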
3.1.2. Edge Computing
Edge computing is a distributed computing paradigm that brings computation and data storage closer to the data sources (the "edge" of the network), rather than relying solely on a centralized cloud data center. This proximity offers several advantages:
- Reduced Latency: Faster response times for applications by minimizing the distance data travels.
- Lower Bandwidth Consumption: Less data needs to be sent to the cloud, reducing network congestion and costs.
- Enhanced Privacy and Security: Sensitive data can be processed locally, reducing exposure to public networks and centralized repositories.
- Improved Reliability: Services can operate even with intermittent or unreliable cloud connectivity.
Edge devices can range from small IoT sensors and mobile phones to more powerful edge servers and edge gateways.
3.1.3. Collaborative Edge Computing (CEC)
Collaborative Edge Computing (CEC) extends the concept of edge computing by emphasizing the collaboration among various geo-distributed edge devices and cloud servers to jointly perform computational tasks. Instead of isolated edge devices or simple vertical offloading to a single cloud, CEC envisions a network of interconnected computing resources forming a shared resource pool. This paradigm aims to:
- Enlarge Resource Pool: Combine the compute, memory, and storage capacities of multiple heterogeneous devices.
- Optimize Resource Utilization: Distribute workloads efficiently across available resources.
- Provide Low-Latency Services: Keep processing close to data sources while leveraging broader capabilities.
- Expand Service Region: Enable wider coverage and availability of AI services.
The authors highlight that CEC differs from traditional edge computing by focusing on horizontal (edge-to-edge) as well as vertical (cloud-edge) collaborations.
The following figure (Figure 1 from the original paper) illustrates the concept of collaborative edge computing:
Fig. 1. Collaborative edge computing integrates the computing resources of ubiquitous geo-distributed devices for jointly performing computational tasks, with great benefits of enlarged resource pool, low-latency data processing, flexible device access, and expanded service region.
3.2. Previous Works
The paper discusses several existing approaches for deploying LLMs, primarily focusing on addressing their resource demands.
3.2.1. Model Quantization
Model quantization is a technique to reduce the memory footprint and computational cost of neural networks by representing their weights and/or activations with lower precision (e.g., 8-bit integers or 4-bit integers instead of 32-bit floating points).
- GPTQ [8]: An early work that quantizes LLMs with billions of parameters to 3-4 bits using approximate second-order information. It primarily focuses on weight-only quantization.
- AWQ (Activation-aware Weight Quantization) [10]: Optimizes channel scaling to preserve important weights, aiming to reduce quantization error. Also primarily weight-only.
- SmoothQuant [11] and Agile-Quant [7]: Take a more comprehensive approach by quantizing not only model weights but also activations. This can lead to greater memory and computational savings.
- Limitation: While effective in reducing model size, quantization often comes with a trade-off: it may lead to accuracy loss compared to full-precision models. The paper also points out that even quantized LLMs may still exceed the capacity of resource-constrained edge devices (e.g., Llama2-70B at 4-bit still requires about 35GB).
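As a back-of-the-envelope check of the figures quoted in this analysis (weights only, ignoring activations and the KV cache), the footprint scales directly with the bytes stored per parameter:
$
70 \times 10^{9}\ \text{params} \times 4\ \text{bytes (FP32)} \approx 280\ \text{GB}, \qquad 70 \times 10^{9}\ \text{params} \times 0.5\ \text{bytes (4-bit)} \approx 35\ \text{GB}
$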
3.2.2. Cloud-Edge Collaboration
These approaches partition an LLM and offload parts of its computation workload between an edge device and a powerful cloud server.
- Wang et al. [13]: Focus on distributing computation between cloud servers and edge devices to increase throughput and reduce communication overhead by leveraging the low-rank property of residual activations.
- Chen et al. [14]: Propose using location-based information of edge devices for personalized prompt completion in a collaborative edge-cloud LLM serving scenario.
- Limitation: A major drawback of these vertical collaboration methods is that the latency between edge devices and cloud servers is usually high and unstable, which can negatively impact the overall inference or fine-tuning performance of LLMs. They also typically involve only two computing entities (one edge, one cloud).
3.2.3. Distributed Training/Inference in Cloud Data Centers
These works focus on distributing large neural networks across multiple GPUs within a high-performance cloud data center.
- Gpipe [17]: A system for pipeline parallelism that splits a deep neural network into stages, assigning each stage to a different GPU. It focuses on efficient training of very large models.
- PipeDream [18]: Another system for generalized pipeline parallelism designed for DNN training, aiming to maximize hardware utilization.
- Differentiation from EdgeShard: The paper explicitly states that these cloud-centric solutions are not directly applicable to edge computing environments due to key differences:
  - Homogeneity vs. Heterogeneity: Cloud servers typically have homogeneous GPUs connected by high-bandwidth networks (e.g., InfiniBand, NVLink, up to 600 GB/s). Edge devices, in contrast, are inherently heterogeneous in their computation capabilities and are connected by heterogeneous, low-bandwidth networks (e.g., dozens of Kbps to 1000 Mbps).
  - Resource Constraints: Cloud data centers have abundant resources, whereas edge environments are resource-constrained. Solutions designed for the cloud often neglect these heterogeneous and resource-constrained characteristics of edge computing.
3.3. Technological Evolution
The evolution of deploying large AI models has generally followed a path from centralized, powerful cloud data centers towards distributed, localized edge environments. Initially, the sheer scale of deep learning models necessitated cloud infrastructure. As models grew larger into LLMs, techniques like quantization emerged to squeeze them onto smaller devices, albeit with potential accuracy compromises. Simultaneously, cloud-edge collaboration offered a hybrid approach, but its performance remained tethered to the stability and speed of cloud connectivity.
This paper's work on EdgeShard represents a significant step in this evolution by proposing collaborative edge computing. It moves beyond the limitations of single-device edge deployment and simple cloud-edge pairings, embracing a broader network of heterogeneous edge devices and cloud servers. This approach allows for more flexible and robust deployment by dynamically leveraging available resources, bringing the benefits of distributed processing (like those seen in cloud data centers via pipeline parallelism) to the more challenging and diverse edge computing landscape.
3.4. Differentiation Analysis
EdgeShard differentiates itself from previous works in several critical ways:
- Collaborative Edge Computing Paradigm: Unlike quantization, which reduces model size at the cost of potential accuracy, or cloud-edge collaboration, which is limited by unstable cloud links and typically involves only two nodes, EdgeShard embraces a truly collaborative edge computing environment. It integrates multiple heterogeneous edge devices and cloud servers into a single resource pool. This allows for horizontal (edge-to-edge) collaboration in addition to vertical (cloud-edge) collaboration, creating a more resilient and powerful system.
- Adaptive Joint Optimization: Instead of fixed partitions or simple offloading, EdgeShard formulates an adaptive joint device selection and model partition problem. This allows it to intelligently decide which devices to use and how to partition the LLM layers among them, dynamically optimizing for specific goals like latency or throughput. This contrasts with fixed partitioning strategies (e.g., Cloud-Edge-Even), which may not account for heterogeneity.
- Handling Heterogeneity: EdgeShard is explicitly designed for the heterogeneous computing devices and heterogeneous network connections typical of edge environments. This is a key departure from cloud-based distributed systems (like Gpipe or PipeDream) that assume homogeneous GPUs and high-bandwidth interconnects. EdgeShard's dynamic programming algorithm inherently factors in these varying computational capabilities and bandwidths.
- Flexibility in Objectives: The framework provides distinct optimization algorithms for both minimizing inference latency (suitable for single-user, real-time scenarios) and maximizing inference throughput (suitable for multi-user or batch processing scenarios), offering tailored performance solutions.
- Memory Wall Solution without Accuracy Loss: By sharding the LLM across multiple devices, EdgeShard enables the deployment of models (like Llama2-70B) that would otherwise exceed the memory capacity of any single edge device or even simple cloud-edge setups, without resorting to quantization and its associated accuracy loss.
4. Methodology
The EdgeShard framework facilitates efficient collaborative LLM inference on distributed devices by intelligently partitioning the LLM and allocating its shards to a selected subset of available computing devices. The overall process consists of three main stages: Profiling, Scheduling Optimization, and Collaborative Inference.
The following figure (Figure 3 from the original paper) provides an overview of the EdgeShard framework:
4.1. Principles
The core idea behind EdgeShard is to leverage the collective computational and memory resources of geo-distributed edge devices and cloud servers to overcome the limitations of deploying large LLMs on individual, resource-constrained devices. It views an LLM as a stack of layers and aims to distribute these layers (or shards) across a network of heterogeneous devices. The principle is that by optimally selecting devices and partitioning the model, the system can minimize overall inference latency or maximize throughput, even with varying computational capabilities and network bandwidths among the devices. The theoretical basis lies in dynamic programming for solving sequential decision-making problems, which naturally fits the layered structure of LLMs.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Profiling Stage
Profiling is an offline, one-time step that gathers essential runtime characteristics of the LLM layers on different devices and network conditions. This information is crucial for the subsequent scheduling optimization.
The profiled traces include:
- Execution time of each layer on different devices: This involves measuring the time it takes for a specific LLM layer to compute on each available device. For LLMs, this profiling differentiates between the prefill stage and the autoregressive token-generation stage. If a device cannot hold the full model for profiling, a dynamic model loading technique is used to load layers consecutively.
- Size of activations and memory consumption for each layer: This determines the data transfer volume between layers and the memory footprint of each layer on a device.
- Available memory of each device: The total memory capacity of each potential computing device.
- Bandwidth among devices: The network communication speed between any two devices in the collaborative network.
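The paper does not publish its profiler, but a minimal sketch of how such per-layer traces could be collected (here in PyTorch, with illustrative assumptions: the layer list, the warmup/repeat counts, and the use of parameter size as the memory requirement are all placeholder choices) might look like this:

```python
import time
import torch

def profile_layers(layers, sample_input, device="cuda", warmup=3, repeats=10):
    """Measure per-layer execution time, activation size, and memory need.

    `layers` is assumed to be an ordered list of nn.Module shards. The real
    EdgeShard profiler additionally distinguishes the prefill stage from the
    autoregressive stage and loads layers dynamically when the full model
    does not fit on the profiling device.
    """
    traces = []
    x = sample_input.to(device)
    for layer in layers:
        layer = layer.to(device)
        with torch.no_grad():
            for _ in range(warmup):                    # warm up kernels and caches
                _ = layer(x)
            if device.startswith("cuda"):
                torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(repeats):
                out = layer(x)
            if device.startswith("cuda"):
                torch.cuda.synchronize()
            elapsed = (time.perf_counter() - start) / repeats

        traces.append({
            "exec_time_s": elapsed,                                   # computation time on this device
            "activation_bytes": out.element_size() * out.nelement(),  # data sent to the next shard
            "weight_bytes": sum(p.element_size() * p.nelement()
                                for p in layer.parameters()),         # rough memory requirement
        })
        x = out                       # feed this layer's output to the next layer
        layer = layer.to("cpu")       # release accelerator memory before the next layer
    return traces
```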
4.2.2. Scheduling Optimization Stage
At this stage, an intelligent scheduler uses the profiled information to determine an optimal deployment strategy. This strategy involves:
- Device Selection: Deciding which subset of available computing devices (edge devices and cloud servers) will participate in the inference task.
- Model Partition: Dividing the LLM into multiple shards. The partitioning is layer-wise, meaning specific layers are grouped together to form a shard.
- Shard Allocation: Assigning each model shard to a selected device.

The scheduling optimization considers:

- Heterogeneous resources: The varying computational capabilities and memory budgets of different devices.
- Memory budget of devices: Ensuring that the assigned layers do not exceed a device's memory capacity.
- Privacy constraint: If applicable, specific layers might need to remain on certain devices for privacy reasons.

This stage formulates the problem as a joint device selection and model partition problem and solves it using a dynamic programming algorithm to optimize for either inference latency or throughput.
4.2.3. Collaborative Inference Stage
Once the LLM model partition and allocation strategy are determined, the selected devices execute the inference collaboratively.
- KV Cache Pre-allocation: Memory space is pre-allocated on each participating device for Key-Value (KV) caches to optimize autoregressive generation.
- Two Cases for Collaborative Inference:
  - Sequential Inference: Devices process their allocated model shards in sequence.
    - Description: If an LLM is partitioned into shards allocated to devices D1, D2, D3, and so on, then D1 processes the input and sends its intermediate activations/outputs to D2, which processes them and transmits its outputs to D3, and so forth.
    - Suitability: Ideal for serving a single user in scenarios like smart homes, where minimizing the latency for one request is paramount.
    - Resource Utilization: Not resource-efficient, as devices other than the currently processing one remain idle.
  - Pipeline Parallel Inference: Input data is split into micro-batches and processed in a pipeline fashion to improve resource utilization.
    - Description: The first device processes micro-batch B1 and passes its output downstream, then immediately starts micro-batch B2 while the next device processes B1. This overlapping execution keeps all devices busy.
    - Suitability: Maximizes throughput by keeping devices utilized concurrently.
    - Inspiration: Similar to techniques used in cloud servers, such as Gpipe [17] and PipeDream [18].

The following figure (Figure 4 from the original paper) illustrates collaborative LLM inference in both sequential and pipeline parallel modes:
Fig. 4. Collaborative LLM inference
4.2.4. System Model
The system considers an LLM with $N$ layers and a network of $M$ heterogeneous edge devices and cloud servers.

- $N$: Total number of layers in the LLM.
- $O_i$: Size of the activations (output) of layer $i$, for $i = 0, \dots, N-1$.
- $Req_i$: Memory consumption required for layer $i$.
- $M$: Total number of computing devices.
- $Mem_j$: Memory budget of device $j$, for $j = 0, \dots, M-1$.
- $B_{k,j}$: Bandwidth between device $k$ and device $j$.
- Source Node: Device 0 is designated as the source node where input tokens originate.
The main notations used in the paper are listed in Table II. The following are the results from [Table II] of the original paper:

| Symbol | Descriptions |
|---|---|
| $X_{i,j}$ | binary variable, whether layer $i$ of the model is allocated to device $j$ |
| $t^{comp}_{i,j}$ | computation time of layer $i$ on device $j$ |
| $t^{comp}_{i,m,j}$ | computation time of layer $i$ to layer $m$ on device $j$ |
| $t^{comm}_{i,k,j}$ | communication time to transmit the activations of layer $i-1$ from device $k$ to device $j$ |
| $DP(i, j)$ | minimal total execution time of the first $i$ layers if layer $i$ is allocated to device $j$ |
| $g(i, S, k)$ | processing time of the slowest node to process the first $i$ layers with device set $S$, device $k$ being the last node used |
4.2.5. Optimize LLM inference latency
Problem Formulation
The goal is to minimize the total inference latency $T_{total}$. The allocation strategy is defined by a binary variable $X_{i,j}$:

- $X_{i,j} = 1$ if layer $i$ is allocated to device $j$.
- $X_{i,j} = 0$ otherwise.
Each layer must be allocated to exactly one device:
$
\sum_{j=0}^{M-1} X_{i,j} = 1, \forall i
$
The communication time to transmit the activations of layer $i-1$ from device $k$ to device $j$, denoted as $t^{comm}_{i,k,j}$, is calculated as:
$
t^{comm}_{i,k,j} = \frac{O_{i-1}}{B_{k,j}}
$
where $O_{i-1}$ is the size of the activations of layer $i-1$, and $B_{k,j}$ is the bandwidth between device $k$ and device $j$. If layers $i-1$ and $i$ are on the same node, the communication time is zero.
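For intuition, a worked example with assumed numbers (not from the paper): if layer $i-1$ produces 4 MB of activations and the link from device $k$ to device $j$ offers 50 Mbps, then
$
t^{comm}_{i,k,j} = \frac{4\ \text{MB} \times 8\ \text{bits/byte}}{50\ \text{Mbps}} = \frac{32\ \text{Mbit}}{50\ \text{Mbps}} \approx 0.64\ \text{s},
$
which is why low-bandwidth links can dominate placement decisions at the edge.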
The total inference time $T_{total}$ is the sum of the computation times of all layers and the communication times between consecutive layers placed on different devices:
$
T_{total} = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} X_{i,j} \, t^{comp}_{i,j} + \sum_{i=1}^{N-1} \sum_{k=0}^{M-1} \sum_{j=0}^{M-1} X_{i-1,k} \, X_{i,j} \, t^{comm}_{i,k,j}
$
where $t^{comp}_{i,j}$ is the computation time of layer $i$ on device $j$.
The problem of minimizing LLM inference latency is formulated as:
$
\min_{X} \; T_{total}
$
subject to the following constraints:

- Privacy Constraint: The first layer must be allocated to the source node (device 0), i.e., $X_{0,0} = 1$, to avoid transmitting raw input data.
- Memory Constraint (Eq. (5) in the paper): The total memory required by all layers allocated to device $j$ cannot exceed its memory budget:
$
\sum_{i=0}^{N-1} X_{i,j} \, Req_i \leq Mem_j, \quad \forall j
$
where $Req_i$ is the memory consumption of layer $i$, and $Mem_j$ is the memory budget of device $j$.
Solution: Dynamic Programming Algorithm
A dynamic programming (DP) algorithm is designed to solve this problem by leveraging its optimal sub-problem property.
Let $DP(i, j)$ denote the minimal total execution time of the first $i$ layers, given that layer $i$ is allocated to device $j$.

The state transition equation (Eq. (6) in the paper) is formulated as:
$
DP(i, j) = \min_{k} \left\{ DP(i-1, k) + t^{comm}_{i,k,j} + t^{comp}_{i,j} \right\}
$
This equation gives the minimal execution time for the first $i$ layers ending with layer $i$ on device $j$, by considering all possible devices $k$ that could host layer $i-1$; the term $DP(i-1, k)$ is the minimal execution time up to layer $i-1$ ending on device $k$.

For the last layer $N-1$, an additional communication time to send the generated token back to the source node (device 0) is included, i.e., a return term from device $j$ to device 0 is added to the expression above. Eq. (6) as printed in the paper is the simplified recurrence shown here; the special case for the last layer is described in the accompanying text rather than written out.

The initialization for the first layer (layer 0) follows from the privacy constraint ($X_{0,0} = 1$):
$
DP(0, 0) = t^{comp}_{0,0}, \qquad DP(0, j) = \infty \ \ \text{for} \ j \neq 0
$
The minimal total execution time for the entire LLM is then found by taking the minimum across all devices $j$:
$
T^{*}_{total} = \min_{j} DP(N-1, j)
$
The optimal allocation strategy can be reconstructed by backtracking the choice made at each step of the DP algorithm.
The algorithm to find the optimal LLM partition and allocation strategy for minimizing inference latency (Algorithm 1 in the paper) is presented as follows:
Algorithm 1: Joint device selection and LLM partition for optimizing latency
Input: An LLM model; computing devices $0, \dots, M-1$; profiled traces; bandwidth $B$
Output: the device selection and LLM partition strategy $R$

// initialization
1 Initialize the DP table $DP(i, j) = \infty$ and the choice table $choice(i, j)$ to record the strategy;
2 Enforce the first layer to be allocated to node 0 by $DP(0, 0) = t^{comp}_{0,0}$ and $DP(0, j) = \infty$ for $j \neq 0$;
// fill in the DP table
3 for $i = 1$ to $N-1$ do
4   for $j = 0$ to $M-1$ do
5     if $Mem_j \leq Req_i$ then // device j's memory budget is too small for layer i
6       continue;
7     end
8     else
9       for $k = 0$ to $M-1$ do
10        Calculate the total execution time by Eq. (6) and assign it to $t_{total}$: $t_{total} = DP(i-1, k) + t^{comm}_{i,k,j} + t^{comp}_{i,j}$ (plus the return-to-source term when $i = N-1$);
11        if $t_{total} < DP(i, j)$ then
12          Update $DP(i, j)$ by assigning $DP(i, j) = t_{total}$;
13          Update memory $Mem_j$;
14          Record the allocation plan $choice(i, j) = k$;
15        end
16      end
17    end
18  end
19 end
// backtrace for allocation strategy
20 Initialize the optimal strategy $R$;
21 Find the last selected node $j^{*} = \arg\min_{j} DP(N-1, j)$;
22 Add $(N-1, j^{*})$ to $R$;
23 for $i = N-1$ down to 1 do
24  Find the previous node $k = choice(i, j^{*})$;
25  Add $(i-1, k)$ to $R$;
26  Update $j^{*} = k$;
27 end
28 Reverse $R$;
29 return $R$.

Two details of the pseudocode deserve a note. First, the memory handling: line 5 only checks that device $j$ can hold layer $i$ in isolation, whereas the memory constraint of Eq. (5) bounds the total memory of all layers assigned to device $j$. The update on line 13 is most consistently read as $Mem_j$ tracking the remaining (rather than total) memory of device $j$ as layers are assigned along the current path; since the DP state $DP(i, j)$ does not itself carry memory usage, the cumulative constraint is only enforced approximately by the pseudocode as written. Second, the backtrace (lines 20-29) is easiest to read as filling $R$ with (layer, device) pairs from layer $N-1$ back to layer 0 by following the choice table and then reversing, so that the returned strategy lists the hosting device of each layer from first to last.
**Computational Complexity**: The complexity of Algorithm 1 is $O(N M^2)$ (the three nested loops over layers, candidate devices, and predecessor devices), where $N$ is the number of layers and $M$ is the number of devices.
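The following Python sketch reimplements the latency-oriented DP in the spirit of Algorithm 1; it is not the authors' code. The per-device memory bookkeeping follows the "remaining memory" reading of line 13 discussed above and is therefore approximate, and the optional `t_back` argument models the return of the generated token to the source node.

```python
import math

def optimize_latency(N, M, t_comp, act_size, req, mem, bw, t_back=None):
    """Joint device selection and layer partition minimizing inference latency.

    t_comp[i][j]: computation time of layer i on device j
    act_size[i]:  size of the activations produced by layer i (bytes)
    req[i]:       memory requirement of layer i
    mem[j]:       memory budget of device j
    bw[k][j]:     bandwidth from device k to device j (bytes/s)
    t_back[j]:    optional time to send the generated token from device j
                  back to the source node (device 0)
    """
    INF = math.inf

    def t_comm(i, k, j):  # transmit activations of layer i-1 from device k to j
        return 0.0 if k == j else act_size[i - 1] / bw[k][j]

    DP = [[INF] * M for _ in range(N)]       # DP[i][j]: best time, layer i on device j
    choice = [[-1] * M for _ in range(N)]    # choice[i][j]: device hosting layer i-1
    used = [[0.0] * M for _ in range(N)]     # memory consumed on device j along the path

    DP[0][0] = t_comp[0][0]                  # privacy constraint: layer 0 on device 0
    used[0][0] = req[0]

    for i in range(1, N):
        for j in range(M):
            for k in range(M):
                if DP[i - 1][k] == INF:
                    continue
                mem_used = used[i - 1][j] if k == j else 0.0
                if mem_used + req[i] > mem[j]:          # device j would run out of memory
                    continue
                total = DP[i - 1][k] + t_comm(i, k, j) + t_comp[i][j]
                if i == N - 1 and t_back is not None:
                    total += t_back[j]                   # return result to the source node
                if total < DP[i][j]:
                    DP[i][j] = total
                    choice[i][j] = k
                    used[i][j] = mem_used + req[i]

    best_j = min(range(M), key=lambda d: DP[N - 1][d])
    best_time = DP[N - 1][best_j]

    # backtrace: plan[i] is the device hosting layer i
    plan, j = [0] * N, best_j
    for i in range(N - 1, 0, -1):
        plan[i] = j
        j = choice[i][j]
    plan[0] = j
    return best_time, plan
```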
### 4.2.6. Optimize LLM inference throughput
#### Problem Formulation
For optimizing `throughput`, `pipeline parallelism` is adopted. The goal is to maximize throughput, which is equivalent to minimizing the latency of the slowest device. Communication and computation times can be overlapped.
For a selected set of devices $S$, the latency of a device $j$ that hosts layers $i$ to $m$ is determined by the maximum of its computation time and its incoming communication time for that segment:
$
T_j = \max \left( t^{comp}_{i,m,j}, \; t^{comm}_{i,k,j} \right)
$
where $t^{comp}_{i,m,j}$ is the computation time of layers $i$ to $m$ on device $j$, and $t^{comm}_{i,k,j}$ is the communication time to transmit the activations of layer $i-1$ from the preceding device $k$ to device $j$.
The problem of maximizing inference throughput is then formulated as minimizing the maximum latency across all selected devices:
$
\min \; \max_{j \in S} T_j
$
#### Solution: Dynamic Programming Algorithm
Similar to latency optimization, this problem also exhibits `optimal sub-problem property`, solvable using `dynamic programming`.
Let `g(i, S, k)` denote the minimum time for the slowest node to process the first $i$ layers, with $S$ the set of used devices and device $k$ being the last node used ($k \in S$).
The state transition equation for `g(m, S', j)` (processing the first $m$ layers with device set $S' = S \cup \{j\}$, where $j$ is the last device and $j \notin S$) is:
$
g(m, S \cup \{j\}, j) = \min_{i < m, \, k \in S} \; \max \left( g(i, S, k), \; t^{comm}_{i+1,k,j}, \; t^{comp}_{i+1,m,j} \right)
$
This equation means that to reach state `(m, S', j)`, we consider all previous states `(i, S, k)` and pick the one that minimizes the maximum of: (1) the slowest-stage time up to layer $i$ using set $S$ and ending with device $k$, (2) the communication time from $k$ to $j$, and (3) the computation time on $j$ for layers $i+1$ to $m$.
Constraints for state transition:
1. **Memory Constraint**: The memory required for layers $i+1$ to $m$ on device $j$ must not exceed its budget:
$
\sum_{l=i+1}^{m} Req_l \leq Mem_j
$
where the left-hand side is the sum of the memory requirements of layers $i+1$ through $m$.
2. **Privacy Constraint**: As in the latency case, the first layer is placed on device 0:
$
g(1, \{0\}, 0) = t^{comp}_{0,0}
$
This initializes the state for the first layer (layer 0) using device 0: the index 1 indicates that one layer (layer 0) has been processed, with device set `{0}` and `0` as the last device.
The final optimal solution is the minimum over all possible device sets $S$ and last devices $k$, i.e., $\min_{S, k} g(N, S, k)$.
The algorithm for optimizing throughput (Algorithm 2 in the paper) is provided as follows:
Algorithm 2: Joint device selection and LLM partition for optimizing throughput
Input: An LLM model; computing devices $0, \dots, M-1$; profiled traces; bandwidth $B$
Output: the device selection and LLM partition strategy $R$

// initialization
1 Initialize the DP table $g(i, S, k) = \infty$ and the choice table to record the strategy;
2 Enforce the first layer to be allocated to node 0 by $g(1, \{0\}, 0) = t^{comp}_{0,0}$;
// fill in the DP table
3 for $i = 1$ to $N-1$ do // i is the upper layer index of the previous block
4   for each subset $S \subseteq \{0, \dots, M-1\}$ do // S is the set of devices used for the first i layers
5     for each last node $k \in S$ do // k is the last device in set S
6       for $m = i+1$ to $N$ do // m is the upper layer index of the current block
7         for each $j \notin S$ do // j is a new device hosting layers i+1 to m
8           if $\sum_{l=i+1}^{m} Req_l > Mem_j$ then // layers i+1 to m exceed device j's memory
9             continue;
10          end
11          else
12            Get $S' = S \cup \{j\}$ by adding node $j$ to the selected device set;
13            Calculate the current maximum execution time via Eq. (11): $T = \max\left(g(i, S, k), \, t^{comm}_{i+1,k,j}, \, t^{comp}_{i+1,m,j}\right)$;
14          end
15          if $T < g(m, S', j)$ then // the new path yields a smaller maximum stage latency for this state
16            $g(m, S', j) = T$;
17            Record the current strategy $choice(m, S', j) = (i, S, k)$;
18          end
19        end
20      end
21    end
22  end
23 end
// backtrace for optimal allocation
24 Initialize the optimal strategy $R$;
25 Find the selected device set $S^{*}$ and the last selected node $k^{*}$ by $(S^{*}, k^{*}) = \arg\min_{S, k} g(N, S, k)$;
26 Initialize $m = N$;
27 while $choice(m, S^{*}, k^{*})$ exists do
28   $(i, S, k) = choice(m, S^{*}, k^{*})$; // retrieve the previous state from which the current state was reached
29   Add (layers $i+1$ to $m$ on device $k^{*}$) to $R$;
30   Update $m = i$, $S^{*} = S$, and $k^{*} = k$;
31 end
32 return $R$.
**Computational Complexity**: Following the five nested loops above, the complexity of Algorithm 2 is on the order of $N^2 M^2 2^{M}$, where $N$ is the number of layers and $M$ is the number of devices. This is significantly higher than that of Algorithm 1 because the state space enumerates all subsets $S$ of the devices.
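A compact Python sketch of this subset DP, again a reimplementation in the spirit of Algorithm 2 rather than the authors' code: it encodes the device set $S$ as a bitmask and, slightly more generally than the initialization described above, lets the source device host a whole prefix block of layers. The block-level inputs (`block_comp`, `block_req`) are assumed to be precomputed from the profiled traces.

```python
import math

def optimize_throughput(N, M, block_comp, block_req, act_size, mem, bw):
    """Joint device selection and layer partition maximizing pipeline throughput.

    Throughput is maximized by minimizing the latency of the slowest pipeline
    stage (its computation or its incoming communication), mirroring the max
    term in Eq. (11) above.

    block_comp[i][m][j]: computation time of layers i..m-1 on device j
    block_req[i][m]:     total memory requirement of layers i..m-1
    act_size[i]:         size of the activations produced by layer i
    mem[j], bw[k][j]:    memory budget and pairwise bandwidth
    """
    INF = math.inf
    # g[i][S][k]: minimal slowest-stage time when the first i layers are placed,
    # S is the bitmask of used devices, and device k hosts the block ending at
    # layer i-1.
    g = [[[INF] * M for _ in range(1 << M)] for _ in range(N + 1)]
    parent = {}

    for m in range(1, N + 1):                 # first block: layers 0..m-1 on device 0
        if block_req[0][m] <= mem[0]:
            g[m][1][0] = block_comp[0][m][0]

    for i in range(1, N):                     # layers 0..i-1 already placed
        for S in range(1 << M):
            for k in range(M):
                if not (S >> k) & 1 or g[i][S][k] == INF:
                    continue
                for j in range(M):            # next block goes to a new device j
                    if (S >> j) & 1:
                        continue
                    t_comm = act_size[i - 1] / bw[k][j]
                    for m in range(i + 1, N + 1):
                        if block_req[i][m] > mem[j]:
                            break             # longer blocks only need more memory
                        stage = max(g[i][S][k], t_comm, block_comp[i][m][j])
                        S2 = S | (1 << j)
                        if stage < g[m][S2][j]:
                            g[m][S2][j] = stage
                            parent[(m, S2, j)] = (i, S, k)

    # best complete placement: all N layers placed, any device set / last device
    best_time, best_S, best_k = min(
        (g[N][S][k], S, k) for S in range(1 << M) for k in range(M)
    )
    blocks, m, S, k = [], N, best_S, best_k
    while (m, S, k) in parent:
        i, S_prev, k_prev = parent[(m, S, k)]
        blocks.append((i, m, k))              # layers i..m-1 run on device k
        m, S, k = i, S_prev, k_prev
    blocks.append((0, m, k))                  # the leading block on device 0
    return best_time, list(reversed(blocks))
```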
### 4.2.7. Pipeline Execution Optimization
The ideal `pipeline parallel inference` assumes no device idleness. However, `decoder-based LLM inference` has an `autoregressive nature`, meaning token generation is sequential and relies on previously generated tokens. This can lead to `bubbles` (idle time) in the pipeline if a device waits for a complete `micro-batch` to finish all its token generations before starting the next.
* **EdgeShard-Bubbles**: The traditional pipeline approach where devices wait for full `micro-batches`. As shown in Fig. 5(a), there are periods where devices are idle, leading to resource underutilization.
* **EdgeShard-No-bubbles**: This proposed optimization aims to reduce these bubbles by allowing `immediate token generation` without waiting for the completion of all `micro-batches` in an iteration. As illustrated in Fig. 5(b), after the `prefill stage` (`P1`) of the first `micro-batch` ends, Device 1 immediately starts the `token generation` (`G1A`) for that `micro-batch`. Similarly, upon completion of `G1A`, Device 1 proceeds to the next iteration of token generation (`G1B`). This approach mitigates device idle time, enhancing resource utilization and improving `throughput`.
The following figure (Figure 5 from the original paper) contrasts `bubbles` and `no-bubbles` pipeline execution:

Fig. 5. This figure from the paper contrasts pipeline execution with `Bubbles` (top) and `No-bubbles` (bottom) for collaborative LLM inference. In the `Bubbles` approach, devices wait for the completion of entire micro-batches before proceeding, leading to idle times. In the `No-bubbles` approach, devices immediately move to the next token generation step of a micro-batch, reducing idle periods and improving resource utilization.
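To see why removing the synchronization barrier helps, the following toy simulation (illustrative assumptions: fixed per-pass stage times, in-order dispatch of micro-batches, communication folded into the stage times) estimates the finish time of the two schedules; the numbers in the example at the bottom are arbitrary.

```python
def pipeline_makespan(stage_times, n_micro, n_tokens, synchronized):
    """Estimate the finish time of a micro-batched pipeline.

    Each micro-batch performs one prefill pass and `n_tokens` generation
    passes over all stages; stage_times[s] is the per-pass time of stage s.
    synchronized=True  -> 'bubble' schedule: a new round of passes starts only
                          after every micro-batch has finished the current one.
    synchronized=False -> 'no-bubble' schedule: each micro-batch re-enters the
                          pipeline as soon as its previous pass leaves the last
                          stage.
    """
    n_stages = len(stage_times)
    stage_free = [0.0] * n_stages        # when each stage becomes free
    ready = [0.0] * n_micro              # when each micro-batch may start its next pass
    finish = [0.0] * n_micro

    for _ in range(1 + n_tokens):        # pass 0 = prefill, then n_tokens decode passes
        for b in range(n_micro):
            t = ready[b]
            for s in range(n_stages):
                start = max(t, stage_free[s])
                stage_free[s] = start + stage_times[s]
                t = stage_free[s]
            finish[b] = t
            if not synchronized:
                ready[b] = t             # proceed immediately to the next pass
        if synchronized:
            barrier = max(finish)        # wait for the whole round: the 'bubble'
            ready = [barrier] * n_micro
    return max(finish)


# Example with arbitrary numbers: 2 stages (3 ms and 2 ms per pass),
# 4 micro-batches, 8 generated tokens. The synchronized ('bubble') schedule
# finishes later than the no-bubble one.
if __name__ == "__main__":
    print(pipeline_makespan([3.0, 2.0], 4, 8, synchronized=True))   # 126.0
    print(pipeline_makespan([3.0, 2.0], 4, 8, synchronized=False))  # 110.0
```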
# 5. Experimental Setup
## 5.1. Datasets
* **Benchmark Tasks**: The experiments focus on `text generation`, a core application of LLMs.
* **LLM Models**: `Llama2 serial models` are used: `Llama2-7B`, `Llama2-13B`, and `Llama2-70B`. These models, released by Meta in 2023, are popular and powerful open-source LLMs.
* **Dataset**: The `WikiText-2 dataset [21]` from HuggingFace is utilized.
* **Characteristics**: A collection of high-quality articles from Wikipedia, primarily used for language modeling tasks.
* **Input/Output Configuration**: A subset of samples is extracted with an `input token length of 32` and `generation of 96 tokens`.
* **Precision**: All experiments use `full-precision model inference`, avoiding `quantization` to isolate the performance benefits of `EdgeShard`'s collaboration strategy.
## 5.2. Evaluation Metrics
The performance of `EdgeShard` and baseline methods are evaluated using two key metrics:
1. **Latency**:
* **Conceptual Definition**: Latency, in this context, refers to the average time taken to generate a single token during the LLM inference process. It measures the responsiveness of the system. Lower latency indicates faster responses, which is critical for real-time interactive applications.
* **Unit**: Milliseconds per token (`milliseconds/token`).
* **Mathematical Formula**: While not explicitly provided in the paper, average latency per token is typically calculated as:
$
\text{Average Latency} = \frac{\text{Total Inference Time}}{\text{Total Number of Generated Tokens}}
$
Where `Total Inference Time` includes both computation and communication time for all layers to generate all tokens, and `Total Number of Generated Tokens` is the number of new tokens produced by the model (e.g., 96 tokens in their setup).
2. **Throughput**:
* **Conceptual Definition**: Throughput measures the rate at which the system can process tokens, specifically the number of tokens generated per second. It indicates the system's capacity to handle workloads, particularly useful for batch processing or serving multiple users concurrently. Higher throughput means more tokens can be generated in a given time period.
* **Unit**: Tokens per second (`tokens/second`).
* **Mathematical Formula**: While not explicitly provided, throughput is generally calculated as:
$
\text{Throughput} = \frac{\text{Total Number of Generated Tokens}}{\text{Total Inference Time}}
$
This is the inverse of latency when considering a single batch/request. For multiple batches/requests, it's the total tokens generated across all batches divided by the total time. The paper mentions `maximum batch size`, implying throughput is measured for the maximum load the system can handle.
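A trivial helper reflecting the two definitions above (illustrative only; with batching, the token count is summed over all sequences in the batch):

```python
def inference_metrics(total_time_s, generated_tokens):
    """Return (average latency in ms/token, throughput in tokens/s)."""
    latency_ms = 1000.0 * total_time_s / generated_tokens
    throughput = generated_tokens / total_time_s
    return latency_ms, throughput

# e.g. 96 tokens generated in 4.0 s -> (41.67 ms/token, 24.0 tokens/s)
```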
## 5.3. Baselines
`EdgeShard`'s performance is compared against several baseline methods, designed to cover different deployment strategies in edge and cloud environments. `Cloud-only` is explicitly excluded as a baseline due to `privacy concerns` regarding raw input transmission.
1. **Edge-Solo**:
* **Description**: In this baseline, the entire LLM is deployed and inferred locally on a `single edge device` without any model partitioning or collaboration.
* **Representativeness**: Represents the simplest, fully on-device deployment. It is limited by the memory and computational capacity of that single edge device.
2. **Cloud-Edge-Even**:
* **Description**: The LLM is `evenly partitioned` into two parts. One part is deployed on an `edge device`, and the other on a `cloud server`. The partitioning is fixed and doesn't adapt to device capabilities.
* **Representativeness**: Represents a naive `cloud-edge collaboration` strategy, splitting the model equally without optimization.
3. **Cloud-Edge-Opt**:
* **Description**: The LLM is partitioned into two shards, with one on an `edge device` and another on a `cloud server`. However, unlike `Cloud-Edge-Even`, the `partitioning strategy` is optimized using the `dynamic programming algorithm` proposed in the paper (Algorithms 1 or 2), but constrained to only two devices (one edge, one cloud).
* **Representativeness**: Represents an optimized `vertical cloud-edge collaboration` and serves as a direct comparison for how `EdgeShard`'s algorithm performs when restricted to a two-device scenario.
## 5.4. Testbed
The experiments are conducted on a `heterogeneous physical testbed` designed to simulate a `collaborative edge computing` environment.
* **Devices**:
* **Edge Devices**: 12 `Jetson AGX Orin` units and 2 `Jetson Orin NX` units.
* **Cloud Server**: 1 server equipped with an `RTX 3090 GPU`.
* **Total Devices**: 15 devices in total.
The specifications of these devices are detailed in Table III:
The following are the results from [Table III] of the original paper:
| Category | Device | Memory | AI Performance |
|---|---|---|---|
| Edge Device | Jetson AGX Orin | 32GB | 3.33 TFLOPS |
| Edge Device | Jetson Orin NX | 16GB | 1.88 TFLOPS |
| Cloud Server | RTX 3090 | 24GB | 36 TFLOPS |
* **Network Configuration**:
* All devices are interconnected via a `router` and a `switch`.
* The base bandwidth between any two devices is `1000Mbps`.
* The `Linux TC tool [20]` is used to programmatically vary `network bandwidth` and `communication latency` between devices, allowing for testing under diverse network conditions (a minimal shaping sketch follows this list).
* **Specific Network Settings for Overall Evaluation**:
* Source node: `AGX Orin`.
* Bandwidth between source node and cloud server: `1Mbps`.
* Bandwidth between other computing devices: `50Mbps` with a `20% variance`.
* **Batch Size**: For throughput tests, the `batch size` is set to the `maximum batch size` that the participating devices can support.
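The paper does not list its exact `tc` invocations; the sketch below shows one way bandwidth and delay could be shaped with the `netem` queueing discipline via Python (the interface name and the rate/delay values are placeholders, and the commands require root privileges).

```python
import subprocess

def shape_link(interface="eth0", rate="50mbit", delay="20ms"):
    """Cap the bandwidth and add a fixed delay on one network interface."""
    # drop any existing root qdisc first (ignore the error if none is set)
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=False)
    # attach netem with a rate limit and an added delay
    subprocess.run(["tc", "qdisc", "add", "dev", interface, "root",
                    "netem", "rate", rate, "delay", delay], check=True)

def reset_link(interface="eth0"):
    """Remove the shaping so the link returns to its native speed."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=False)
```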
The following figure (Figure 6 from the original paper) shows the physical testbed:

Fig. 6. Our testbed has heterogeneous edge devices and cloud server. Their specifications are shown in Table III.
# 6. Results & Analysis
## 6.1. Core Results Analysis
The experiments evaluate `EdgeShard` against baseline methods across different `Llama2 models` under varying network conditions and source node capabilities, focusing on `inference latency` and `throughput`.
### 6.1.1. Overall Evaluation (Fixed Network Conditions)
The initial evaluation sets the source node as an `AGX Orin`, with 1Mbps bandwidth between the source and cloud, and 50Mbps (20% variance) between other devices. The batch size is set to the maximum supported by participating devices.
The following are the results from [Table IV] of the original paper:
| Method | Llama2-7B Latency (ms/token) | Llama2-7B Throughput (tokens/s) | Llama2-13B Latency (ms/token) | Llama2-13B Throughput (tokens/s) | Llama2-70B Latency (ms/token) | Llama2-70B Throughput (tokens/s) |
|---|---|---|---|---|---|---|
| Edge-Solo | 140.34 | 24.36 | OOM | OOM | OOM | OOM |
| Cloud-Edge-Even | 227.35 | 7.56 | 319.44 | 4.68 | OOM | OOM |
| Cloud-Edge-Opt | 140.34 | 24.36 | 243.45 | 4.74 | OOM | OOM |
| EdgeShard | 75.88 | 52.45 | 173.43 | 10.45 | 3086.43 | 1.25 |
**Key Observations from Table IV:**
* **Deployment Capability for Large Models**: For `Llama2-70B`, `Edge-Solo`, `Cloud-Edge-Even`, and `Cloud-Edge-Opt` all suffer from `Out-Of-Memory (OOM)` errors. This is because Llama2-70B requires 280GB memory, far exceeding individual device capacities or simple two-device splits. `EdgeShard`, however, successfully deploys and infers `Llama2-70B` (with 3086.43 ms latency and 1.25 tokens/s throughput) by splitting it into shards across multiple devices. This highlights EdgeShard's ability to tackle the memory wall for very large LLMs.
* **Latency Reduction for Smaller Models**: For `Llama2-7B`, `EdgeShard` achieves `75.88 ms` latency, which is approximately `1.85x faster` than `Edge-Solo` and `Cloud-Edge-Opt` (both `140.34 ms`), and about `3x faster` than `Cloud-Edge-Even` (`227.35 ms`).
* **Throughput Improvement for Smaller Models**: For `Llama2-7B`, `EdgeShard` achieves `52.45 tokens/s` throughput, which is around `2.2x higher` than `Edge-Solo` and `Cloud-Edge-Opt` (`24.36 tokens/s`), and about `7x higher` than `Cloud-Edge-Even` (`7.56 tokens/s`).
* **Similar improvements are observed for `Llama2-13B`**: `EdgeShard` achieves `45.7%` and `28.8%` lower latency, and `2.23x` and `2.2x` higher throughput compared to `Cloud-Edge-Even` and `Cloud-Edge-Opt`, respectively.
* **Impact of Limited Cloud Bandwidth**: For `Llama2-7B`, `Cloud-Edge-Opt` performs identically to `Edge-Solo`. This is because the `1Mbps bandwidth` between the source node and the cloud server is severely limited. The `optimal strategy` for `Cloud-Edge-Opt` under such constraint is to perform local execution, effectively behaving like `Edge-Solo` to avoid high communication costs.
### 6.1.2. Effects of Bandwidth
This experiment varies the bandwidth between the source node (AGX Orin) and the cloud server from 1Mbps to 50Mbps. For `Llama2-13B` and `Llama2-70B`, baselines `Edge-Solo`, `Cloud-Edge-Even`, and `Cloud-Edge-Opt` are not applicable due to `OOM` or similar performance as `Edge-Solo` at low bandwidths. `EdgeShard` is compared with its variant `EdgeShard-Even` (evenly partitioned across devices).
The following figure (Figure 7 from the original paper) shows the impact of network bandwidth on the latency of collaborative LLM inference:

Fig. 7. Impact of Network Bandwidth to Latency of Collaborative LLMs inference
**Observations on Latency (Fig. 7):**
* **Bandwidth Sensitivity**: Except for `Edge-Solo` (which is unaffected by cloud bandwidth as it's local), the latency of collaboration-based methods (`Cloud-Edge-Even`, `Cloud-Edge-Opt`, `EdgeShard`) decreases as bandwidth increases. This is due to reduced data transmission time.
* **Bandwidth Saturation**: A `dramatic latency reduction` is observed when bandwidth increases from `1Mbps to 10Mbps`. Beyond `10Mbps` (e.g., up to `50Mbps`), the latency reduction becomes minor, suggesting that `computation time` becomes the `bottleneck` as communication costs become less dominant.
* **Cloud-Edge vs. Edge-Solo**: Once the bandwidth to the cloud is high enough, the `cloud-edge collaboration methods` (`Cloud-Edge-Even`, `Cloud-Edge-Opt`) start outperforming `Edge-Solo`, because the powerful cloud server can accelerate computation enough to outweigh the communication cost.
* **EdgeShard vs. Cloud-Edge-Opt**: Interestingly, for `Llama2-7B` and `Llama2-13B`, at the higher bandwidths `EdgeShard` shows nearly identical latency performance to `Cloud-Edge-Opt`. The authors found that in these conditions, `EdgeShard`'s optimization algorithm converged to the same model partition and allocation policies as `Cloud-Edge-Opt`, effectively indicating that `Cloud-Edge-Opt` is a special case of `EdgeShard` (i.e., using only two devices when that is optimal).
* **EdgeShard-Even for Llama2-70B**: `EdgeShard` consistently outperforms `EdgeShard-Even` for `Llama2-70B`, demonstrating the benefit of adaptive partitioning even when most devices are homogeneous, especially with the inclusion of the much more powerful `RTX 3090` cloud server. The improvement is less pronounced due to the presence of 11 identical AGX Orin devices.
The following figure (Figure 8 from the original paper) shows the impact of bandwidth on the throughput of collaborative LLM inference:

Fig. 8. Impact of Bandwidth to Throughput of Collaborative LLMs Inference
**Observations on Throughput (Fig. 8):**
* **Similar Patterns**: Throughput generally follows similar patterns to latency, increasing with bandwidth for collaborative methods.
* **Memory-driven Throughput for Llama2-13B**: For `Llama2-13B`, `EdgeShard` shows a `2x higher throughput` than `Cloud-Edge-Opt` at `10Mbps`, despite similar latency at higher bandwidths. This is because `Cloud-Edge-Opt` (with only two devices) experiences high memory consumption (`95%-98%`) on the `RTX 3090` and `source node`, limiting it to a `maximum batch size of 4`. `EdgeShard`, by involving *several* edge devices, significantly reduces the memory consumption per device, allowing for a `larger batch size (8)`, thus boosting throughput.
* **EdgeShard-Even for Llama2-70B**: `EdgeShard-Even` shows stable throughput across bandwidths, as its fixed partitioning strategy does not change. `EdgeShard` offers a slight improvement over `EdgeShard-Even` for `Llama2-70B` due to adaptive resource utilization.
### 6.1.3. Effects of Source Node
This experiment investigates how the choice of the `source node` (where input tokens originate) impacts performance, comparing `AGX Orin` vs. `Orin NX` as the source, with cloud-source bandwidth fixed at `1Mbps`.
The following figure (Figure 9 from the original paper) shows the impact of the source node:

Fig. 9. Impact of Source Node
**Observations (Fig. 9):**
* **Memory Constraints with Weaker Source Node**: When the source node is `Orin NX` (which has lower memory, 16GB, compared to `AGX Orin`'s 32GB), `Edge-Solo` and `Cloud-Edge-Even` encounter `OOM` errors for `Llama2-7B`. This demonstrates that a single, less capable device cannot even host the full model or an evenly partitioned half.
* **Performance Gap in Cloud-Edge-Opt**: The performance difference between `AGX Orin` and `Orin NX` as source nodes is much more pronounced for `Cloud-Edge-Opt` (around `60ms latency gap`) than for `EdgeShard` (around `5ms latency gap`).
* This is because `Cloud-Edge-Opt`, being a two-device setup, tends to place more layers on the source node. If the source node is weaker (`Orin NX`), it becomes a significant bottleneck.
* **EdgeShard's Adaptability**: `EdgeShard` involves *more devices* and strategically places *fewer model layers* on the source node, effectively `filling the computation capacity gap` between heterogeneous source nodes. This leads to more stable performance regardless of the source node's individual capabilities.
* **Throughput Advantage**: Similarly, `Cloud-Edge-Opt` achieves about `6x higher throughput` with `AGX Orin` as the source than with `Orin NX`, whereas for `EdgeShard` the gap is only about `2x`. This further highlights `EdgeShard`'s ability to leverage the broader network's resources to compensate for weaker individual devices.
### 6.1.4. Effects of Pipeline Execution Strategy
This evaluation compares `EdgeShard-No-bubble` against `EdgeShard-Bubble`, with cloud-source bandwidth set to `1Mbps`.
**Observations (Fig. 10):**
* **EdgeShard-No-bubble Superiority**: The `EdgeShard-No-bubble` schedule consistently `outperforms EdgeShard-Bubble` across all partitioning methods for `Llama2-7B` and `Llama2-13B`.
* For `Llama2-7B`, `EdgeShard-No-bubble` improves throughput by `0.36 tokens/s` for `Cloud-Edge-Even` and `6.96 tokens/s` for `EdgeShard`. (For `Cloud-Edge-Opt`, the optimizer falls back to local execution at 1Mbps, so no pipeline is involved and both strategies yield the same throughput.)
* For `Llama2-13B`, improvements are `1.69 tokens/s` for `Cloud-Edge-Even`, `1.89 tokens/s` for `Cloud-Edge` (likely `Cloud-Edge-Opt`), and `5.21 tokens/s` for `EdgeShard`.
* **Reduced Idle Time**: `EdgeShard-No-bubble` is superior because it mitigates `device idle time` by allowing immediate token generation without waiting for all `micro-batches` in an iteration to complete. This effectively reduces `bubbles` in the pipeline and leads to higher `throughput`.
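A toy timing model makes the bubble argument concrete. It assumes equal-speed stages, a uniform per-micro-batch stage time, and at least as many micro-batches as stages; this is a simplified sketch of the idea, not the paper's actual scheduler.

```python
# Toy timing model for the two pipeline strategies. Assumes S equal-speed stages,
# M >= S micro-batches, and a uniform per-micro-batch stage time t. Illustrative only.

def bubble_time(stages: int, micro_batches: int, t: float, iterations: int) -> float:
    """All micro-batches of an iteration must drain before the next token starts."""
    per_iteration = (stages + micro_batches - 1) * t
    return iterations * per_iteration

def no_bubble_time(stages: int, micro_batches: int, t: float, iterations: int) -> float:
    """A micro-batch re-enters stage 1 as soon as its token is generated, so the
    pipeline stays full after the initial ramp-up (requires micro_batches >= stages)."""
    ramp_up = (stages - 1) * t
    return ramp_up + iterations * micro_batches * t

S, M, t, iters = 4, 4, 0.05, 100  # assumed values
print("with per-iteration barrier   :", bubble_time(S, M, t, iters), "s")
print("without per-iteration barrier:", no_bubble_time(S, M, t, iters), "s")

# The barrier version pays (S - 1) * t of idle "bubble" time on every decoding
# iteration, while the no-bubble schedule pays it only once, so throughput improves.
```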
## 6.2. Data Presentation (Tables)
The table has been transcribed in section 6.1.1.
## 6.3. Ablation Studies / Parameter Analysis
The paper presents several implicit ablation and parameter analyses:
* **Adaptive vs. Even Partitioning**: The comparison between `EdgeShard` and `EdgeShard-Even` for `Llama2-70B` (in "Effects of Bandwidth") acts as an ablation study for the adaptive partitioning strategy. `EdgeShard`'s superior performance demonstrates the value of its optimization algorithm over a simpler, fixed even split, especially when device heterogeneity is present.
* **Impact of Network Bandwidth**: The "Effects of Bandwidth" section analyzes a crucial network parameter. It shows that `EdgeShard` adapts to bandwidth changes, converging to `Cloud-Edge-Opt` strategies when bandwidth is abundant but outperforming it when bandwidth is constrained or when more devices are needed to handle memory/computation. This highlights the adaptability of the dynamic programming approach.
* **Impact of Source Node Capabilities**: The "Effects of Source Node" section implicitly ablates the contribution of a powerful source node. When the source node is weak (`Orin NX`), `EdgeShard`'s ability to distribute the workload across more devices becomes more critical, demonstrating its robustness against individual device limitations.
* **Pipeline Execution Strategy**: The comparison between `EdgeShard-Bubble` and `EdgeShard-No-bubble` directly evaluates the effectiveness of the proposed `no-bubbles` optimization. This is a clear ablation showing that reducing idle time in the pipeline significantly boosts throughput.
These analyses collectively demonstrate that `EdgeShard`'s `adaptive device selection and model partition` (driven by the DP algorithms) and `pipeline execution optimizations` are crucial for achieving superior performance in heterogeneous collaborative edge environments.
# 7. Conclusion & Reflections
## 7.1. Conclusion Summary
This work introduces `EdgeShard`, a pioneering framework for the efficient deployment and distributed inference of `Large Language Models (LLMs)` within a `collaborative edge computing` environment. By intelligently partitioning LLMs into `shards` and adaptively allocating them across heterogeneous edge devices and cloud servers, `EdgeShard` effectively addresses the challenges of `high latency`, `bandwidth costs`, and `memory limitations` inherent in cloud-centric or single-device edge deployments. The core of `EdgeShard` lies in its `dynamic programming algorithms`, which are designed to solve a `joint device selection and model partition problem`, optimizing specifically for either `inference latency` or `throughput`. Extensive experiments on a physical prototype with `Llama2 models` conclusively demonstrate that `EdgeShard` significantly outperforms baseline methods, achieving up to `50% latency reduction` and `2x throughput improvement`. The findings highlight `EdgeShard`'s adaptability to diverse network conditions and its ability to enable `LLM inference` on models that would otherwise be `out-of-memory` for traditional edge or simple cloud-edge setups.
## 7.2. Limitations & Future Work
The authors acknowledge several open issues and suggest future research directions:
* **Incentive Mechanisms**: The current framework assumes a cooperative environment where devices either belong to a single entity or are willing to share resources. For scenarios involving devices owned by different stakeholders, robust `incentive mechanisms` are needed to encourage resource sharing and participation in collaborative inference.
* **Batch Size Aware Optimization**: While `EdgeShard` implicitly benefits from larger batch sizes by allowing more memory due to sharding, the current `dynamic programming algorithm` does not explicitly incorporate the `batch size` as a direct optimization variable. Future work could integrate `batch size` into the optimization problem to achieve even greater throughput improvements.
* **Dynamic Re-partitioning**: The profiling and scheduling are currently an offline step. In highly dynamic edge environments, network conditions and device availability can change rapidly. Future work could explore `dynamic re-partitioning` or `adaptive rescheduling` mechanisms to cope with real-time changes.
* **Energy Efficiency**: As edge devices are often battery-powered, optimizing for `energy consumption` alongside latency and throughput could be a critical future direction, especially for long-term deployments.
* **Fault Tolerance**: In a distributed system with many heterogeneous devices, device failures or intermittent connectivity can occur. Investigating `fault-tolerant mechanisms` to ensure continuous LLM inference amidst device unreliability would be important.
## 7.3. Personal Insights & Critique
This paper presents a highly relevant and well-executed approach to a critical problem in the burgeoning field of `Edge AI`. The shift from monolithic cloud LLMs to distributed edge deployments is inevitable given the demands for `low latency`, `privacy`, and `bandwidth efficiency`. `EdgeShard` provides a solid foundational framework for this transition.
One of the paper's strengths is its `rigorous experimental validation` on a `physical heterogeneous testbed`, which adds significant credibility over simulation-based studies. The detailed analysis of `bandwidth effects`, `source node capabilities`, and `pipeline execution strategies` provides valuable insights into the practical challenges and opportunities of `collaborative edge computing`. The `no-bubbles` pipeline optimization is a clever detail that addresses a specific characteristic of `autoregressive LLM inference`, showcasing a deep understanding of the problem domain.
A minor critique, as noted in the methodology section, is the slight ambiguity in the pseudo-code for Algorithm 1 regarding how the memory constraint is enforced and how the corresponding allocation state is updated. While such bookkeeping is typically manageable in dynamic programming, the exact mechanics could have been spelled out more explicitly for someone implementing the algorithm solely from the paper.
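To make this critique concrete, one plausible reading of a memory-constrained layer-partition DP is sketched below. It fixes the device order, assigns contiguous shards, and charges a single communication hop between consecutive shards; all values and simplifications are the reviewer's own for illustration, not the paper's Algorithm 1.

```python
# Simplified sketch of a memory-constrained layer-to-device partition DP.
# Fixed device order, contiguous shards; illustrative values, not the paper's algorithm.
from functools import lru_cache

layer_flops   = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0]  # assumed per-layer compute cost
layer_mem_gb  = [0.5] * 6                        # assumed per-layer memory footprint
device_speed  = [1.0, 4.0]                       # assumed relative compute speeds
device_mem_gb = [2.0, 8.0]                       # assumed per-device memory budgets
comm_s        = 0.3                              # assumed per-hop activation transfer time

L, D = len(layer_flops), len(device_speed)

@lru_cache(maxsize=None)
def dp(i: int, k: int) -> float:
    """Min latency to place layers [0, i) on devices [0, k) as contiguous shards."""
    if i == 0:
        return 0.0
    if k == 0:
        return float("inf")
    best = dp(i, k - 1)                     # option: device k-1 receives no layers
    shard_mem = shard_time = 0.0
    for j in range(i - 1, -1, -1):          # option: device k-1 holds layers [j, i)
        shard_mem  += layer_mem_gb[j]
        shard_time += layer_flops[j] / device_speed[k - 1]
        if shard_mem > device_mem_gb[k - 1]:
            break                           # memory constraint stops the shard growing
        hop = comm_s if j > 0 else 0.0      # pay one hop unless this is the first shard
        best = min(best, dp(j, k - 1) + hop + shard_time)
    return best

print(f"optimal latency under memory limits: {dp(L, D):.2f}")
```

The memory constraint here simply prunes shard extensions that exceed a device's budget, which is one way the paper's per-device memory check could be realized within the DP recursion.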
The methods and conclusions of `EdgeShard` are highly transferable. The `dynamic programming approach` for partitioning sequential models across heterogeneous resources could be applied to other `layered deep learning models` beyond LLMs (e.g., large `vision models` like `ViT`) or even other sequential computational tasks in edge environments. The principle of adaptive joint device selection and workload distribution is fundamental to optimizing any complex distributed task.
Beyond the authors' suggested future work, I believe further exploration could include:
* **Security and Trust**: In a multi-stakeholder scenario, beyond just incentives, ensuring the security and trustworthiness of each participating device (especially when processing sensitive intermediate activations) is paramount. `Homomorphic encryption` or `secure multi-party computation` could be explored, though they often introduce performance overheads.
* **Quantization Integration**: While `EdgeShard` currently uses full-precision models to avoid accuracy loss, combining the collaborative sharding with `quantization` (e.g., using different quantization levels for different shards based on device capabilities or sensitivity requirements) could offer even greater flexibility and resource efficiency for extreme edge devices.
* **Hierarchical Orchestration**: For extremely large-scale deployments, a hierarchical orchestration layer might be needed to manage clusters of collaborating edge devices, potentially involving `federated learning` principles for collaborative model updates or fine-tuning.
Overall, `EdgeShard` is a well-designed and thoroughly evaluated solution that lays important groundwork for the future of `LLM deployment` in increasingly distributed and heterogeneous computing landscapes.