
EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Published: 05/23/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

EdgeShard uses collaborative edge computing to shard LLMs across distributed edge devices and cloud servers, jointly optimizing device selection and layer-wise model partitioning via dynamic programming; it reduces inference latency by up to 50% and roughly doubles throughput over baselines while reducing dependence on the cloud.

Abstract

Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs heavily rely on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns. Edge computing is promising to address such concerns by deploying LLMs on edge devices, closer to data sources. Some works try to leverage model quantization to reduce the model size to fit the resource-constraint edge devices, but they lead to accuracy loss. Other works use cloud-edge collaboration, suffering from unstable network connections. In this work, we leverage collaborative edge computing to facilitate the collaboration among edge devices and cloud servers for jointly performing efficient LLM inference. We propose a general framework to partition the LLM model into shards and deploy on distributed devices. To achieve efficient LLM inference, we formulate an adaptive joint device selection and model partition problem and design an efficient dynamic programming algorithm to optimize the inference latency and throughput, respectively. Experiments of Llama2 serial models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement over baseline methods.


In-depth Reading


1. Bibliographic Information

1.1. Title

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

1.2. Authors

Mingjin Zhang, Jiannong Cao, Fellow, IEEE, Xiaoming Shen, Zeyang Cui

1.3. Journal/Conference

The paper is published as a preprint on arXiv, indicated by arXiv preprint arXiv:2405.14371, 2024. As an arXiv preprint, it has not yet undergone formal peer review by a journal or conference, but arXiv is a highly respected open-access repository for preprints in fields like physics, mathematics, computer science, etc., serving as a primary means for rapid dissemination of research findings. Jiannong Cao is listed as a Fellow, IEEE, which indicates a high level of recognition in the Institute of Electrical and Electronics Engineers, suggesting expertise in the field.

1.4. Publication Year

2024 (specifically May 23, 2024).

1.5. Abstract

The paper addresses the challenges of deploying Large Language Models (LLMs), such as prolonged latency, high bandwidth cost, and privacy concerns, which arise from their heavy reliance on cloud computing. While edge computing offers a promising alternative by placing LLMs closer to data sources, existing solutions like model quantization often lead to accuracy loss, and cloud-edge collaboration suffers from unstable network connections. This work introduces EdgeShard, a framework that leverages collaborative edge computing among edge devices and cloud servers for efficient LLM inference. EdgeShard partitions an LLM into shards and deploys them across distributed devices. To optimize performance, it formulates an adaptive joint device selection and model partition problem and proposes an efficient dynamic programming algorithm to optimize for inference latency and throughput. Experimental evaluations using Llama2 models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement compared to baseline methods.

Official Source: https://arxiv.org/abs/2405.14371v1 PDF Link: https://arxiv.org/pdf/2405.14371v1.pdf Status: Preprint (Version 1).

2. Executive Summary

2.1. Background & Motivation

The explosion of Large Language Models (LLMs) has brought unprecedented capabilities in natural language processing (NLP) and content generation. However, their immense computational and memory requirements currently necessitate heavy reliance on cloud computing. This reliance presents several critical issues:

  • Prolonged Latency: Sending data to and from distant cloud servers introduces delays, making real-time applications (e.g., robotics, navigation) impractical.

  • High Bandwidth Cost: Transmitting large volumes of data (text, images, video, IoT sensor data) to the cloud consumes significant network bandwidth and incurs costs.

  • Privacy Concerns: Handling sensitive data (e.g., medical records, financial data, personal mobile data) in a centralized cloud raises substantial privacy and security risks.

    Edge computing is a promising paradigm to mitigate these issues by moving computation closer to the data sources, on edge devices (e.g., edge servers, mobile phones). However, LLMs are computation-intensive and resource-greedy, often exceeding the memory and computational capacity of single edge devices. Existing approaches to adapt LLMs for edge environments face their own limitations:

  • Model Quantization: While it reduces model size, it frequently leads to accuracy loss.

  • Cloud-Edge Collaboration: Partitioning models between edge and cloud can be effective but often suffers from high and unstable network latency between edge devices and distant cloud servers.

    The paper identifies a gap: previous edge computing research often focuses on vertical collaboration (cloud-edge-end devices) but neglects horizontal edge-to-edge collaborations. With the continuous growth of edge computing power and the deployment of numerous edge servers and edge clouds, there's an opportunity to leverage a broader pool of distributed resources. This motivates the authors to explore collaborative edge computing (CEC)—integrating geo-distributed edge devices and cloud servers—as a more robust solution for LLM inference.

2.2. Main Contributions / Findings

The primary contributions of EdgeShard are threefold:

  • General LLM Inference Framework: Introduction of EdgeShard, a novel framework for efficient collaborative LLM inference across heterogeneous edge devices and cloud servers. This framework enables the dynamic selection of participating devices and the layer-wise partitioning of LLMs to optimize performance in a collaborative edge computing environment.
  • Adaptive Optimization Problem Formulation and Algorithm: The authors mathematically formulate a joint device selection and model partition problem. This problem aims to optimize either inference latency or throughput by considering heterogeneous computation and networking resources, as well as memory constraints. To solve this complex problem, they design an efficient dynamic programming (DP) algorithm.
  • Experimental Validation on a Physical Testbed: Extensive experiments are conducted using Llama2 serial models (7B, 13B, 70B) on a heterogeneous physical prototype. The results demonstrate that EdgeShard significantly outperforms various baseline methods, achieving:
    • Up to 50% latency reduction compared to Edge-Solo and Cloud-Edge-Opt for Llama2-7B.

    • 2x throughput improvement over Edge-Solo and Cloud-Edge-Opt for Llama2-7B.

    • The ability to deploy larger models like Llama2-70B, which are out-of-memory (OOM) for baseline methods.

    • Adaptability to varying network conditions and device heterogeneity.

      These findings highlight that EdgeShard provides a flexible and adaptive method for serving LLMs by efficiently utilizing ubiquitous computing devices, particularly beneficial in scenarios where cloud bandwidth is limited or traditional edge deployments are memory-constrained.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Large Language Models (LLMs)

LLMs are a class of deep learning models, typically decoder-based transformer models, characterized by their massive scale—often billions or even hundreds of billions of parameters. They are trained on vast amounts of text data, enabling them to understand, generate, and process human language with remarkable fluency and coherence.

  • Transformer Architecture: The fundamental architecture underpinning most modern LLMs. Introduced in "Attention Is All You Need" (Vaswani et al., 2017), it relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence when processing each element.

  • Decoder-based Models: Unlike encoder-based models (e.g., BERT) which are good for understanding existing text, decoder-based models (e.g., GPT series, Llama) are primarily designed for generative tasks, producing new text sequentially.

  • Autoregressive Nature of Inference: LLM inference, especially for generation, is autoregressive. This means that each new token generated depends on all the previously generated tokens and the initial input. The process typically involves two phases:

    • Prompt Processing (Prefill) Phase: The model takes an initial input sequence (the prompt) and processes it to generate the first new token. During this phase, it computes and stores intermediate representations (like Key-Value caches) for all input tokens.
    • Autoregressive Generation Phase: After the prefill phase, the model generates tokens one by one. In each step, it takes the current input (initial prompt + all previously generated tokens) and predicts the next token. This process repeats until an end-of-sequence token is generated or a maximum length is reached.
  • Key-Value Caching (KV Caching): A crucial optimization for autoregressive generation. Since each new token calculation depends on all previous tokens, recomputing Key and Value matrices for the entire sequence at each step would be very inefficient. KV caching stores these matrices from previous computations, allowing new tokens to be generated by only computing the Key and Value for the new token and concatenating it with the cached ones, significantly reducing computational workload and improving response times.

    The following figure (Figure 2 from the original paper) illustrates the autoregressive nature of LLM inference:


Fig. 2. LLM inference has an autoregressive nature.
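To make the two-phase, KV-cached decoding loop concrete, here is a toy Python sketch of greedy autoregressive generation. It assumes a HuggingFace-style causal LM interface (`past_key_values`, `use_cache`, `.logits`) and is purely illustrative; the EOS id and greedy decoding are arbitrary choices, not details from the paper.

```python
import torch

def generate(model, input_ids, max_new_tokens=96, eos_id=2):
    """Toy autoregressive decoding loop with a KV cache (illustrative only;
    `model` is assumed to be a HuggingFace-style causal LM that accepts
    `past_key_values` and returns logits plus the updated cache)."""
    past_key_values = None
    tokens = input_ids                 # prompt: shape (1, prompt_len)
    cur = input_ids
    for _ in range(max_new_tokens):
        # The first (prefill) step processes the whole prompt; later steps feed
        # only one token and reuse the cached keys/values of everything before it.
        out = model(input_ids=cur, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
        tokens = torch.cat([tokens, next_token], dim=-1)
        if next_token.item() == eos_id:
            break
        cur = next_token               # only the new token is recomputed
    return tokens
```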

3.1.2. Edge Computing

Edge computing is a distributed computing paradigm that brings computation and data storage closer to the data sources (the "edge" of the network), rather than relying solely on a centralized cloud data center. This proximity offers several advantages:

  • Reduced Latency: Faster response times for applications by minimizing the distance data travels.
  • Lower Bandwidth Consumption: Less data needs to be sent to the cloud, reducing network congestion and costs.
  • Enhanced Privacy and Security: Sensitive data can be processed locally, reducing exposure to public networks and centralized repositories.
  • Improved Reliability: Services can operate even with intermittent or unreliable cloud connectivity. Edge devices can range from small IoT sensors and mobile phones to more powerful edge servers and edge gateways.

3.1.3. Collaborative Edge Computing (CEC)

Collaborative Edge Computing (CEC) extends the concept of edge computing by emphasizing the collaboration among various geo-distributed edge devices and cloud servers to jointly perform computational tasks. Instead of isolated edge devices or simple vertical offloading to a single cloud, CEC envisions a network of interconnected computing resources forming a shared resource pool. This paradigm aims to:

  • Enlarge Resource Pool: Combine the compute, memory, and storage capacities of multiple heterogeneous devices.
  • Optimize Resource Utilization: Distribute workloads efficiently across available resources.
  • Provide Low-Latency Services: Keep processing close to data sources while leveraging broader capabilities.
  • Expand Service Region: Enable wider coverage and availability of AI services. The authors highlight that CEC differs from traditional edge computing by focusing on horizontal (edge-to-edge) as well as vertical (cloud-edge) collaborations.

The following figure (Figure 1 from the original paper) illustrates the concept of collaborative edge computing:


Fig. 1. Collaborative edge computing integrates the computing resources of ubiquitous geo-distributed devices for jointly performing computational tasks, with great benefits of enlarged resource pool, low-latency data processing, flexible device access, and expanded service region.

3.2. Previous Works

The paper discusses several existing approaches for deploying LLMs, primarily focusing on addressing their resource demands.

3.2.1. Model Quantization

Model quantization is a technique to reduce the memory footprint and computational cost of neural networks by representing their weights and/or activations with lower precision (e.g., 8-bit integers or 4-bit integers instead of 32-bit floating points).

  • GPTQ [8]: An early work that quantizes LLMs with billions of parameters to 3-4 bits using approximate second-order information. It primarily focuses on weight-only quantization.
  • AWQ (Activation-aware Weight Quantization) [10]: Optimizes channel scaling to preserve important weights, aiming to reduce quantization error. Also primarily weight-only.
  • SmoothQuant [11] and Agile-Quant [7]: Take a more comprehensive approach by quantizing not only model weights but also activations. This can lead to greater memory and computational savings.
  • Limitation: While effective in reducing model size, quantization often comes with a trade-off: it may lead to accuracy loss compared to full-precision models. The paper also points out that even quantized LLMs might still exceed the capacity of resource-constrained edge devices (e.g., Llama2-70B 4-bit still requires 35GB).
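A back-of-the-envelope check of the memory figures quoted above illustrates why even aggressive quantization does not fully solve the memory wall (weights only, ignoring activations and the KV cache):

$$70 \times 10^9 \ \text{parameters} \times 4\,\text{B (FP32)} \approx 280\ \text{GB}, \qquad 70 \times 10^9 \times 0.5\,\text{B (4-bit)} \approx 35\ \text{GB},$$

both of which exceed the memory of a typical single edge device.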

3.2.2. Cloud-Edge Collaboration

These approaches partition an LLM and offload parts of its computation workload between an edge device and a powerful cloud server.

  • Wang et al. [13]: Focus on distributing computation between cloud servers and edge devices to increase throughput and reduce communication overhead by leveraging the low-rank property of residual activations.
  • Chen et al. [14]: Propose using location-based information of edge devices for personalized prompt completion in a collaborative edge-cloud LLM serving scenario.
  • Limitation: A major drawback of these vertical collaboration methods is that the latency between edge devices and cloud servers is usually high and unstable, which can negatively impact the overall inference or fine-tuning performance of LLMs. They also typically involve only two computing entities (one edge, one cloud).

3.2.3. Distributed Training/Inference in Cloud Data Centers

These works focus on distributing large neural networks across multiple GPUs within a high-performance cloud data center.

  • Gpipe [17]: A system for pipeline parallelism that splits a deep neural network into stages, assigning each stage to a different GPU. It focuses on efficient training of very large models.
  • PipeDream [18]: Another system for generalized pipeline parallelism designed for DNN training, aiming to maximize hardware utilization.
  • Differentiation from EdgeShard: The paper explicitly states that these cloud-centric solutions are not directly applicable to edge computing environments due to key differences:
    • Homogeneity vs. Heterogeneity: Cloud servers typically have homogeneous GPUs connected by high-bandwidth networks (e.g., InfiniBand, NVLink, up to 600 GB/s). Edge devices, in contrast, are inherently heterogeneous in their computation capabilities and are connected by heterogeneous and low-bandwidth networks (e.g., dozens of Kbps to 1000 Mbps).
    • Resource Constraints: Cloud data centers have abundant resources, whereas edge environments are resource-constrained. Solutions designed for the cloud often neglect these heterogeneous and resource-constrained characteristics of edge computing.

3.3. Technological Evolution

The evolution of deploying large AI models has generally followed a path from centralized, powerful cloud data centers towards distributed, localized edge environments. Initially, the sheer scale of deep learning models necessitated cloud infrastructure. As models grew larger into LLMs, techniques like quantization emerged to squeeze them onto smaller devices, albeit with potential accuracy compromises. Simultaneously, cloud-edge collaboration offered a hybrid approach, but its performance remained tethered to the stability and speed of cloud connectivity.

This paper's work on EdgeShard represents a significant step in this evolution by proposing collaborative edge computing. It moves beyond the limitations of single-device edge deployment and simple cloud-edge pairings, embracing a broader network of heterogeneous edge devices and cloud servers. This approach allows for more flexible and robust deployment by dynamically leveraging available resources, bringing the benefits of distributed processing (like those seen in cloud data centers via pipeline parallelism) to the more challenging and diverse edge computing landscape.

3.4. Differentiation Analysis

EdgeShard differentiates itself from previous works in several critical ways:

  • Collaborative Edge Computing Paradigm: Unlike quantization which reduces model size at the cost of potential accuracy, or cloud-edge collaboration which is limited by unstable cloud links and typically involves only two nodes, EdgeShard embraces a truly collaborative edge computing environment. It integrates multiple heterogeneous edge devices and cloud servers into a single resource pool. This allows for horizontal (edge-to-edge) collaboration in addition to vertical (cloud-edge) collaboration, creating a more resilient and powerful system.
  • Adaptive Joint Optimization: Instead of fixed partitions or relying on simple offloading, EdgeShard formulates an adaptive joint device selection and model partition problem. This allows it to intelligently decide which devices to use and how to partition the LLM layers among them, dynamically optimizing for specific goals like latency or throughput. This contrasts with fixed partitioning strategies (e.g., Cloud-Edge-Even) which may not account for heterogeneity.
  • Handling Heterogeneity: EdgeShard is explicitly designed for heterogeneous computing devices and heterogeneous network connections typical of edge environments. This is a key departure from cloud-based distributed systems (like Gpipe or PipeDream) that assume homogeneous GPUs and high-bandwidth interconnects. EdgeShard's dynamic programming algorithm inherently factors in these varying computational capabilities and bandwidths.
  • Flexibility in Objectives: The framework provides distinct optimization algorithms for both minimizing inference latency (suitable for single-user, real-time scenarios) and maximizing inference throughput (suitable for multi-user or batch processing scenarios), offering tailored performance solutions.
  • Memory Wall Solution without Accuracy Loss: By sharding the LLM across multiple devices, EdgeShard enables the deployment of models (like Llama2-70B) that would otherwise exceed the memory capacity of any single edge device or even simple cloud-edge setups, without resorting to quantization and its associated accuracy loss.

4. Methodology

The EdgeShard framework facilitates efficient collaborative LLM inference on distributed devices by intelligently partitioning the LLM and allocating its shards to a selected subset of available computing devices. The overall process consists of three main stages: Profiling, Scheduling Optimization, and Collaborative Inference.

The following figure (Figure 3 from the original paper) provides an overview of the EdgeShard framework:

Fig. 3. Overview of the EdgeShard framework: profiling, scheduling optimization, and collaborative inference.

4.1. Principles

The core idea behind EdgeShard is to leverage the collective computational and memory resources of geo-distributed edge devices and cloud servers to overcome the limitations of deploying large LLMs on individual, resource-constrained devices. It views an LLM as a stack of layers and aims to distribute these layers (or shards) across a network of heterogeneous devices. The principle is that by optimally selecting devices and partitioning the model, the system can minimize overall inference latency or maximize throughput, even with varying computational capabilities and network bandwidths among the devices. The theoretical basis lies in dynamic programming for solving sequential decision-making problems, which naturally fits the layered structure of LLMs.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Profiling Stage

Profiling is an offline, one-time step that gathers essential runtime characteristics of the LLM layers on different devices and network conditions. This information is crucial for the subsequent scheduling optimization. The profiled traces include:

  1. Execution time of each layer on different devices: This involves measuring the time it takes for a specific LLM layer to compute on each available device. For LLMs, this profiling differentiates between the prefill stage and the autoregressive token-generation stage. If a device cannot hold the full model for profiling, a dynamic model loading technique is used to load layers consecutively.
  2. Size of activations and memory consumption for each layer: This determines the data transfer volume between layers and the memory footprint of each layer on a device.
  3. Available memory of each device: The total memory capacity of each potential computing device.
  4. Bandwidth among devices: The network communication speed between any two devices in the collaborative network.
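A minimal Python sketch of how the first two traces (per-layer execution time and activation size) could be collected with PyTorch is shown below; the layer object and input tensor are placeholders, and this is not the paper's profiler.

```python
import time
import torch

def profile_layer(layer, hidden, device, warmup=3, iters=10):
    """Measure average execution time (s) and activation size (bytes) of one
    LLM layer on a given device. Illustrative only; shapes are placeholders."""
    layer = layer.to(device).eval()
    hidden = hidden.to(device)
    with torch.no_grad():
        for _ in range(warmup):                  # warm up kernels and caches
            out = layer(hidden)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
        start = time.perf_counter()
        for _ in range(iters):
            out = layer(hidden)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
    elapsed = (time.perf_counter() - start) / iters
    out_tensor = out[0] if isinstance(out, tuple) else out
    activation_bytes = out_tensor.numel() * out_tensor.element_size()
    return elapsed, activation_bytes
```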

4.2.2. Scheduling Optimization Stage

At this stage, an intelligent scheduler uses the profiled information to determine an optimal deployment strategy. This strategy involves:

  • Device Selection: Deciding which subset of available computing devices (edge devices and cloud servers) will participate in the inference task.

  • Model Partition: Dividing the LLM into multiple shards. The partitioning is layer-wise, meaning specific layers are grouped together to form a shard.

  • Shard Allocation: Assigning each model shard to a selected device.

    The scheduling optimization considers:

  • Heterogeneous resources: The varying computational capabilities and memory budgets of different devices.

  • Memory budget of devices: Ensuring that the assigned layers do not exceed a device's memory capacity.

  • Privacy constraint: If applicable, specific layers might need to remain on certain devices for privacy reasons.

    This stage formulates the problem as a joint device selection and model partition problem and solves it using a dynamic programming algorithm to optimize for either inference latency or throughput.

4.2.3. Collaborative Inference Stage

Once the LLM model partition and allocation strategy are determined, the selected devices execute the inference collaboratively.

  • KV Cache Pre-allocation: Memory space is pre-allocated on each participating device for Key-Value (KV) caches to optimize autoregressive generation.
  • Two Cases for Collaborative Inference:
    • Sequential Inference: Devices process their allocated model shards in sequence.
      • Description: If an LLM is partitioned into $S$ shards allocated to devices $D_1, D_2, \dots, D_S$, then $D_1$ processes the input and sends its intermediate activations to $D_2$, which processes them and transmits to $D_3$, and so on.
      • Suitability: Ideal for serving a single user in scenarios like smart homes, where minimizing the latency for one request is paramount.
      • Resource Utilization: Not resource-efficient, as devices other than the currently processing one remain idle.
    • Pipeline Parallel Inference: Input data is split into micro-batches and processed in a pipeline fashion to improve resource utilization.
      • Description: Device $D_1$ processes micro-batch B1, then immediately starts micro-batch B2 while $D_2$ processes B1 (sent from $D_1$). This overlapping execution keeps all devices busy.

      • Suitability: Maximizes throughput by keeping devices utilized concurrently.

      • Inspiration: Similar to techniques used in cloud servers like Gpipe [17] and PipeDream [18].

        The following figure (Figure 4 from the original paper) illustrates collaborative LLM inference in both sequential and pipeline parallel modes:


Fig. 4. Collaborative LLM inference
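A toy timing model makes the contrast between the two modes concrete. Assuming (hypothetically) that every shard takes the same unit time per micro-batch and that communication is negligible, sequential execution scales with shards × batches, while an ideal pipeline pays the fill cost only once:

```python
def sequential_time(num_shards, num_batches, shard_time=1.0):
    # Each micro-batch traverses all shards before the next one starts.
    return num_batches * num_shards * shard_time

def pipeline_time(num_shards, num_batches, shard_time=1.0):
    # After a fill phase of `num_shards` steps, one micro-batch completes per step.
    return (num_shards + num_batches - 1) * shard_time

print(sequential_time(4, 8))   # 32.0 time units
print(pipeline_time(4, 8))     # 11.0 time units
```

With 4 shards and 8 micro-batches the sequential schedule takes 32 time units versus 11 for the pipeline, which is why pipeline parallel inference is the throughput-oriented mode.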

4.2.4. System Model

The system considers an LLM with $N$ layers and a network of $M$ heterogeneous edge devices and cloud servers.

  • $N$: Total number of layers in the LLM.

  • $O_i$: Size of activations (output) of layer $i$, for $0 \le i \le N-1$.

  • $Req_i$: Memory consumption required by layer $i$.

  • $M$: Total number of computing devices.

  • $Mem_j$: Memory budget of device $j$, for $0 \le j \le M-1$.

  • $B_{k,j}$: Bandwidth between device $k$ and device $j$, for $0 \le k, j \le M-1$.

  • Source Node: Device 0 is designated as the source node where input tokens originate.

    The main notations used in the paper are listed in Table II. The following are the results from [Table II] of the original paper:

| Symbol | Description |
| --- | --- |
| $X_{i,j}$ | Binary variable: whether layer $i$ of the model is allocated to device $j$ |
| $t_{comp}^{i,j}$ | Computation time of layer $i$ on device $j$ |
| $t_{comp}^{i \to m,j}$ | Computation time of layers $i$ to $m$ on device $j$ |
| $t_{comm}^{i-1,k,j}$ | Communication time to transmit activations of layer $i-1$ from device $k$ to device $j$ |
| $DP(i, j)$ | Minimal total execution time of the first $i$ layers if layer $i$ is allocated to device $j$ |
| $g(i, S, k)$ | Processing time of the slowest node to process the first $i$ layers with device set $S$ and $k$ as the last device |

4.2.5. Optimize LLM inference latency

Problem Formulation

The goal is to minimize the total inference latency $T_{tol}$. The allocation strategy is defined by a binary variable $X_{i,j}$:

  • $X_{i,j} = 1$ if layer $i$ is allocated to device $j$.
  • $X_{i,j} = 0$ otherwise.

Each layer must be allocated to exactly one device:

$$\sum_{j=0}^{M-1} X_{i,j} = 1, \quad \forall i$$

The communication time to transmit the activations of layer $i-1$ from device $k$ to device $j$, denoted $t_{comm}^{i-1,k,j}$, is calculated as:

$$t_{comm}^{i-1,k,j} = \begin{cases} \dfrac{O_{i-1}}{B_{k,j}}, & \text{if } k \neq j \\ 0, & \text{otherwise}, \end{cases}$$

where $O_{i-1}$ is the size of the activations of layer $i-1$, and $B_{k,j}$ is the bandwidth between device $k$ and device $j$. If layers $i-1$ and $i$ are on the same node, the communication time is zero.
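As a hedged numerical illustration (the activation size and bandwidth below are hypothetical, not values from the paper): with $O_{i-1} = 16\,\text{MB} = 128\,\text{Mb}$ of activations and a link of $B_{k,j} = 50\,\text{Mbps}$,

$$t_{comm}^{i-1,k,j} = \frac{128\ \text{Mb}}{50\ \text{Mbps}} \approx 2.56\ \text{s},$$

whereas placing layers $i-1$ and $i$ on the same device drives this term to zero.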

The total inference time $T_{tol}$ is the sum of the computation times of all layers plus the communication times between consecutive layers placed on different devices:

$$T_{tol} = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} X_{i,j} \cdot t_{comp}^{i,j} + \sum_{i=1}^{N-1} \sum_{j=0}^{M-1} \sum_{k=0}^{M-1} X_{i-1,k} \cdot X_{i,j} \cdot t_{comm}^{i-1,k,j}$$

where $t_{comp}^{i,j}$ is the computation time of layer $i$ on device $j$.

The problem of minimizing LLM inference latency is formulated as $\min T_{tol}$, subject to the following constraints:

  1. Privacy Constraint: The first layer must be allocated to the source node (device 0) to avoid transmitting raw input data: $X_{0,0} = 1$.
  2. Memory Constraint: The total memory required by all layers allocated to device $j$ cannot exceed its memory budget:
$$\sum_{i=0}^{N-1} X_{i,j} \cdot Req_{i} \leq Mem_{j}$$
where $Req_i$ is the memory consumption of layer $i$, and $Mem_j$ is the memory budget of device $j$.

Solution: Dynamic Programming Algorithm

A dynamic programming (DP) algorithm is designed to solve this problem by leveraging its optimal sub-problem property. Let $DP(i, j)$ denote the minimal total execution time of the first $i$ layers, given that layer $i$ is allocated to device $j$. For all but the last layer, the state transition considers every device $k$ that could host layer $i-1$; for the last layer $N-1$, an additional communication term $t_{comm}^{N-1,j,0}$ accounts for sending the generated token back to the source node (device 0). Equation (6) of the paper captures both cases:

$$DP(i, j) = \begin{cases} \min\limits_{0 \leq k < M} \left( DP(i-1, k) + t_{comp}^{i,j} + t_{comm}^{i-1,k,j} \right), & \text{if } i < N-1 \\ \min\limits_{0 \leq k < M} \left( DP(i-1, k) + t_{comp}^{i,j} + t_{comm}^{i-1,k,j} + t_{comm}^{i,j,0} \right), & \text{if } i = N-1 \end{cases}$$

The initialization for the first layer (layer 0) follows from the privacy constraint ($X_{0,0}=1$): $DP(0, 0) = t_{comp}^{0,0}$. The minimal total execution time for the entire LLM is then

$$\min_{j=0, \dots, M-1} DP(N-1, j),$$

and the optimal allocation strategy is reconstructed by backtracking the choice made at each step of the DP algorithm.

The algorithm to find the optimal LLM partition and allocation strategy for minimizing inference latency (Algorithm 1 in the paper) is presented as follows:

Algorithm 1: Joint device selection and LLM partition for optimizing latency
Input: An LLM model; computing devices $M$; profiled traces; bandwidth $B_{k,j}$
Output: the device selection and LLM partition strategy $R$

// initialization
1 Initialize the DP table $DP(i, j) = INF$ and the choice table $choice(i, j) = NULL$ to record the strategy;
2 Enforce the first layer to be allocated to node 0 by $DP(0, 0) = t_{comp}^{0,0}$ and $choice(0, 0) = 0$;

// fill in the DP table
3 for $i = 1$ to $N - 1$ do
4   for $j = 0$ to $M - 1$ do
5     if $Mem_j \leq Req_i$ then   // device $j$ cannot hold layer $i$
6       Continue;
7     end
8     else
9       for $k = 0$ to $M - 1$ do
10        Calculate the total execution time by Eq. (6) and assign it to $t_{total}$;
11        if $t_{total} \leq DP(i, j)$ then
12          Update $DP(i, j) = t_{total}$;
13          Update memory $Mem_j$;
14          Record the allocation plan $choice(i, j) = k$;
15        end
16      end
17    end
18  end
19 end

// backtrace for the allocation strategy
20 Initialize the optimal strategy $R$;
21 Find the last selected node $N_{last} = \arg\min_{j} DP(N - 1, j)$;
22 Add $(N-1, N_{last})$ to $R$;
23 for $i = N - 1$ down to 1 do
24   $N_{prev} = choice(i, N_{last})$;
25   Add $(i - 1, N_{prev})$ to $R$;
26   Update $N_{last} = N_{prev}$;
27 end
28 Reverse $R$;
29 return $R$.

Two aspects of the paper's pseudocode are underspecified and worth flagging. First, memory handling: line 5 only checks whether device $j$ can hold layer $i$ on its own, whereas constraint (5) requires the cumulative memory of *all* layers assigned to device $j$ to stay within its budget $Mem_j$. The "Update memory $Mem_j$" in line 13 is most naturally read as tracking the remaining memory of device $j$ as layers are assigned to it, although the DP state $DP(i, j)$ itself carries no memory information, so the cumulative constraint is not strictly enforced by the recurrence as written. Second, the backtrace: $R$ is best read as a list of (layer, device) pairs, so line 22 adds the assignment of the last layer, each iteration recovers the device of the preceding layer from the choice table, and the loop stops at layer 1 because layer 0 is fixed on device 0 by the privacy constraint; the list is then reversed to run from the first to the last layer.

**Computational Complexity**: The complexity of Algorithm 1 is $O(N \times M \times M)$, where $N$ is the number of layers and $M$ is the number of devices.
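To make the recurrence concrete, below is a minimal Python sketch of the latency-oriented DP and its backtrace. It is an illustration under stated assumptions, not the authors' implementation: the profiled times are passed in as dense tables, only the simple per-layer memory check from line 5 of Algorithm 1 is applied (the cumulative constraint of Eq. (5) would need extra bookkeeping), and `t_comm[N-1][j][0]` is assumed to hold the cost of returning the generated token to the source node.

```python
import math

def optimize_latency(t_comp, t_comm, req, mem):
    """Sketch of the latency-oriented DP recurrence (Eq. 6). Illustrative only.

    t_comp[i][j]    -- profiled computation time of layer i on device j
    t_comm[i][k][j] -- time to send activations of layer i from device k to j
                       (0 when k == j); t_comm[N-1][j][0] is the cost of
                       returning the generated token to source node 0
    req[i], mem[j]  -- per-layer memory need and per-device budget
    """
    N, M = len(t_comp), len(t_comp[0])
    INF = math.inf
    dp = [[INF] * M for _ in range(N)]
    choice = [[None] * M for _ in range(N)]
    dp[0][0] = t_comp[0][0]                      # privacy constraint: layer 0 on device 0

    for i in range(1, N):
        for j in range(M):
            if mem[j] <= req[i]:                 # device j cannot even hold layer i
                continue
            for k in range(M):
                cand = dp[i - 1][k] + t_comp[i][j] + t_comm[i - 1][k][j]
                if i == N - 1:                   # return the last token to the source node
                    cand += t_comm[i][j][0]
                if cand < dp[i][j]:
                    dp[i][j], choice[i][j] = cand, k

    # backtrace: assignment[i] = device hosting layer i
    j = min(range(M), key=lambda d: dp[N - 1][d])
    best = dp[N - 1][j]
    assignment = [0] * N
    for i in range(N - 1, 0, -1):
        assignment[i] = j
        j = choice[i][j]
    assignment[0] = 0
    return best, assignment
```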

### 4.2.6. Optimize LLM inference throughput

#### Problem Formulation
For optimizing `throughput`, `pipeline parallelism` is adopted. The goal is to maximize throughput, which is equivalent to minimizing the latency of the slowest device, since communication and computation times can be overlapped.
For a selected set of devices $S$, the latency of a device $j \in S$ is determined by the maximum of its computation time and its incoming communication time for the block of layers it hosts:

$$T_{latency}^{j} = \max \left\{ t_{comp}^{i \to m,j},\; t_{comm}^{i-1,k,j} \right\}$$

where $t_{comp}^{i \to m,j}$ is the computation time of layers $i$ to $m$ on device $j$, and $t_{comm}^{i-1,k,j}$ is the communication time to transmit the activations of layer $i-1$ from device $k$ to device $j$.
The problem of maximizing inference throughput is then formulated as minimizing the maximum latency across all selected devices:

$$\min \, \max \{ T_{latency}^{j} \mid j \in S \}$$


#### Solution: Dynamic Programming Algorithm
Similar to latency optimization, this problem also exhibits the `optimal sub-problem property` and is solvable using `dynamic programming`.
Let $g(i, S, k)$ denote the minimum time of the slowest node to process the first $i$ layers, with the set of used devices $S$ and device $k \in S$ being the last node used.
The state transition equation for $g(m, S', j)$ (processing the first $m$ layers with device set $S' = S \cup \{j\}$, where $j$ is the last device) is:

$$g(m, S', j) = \min_{0 \leq i < m \leq N-1,\; k \in S} \max \left\{ g(i, S, k),\; t_{comm}^{i-1,k,j},\; t_{comp}^{i \to m,j} \right\}$$

This equation means that to reach state $(m, S', j)$, we consider all previous states $(i, S, k)$ and pick the one that minimizes the maximum of: (1) the slowest-stage time up to layer $i$ using set $S$ ending with $k$, (2) the communication time from $k$ to $j$, and (3) the computation time on $j$ for layers $i$ to $m$.

Constraints for the state transition:
1.  **Memory Constraint**: The memory required for layers $i$ to $m$ on device $j$ must not exceed its budget:
    $$Req_{i \to m} \leq Mem_{j}$$
    where $Req_{i \to m}$ is the sum of the memory requirements of layers $i$ through $m$.
2.  **Privacy Constraint**: As in the latency case, the first layer stays on device 0:
    $$g(1, \{0\}, 0) = t_{comp}^{0,0}$$
    This initializes the state for the first layer (layer 0) on device 0: $(1, \{0\}, 0)$ denotes processing one layer (layer 0) with device set $\{0\}$ and device 0 as the last device.

The final optimal solution $T_{throu}^{opt}$ is the minimum of $g(N-1, S', j)$ over all possible device sets $S'$ and last devices $j \in S'$.

The algorithm for optimizing throughput (Algorithm 2 in the paper) is provided as follows:

Algorithm 2: Joint device selection and LLM partition for optimizing throughput
Input: An LLM model; computing devices $M$; profiled traces; bandwidth $B_{k,j}$
Output: the device selection and LLM partition strategy $R$

// initialization
1 Initialize the DP table $g(i, S, k) = INF$ and the choice table $choice(m, S, j) = NULL$ to record the strategy;
2 Enforce the first layer to be allocated to node 0 by $g(1, \{0\}, 0) = t_{comp}^{0,0}$ and $choice(1, \{0\}, 0) = (0, 0, 0)$;

// fill in the DP table
3 for $i = 1$ to $N - 1$ do                         // upper layer index of the previous block
4   for each subset $S \subseteq M$ do              // devices used for the first $i$ layers
5     for each last node $k \in S$ do
6       for $m = i + 1$ to $N - 1$ do               // upper layer index of the current block
7         for each $j \in M \setminus S$ do         // new device to host the next block of layers
8           if $\sum_{l=i}^{m} Req_l > Mem_j$ then  // layers $i$ to $m$ exceed device $j$'s memory
9             Continue;
10          end
11          else
12            Get $S'$ by adding node $j$ to the selected device set $S$;
13            Calculate the current maximum stage time $T_{max} = \max \left( g(i, S, k),\; t_{comm}^{i-1,k,j},\; t_{comp}^{i \to m,j} \right)$ via Eq. (11);
14          end
15          if $T_{max} \leq g(m, S', j)$ then
16            $g(m, S', j) = T_{max}$;
17            Record the current strategy $choice(m, S', j) = (i, S, k)$;
18          end
19        end
20      end
21    end
22  end
23 end

// backtrace for the optimal allocation
24 Initialize the optimal strategy $R$;
25 Find the selected device set $S_{opt}$ and the last selected node $N_{last}$ by $S_{opt}, N_{last} = \arg\min_{S, k} \, g(N - 1, S, k)$;
26 Initialize $layer = N - 1$;
27 while $layer > 0$ do
28   $(prev\_layer, prev\_S, prev\_k) = choice(layer, S_{opt}, N_{last})$;
29   Add $(prev\_layer \to layer, N_{last})$ to $R$;   // layers $prev\_layer$ to $layer$ run on device $N_{last}$
30   Update $layer = prev\_layer$, $S_{opt} = prev\_S$, and $N_{last} = prev\_k$;
31 end
32 return $R$.

Note that line 17 records the predecessor state $(i, S, k)$ so that the backtrace in lines 27-31 can recover, for each block of layers, the device that hosts it and the state it was extended from.

**Computational Complexity**: The complexity of Algorithm 2 is $O(N^2 \times 2^M \times M^2)$, where $N$ is the number of layers and $M$ is the number of devices. This is significantly higher than that of Algorithm 1 because it iterates over all subsets of the devices $S$.
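For illustration, here is a minimal Python sketch of the throughput-oriented DP, with device subsets encoded as bitmasks. It is not the authors' code: the block-level compute times `t_comp_block[i][m][j]` and per-link transfer times `t_comm[i][k][j]` are assumed profiling inputs, and, because the paper's block boundaries are slightly ambiguous (layers $i$ to $m$ vs. $i+1$ to $m$), the sketch assigns the non-overlapping block $i{+}1..m$ to the newly added device.

```python
import math

def optimize_throughput(t_comp_block, t_comm, req, mem):
    """Sketch of the subset DP for maximizing throughput (minimizing the
    slowest pipeline stage). Illustrative only; not the authors' code.

    t_comp_block[i][m][j] -- time for device j to compute layers i..m (inclusive)
    t_comm[i][k][j]       -- time to send activations of layer i from device k to j
    req[i], mem[j]        -- per-layer memory need and per-device budget
    Layers are 0..N-1, devices 0..M-1, and device 0 is the source node.
    """
    N, M = len(req), len(mem)
    INF = math.inf
    g = {(0, 1 << 0, 0): t_comp_block[0][0][0]}   # privacy: layer 0 on device 0
    choice = {}

    # prefix sums of per-layer memory, so a block's footprint is O(1) to query
    pref = [0]
    for r in req:
        pref.append(pref[-1] + r)

    for i in range(N - 1):                        # last layer of the previous block
        for (last, mask, k), slowest in list(g.items()):
            if last != i:
                continue
            for j in range(M):
                if mask & (1 << j):               # device j already used
                    continue
                for m in range(i + 1, N):         # device j hosts layers i+1..m
                    if pref[m + 1] - pref[i + 1] > mem[j]:
                        break                     # larger blocks only need more memory
                    stage = max(slowest,
                                t_comm[i][k][j],
                                t_comp_block[i + 1][m][j])
                    key = (m, mask | (1 << j), j)
                    if stage < g.get(key, INF):
                        g[key] = stage
                        choice[key] = (i, mask, k)

    # the optimum is the smallest slowest-stage time among states covering all layers
    best_key = min((key for key in g if key[0] == N - 1), key=lambda key: g[key])
    return g[best_key], choice, best_key
```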

### 4.2.7. Pipeline Execution Optimization
The ideal `pipeline parallel inference` assumes no device idleness. However, `decoder-based LLM inference` has an `autoregressive nature`, meaning token generation is sequential and relies on previously generated tokens. This can lead to `bubbles` (idle time) in the pipeline if a device waits for a complete `micro-batch` to finish all its token generations before starting the next.

*   **EdgeShard-Bubbles**: The traditional pipeline approach where devices wait for full `micro-batches`. As shown in Fig. 5(a), there are periods where devices are idle, leading to resource underutilization.
*   **EdgeShard-No-bubbles**: This proposed optimization aims to reduce these bubbles by allowing `immediate token generation` without waiting for the completion of all `micro-batches` in an iteration. As illustrated in Fig. 5(b), after the `prefill stage` (`P1`) of the first `micro-batch` ends, Device 1 immediately starts the `token generation` (`G1A`) for that `micro-batch`. Similarly, upon completion of `G1A`, Device 1 proceeds to the next iteration of token generation (`G1B`). This approach mitigates device idle time, enhancing resource utilization and improving `throughput`.

    The following figure (Figure 5 from the original paper) contrasts `bubbles` and `no-bubbles` pipeline execution:


Fig. 5. This figure from the paper contrasts pipeline execution with `Bubbles` (top) and `No-bubbles` (bottom) for collaborative LLM inference. In the `Bubbles` approach, devices wait for the completion of entire micro-batches before proceeding, leading to idle times. In the `No-bubbles` approach, devices immediately move to the next token generation step of a micro-batch, reducing idle periods and improving resource utilization.

# 5. Experimental Setup

## 5.1. Datasets
*   **Benchmark Tasks**: The experiments focus on `text generation` tasks, which is a core application of LLMs.
*   **LLM Models**: `Llama2 serial models` are used: `Llama2-7B`, `Llama2-13B`, and `Llama2-70B`. These models, released by Meta in 2023, are popular and powerful open-source LLMs.
*   **Dataset**: The `WikiText-2 dataset [21]` from HuggingFace is utilized.
    *   **Characteristics**: A collection of high-quality articles from Wikipedia, primarily used for language modeling tasks.
    *   **Input/Output Configuration**: A subset of samples is extracted with an `input token length of 32` and `generation of 96 tokens`.
*   **Precision**: All experiments use `full-precision model inference`, avoiding `quantization` to isolate the performance benefits of `EdgeShard`'s collaboration strategy.
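For orientation, the evaluation inputs could be reproduced roughly as follows with HuggingFace `datasets` and `transformers`; the exact sample selection, decoding settings, and model checkpoint name are assumptions rather than details reported in the paper.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical reconstruction of the evaluation inputs: 32-token prompts,
# 96 newly generated tokens, full-precision weights (no quantization).
# Llama-2 checkpoints are gated on the Hugging Face Hub (access approval required).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
text = next(t for t in dataset["text"] if len(t.split()) > 64)  # pick a non-trivial snippet

inputs = tokenizer(text, return_tensors="pt",
                   truncation=True, max_length=32)              # 32-token input prompt
outputs = model.generate(**inputs, max_new_tokens=96, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```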

## 5.2. Evaluation Metrics
The performance of `EdgeShard` and baseline methods are evaluated using two key metrics:
1.  **Latency**:
    *   **Conceptual Definition**: Latency, in this context, refers to the average time taken to generate a single token during the LLM inference process. It measures the responsiveness of the system. Lower latency indicates faster responses, which is critical for real-time interactive applications.
    *   **Unit**: Milliseconds per token (`milliseconds/token`).
    *   **Mathematical Formula**: While not explicitly provided in the paper, average latency per token is typically calculated as:
        $$\text{Average Latency} = \frac{\text{Total Inference Time}}{\text{Total Number of Generated Tokens}}$$
        Where `Total Inference Time` includes both computation and communication time for all layers to generate all tokens, and `Total Number of Generated Tokens` is the number of new tokens produced by the model (e.g., 96 tokens in their setup).
2.  **Throughput**:
    *   **Conceptual Definition**: Throughput measures the rate at which the system can process tokens, specifically the number of tokens generated per second. It indicates the system's capacity to handle workloads, particularly useful for batch processing or serving multiple users concurrently. Higher throughput means more tokens can be generated in a given time period.
    *   **Unit**: Tokens per second (`tokens/second`).
    *   **Mathematical Formula**: While not explicitly provided, throughput is generally calculated as:
        $$\text{Throughput} = \frac{\text{Total Number of Generated Tokens}}{\text{Total Inference Time}}$$
        This is the inverse of latency when considering a single batch/request. For multiple batches/requests, it's the total tokens generated across all batches divided by the total time. The paper mentions `maximum batch size`, implying throughput is measured for the maximum load the system can handle.

## 5.3. Baselines
`EdgeShard`'s performance is compared against several baseline methods, designed to cover different deployment strategies in edge and cloud environments. `Cloud-only` is explicitly excluded as a baseline due to `privacy concerns` regarding raw input transmission.

1.  **Edge-Solo**:
    *   **Description**: In this baseline, the entire LLM is deployed and inferred locally on a `single edge device` without any model partitioning or collaboration.
    *   **Representativeness**: Represents the simplest, fully on-device deployment. It is limited by the memory and computational capacity of that single edge device.

2.  **Cloud-Edge-Even**:
    *   **Description**: The LLM is `evenly partitioned` into two parts. One part is deployed on an `edge device`, and the other on a `cloud server`. The partitioning is fixed and doesn't adapt to device capabilities.
    *   **Representativeness**: Represents a naive `cloud-edge collaboration` strategy, splitting the model equally without optimization.

3.  **Cloud-Edge-Opt**:
    *   **Description**: The LLM is partitioned into two shards, with one on an `edge device` and another on a `cloud server`. However, unlike `Cloud-Edge-Even`, the `partitioning strategy` is optimized using the `dynamic programming algorithm` proposed in the paper (Algorithms 1 or 2), but constrained to only two devices (one edge, one cloud).
    *   **Representativeness**: Represents an optimized `vertical cloud-edge collaboration` and serves as a direct comparison for how `EdgeShard`'s algorithm performs when restricted to a two-device scenario. A toy sketch contrasting the even and optimized two-device splits follows this list.
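The difference between the two cloud-edge baselines can be sketched as follows. This is a toy illustration under assumed inputs (hypothetical profiled per-layer times `edge_layer_ms`/`cloud_layer_ms` and a single activation-transfer cost `comm_ms`), not the paper's Algorithm 1 or 2.

```python
def even_split(num_layers: int) -> int:
    """Cloud-Edge-Even: place the first half of the layers on the edge device."""
    return num_layers // 2

def optimized_split(edge_layer_ms, cloud_layer_ms, comm_ms):
    """Cloud-Edge-Opt (two-device special case): try every cut point and keep
    the one with the lowest estimated per-token latency.

    Layers [0, cut) run on the edge device, layers [cut, n) on the cloud.
    cut == n means full local execution, so no activation crosses the link.
    """
    num_layers = len(edge_layer_ms)
    best_cut, best_latency = num_layers, float("inf")
    for cut in range(num_layers + 1):
        latency = sum(edge_layer_ms[:cut]) + sum(cloud_layer_ms[cut:])
        if cut < num_layers:          # activation must cross the edge-cloud link
            latency += comm_ms
        if latency < best_latency:
            best_cut, best_latency = cut, latency
    return best_cut, best_latency

if __name__ == "__main__":
    edge = [4.0] * 32                  # hypothetical per-layer ms on the edge device
    cloud = [0.5] * 32                 # hypothetical per-layer ms on the cloud GPU
    print(even_split(32))              # 16, regardless of device speeds or link
    print(optimized_split(edge, cloud, comm_ms=2.0))    # fast link: offload heavily
    print(optimized_split(edge, cloud, comm_ms=500.0))  # slow link: stay local
```

With a fast link, the best cut shifts layers toward the faster cloud GPU (the toy even allows a pure-cloud cut, which the paper excludes for privacy reasons); when `comm_ms` dominates, the best cut is `num_layers`, i.e., full local execution, matching the later observation that `Cloud-Edge-Opt` reduces to `Edge-Solo` at 1Mbps.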

## 5.4. Testbed
The experiments are conducted on a `heterogeneous physical testbed` designed to simulate a `collaborative edge computing` environment.

*   **Devices**:
    *   **Edge Devices**: 12 `Jetson AGX Orin` units and 2 `Jetson Orin NX` units.
    *   **Cloud Server**: 1 server equipped with an `RTX 3090 GPU`.
    *   **Total Devices**: 15 devices in total.

        The specifications of these devices are detailed in Table III:
The following are the specifications from [Table III] of the original paper:

| Category | Device | Memory | AI Performance |
| :--- | :--- | :--- | :--- |
| Edge Device | Jetson AGX Orin | 32GB | 3.33 TFLOPS |
| Edge Device | Jetson Orin NX | 16GB | 1.88 TFLOPS |
| Cloud Server | RTX 3090 | 24GB | 36 TFLOPS |
*   **Network Configuration**:
    *   All devices are interconnected via a `router` and a `switch`.
    *   The base bandwidth between any two devices is `1000Mbps`.
    *   The `Linux TC tool [20]` is used to programmatically vary `network bandwidth` and `communication latency` between devices, allowing for testing under diverse network conditions (an illustrative shaping sketch is given at the end of this subsection).
*   **Specific Network Settings for the Overall Evaluation**:
    *   Source node: `AGX Orin`.
    *   Bandwidth between the source node and the cloud server: `1Mbps`.
    *   Bandwidth between the other computing devices: `50Mbps` with a `20% variance`.
*   **Batch Size**: For throughput tests, the `batch size` is set to the `maximum batch size` that the participating devices can support.

Figure 6 of the original paper shows the physical testbed: heterogeneous edge devices and a cloud server, whose specifications are listed in Table III.
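For context on the network emulation step, the sketch below shows how per-link delay and bandwidth shaping with the Linux `tc` tool might be scripted. This is an illustrative assumption, not the authors' actual configuration: the interface name `eth0` and the shaping parameters are placeholders, and running `tc` requires root privileges.

```python
import subprocess

def shape_link(dev: str = "eth0", delay: str = "20ms", rate: str = "50mbit") -> None:
    """Emulate a constrained link on `dev`: netem adds delay, tbf caps the rate."""
    # Remove any existing root qdisc first (ignore the error if none exists).
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=False)
    # netem as the root qdisc for the added delay.
    subprocess.run(["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:",
                    "netem", "delay", delay], check=True)
    # tbf attached beneath netem for the bandwidth cap (burst in bytes).
    subprocess.run(["tc", "qdisc", "add", "dev", dev, "parent", "1:1", "handle", "10:",
                    "tbf", "rate", rate, "burst", "32768", "latency", "400ms"], check=True)

def reset_link(dev: str = "eth0") -> None:
    """Remove all shaping from `dev`."""
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=False)
```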
# 6. Results & Analysis

## 6.1. Core Results Analysis
The experiments evaluate `EdgeShard` against the baseline methods across the `Llama2 models` under varying network conditions and source node capabilities, focusing on `inference latency` and `throughput`.

### 6.1.1. Overall Evaluation (Fixed Network Conditions)
The initial evaluation sets the source node as an `AGX Orin`, with 1Mbps bandwidth between the source node and the cloud server, and 50Mbps (with 20% variance) between the other devices. The batch size is set to the maximum supported by the participating devices.

The following are the results from [Table IV] of the original paper:

| Method | Llama2-7B Latency (ms) | Llama2-7B Throughput (tokens/s) | Llama2-13B Latency (ms) | Llama2-13B Throughput (tokens/s) | Llama2-70B Latency (ms) | Llama2-70B Throughput (tokens/s) |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| Edge-Solo | 140.34 | 24.36 | OOM | OOM | OOM | OOM |
| Cloud-Edge-Even | 227.35 | 7.56 | 319.44 | 4.68 | OOM | OOM |
| Cloud-Edge-Opt | 140.34 | 24.36 | 243.45 | 4.74 | OOM | OOM |
| EdgeShard | 75.88 | 52.45 | 173.43 | 10.45 | 3086.43 | 1.25 |
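A quick back-of-envelope estimate helps explain the OOM entries. This is an approximation (roughly 4 bytes per parameter for full-precision weights, ignoring activations and the KV cache), consistent with the 280GB figure for Llama2-70B cited in the observations below and with the device memories in Table III:

$$
70 \times 10^{9}\ \text{params} \times 4\ \text{bytes/param} \approx 280\ \text{GB} \gg 32\ \text{GB (AGX Orin)} + 24\ \text{GB (RTX 3090)}.
$$

No single device, and no two-device edge-cloud split, can hold the model; the aggregate memory of the full testbed ($12 \times 32 + 2 \times 16 + 24 = 440$ GB), however, can accommodate it once the model is sharded across many devices.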
**Key Observations from Table IV:**

*   **Deployment Capability for Large Models**: For `Llama2-70B`, `Edge-Solo`, `Cloud-Edge-Even`, and `Cloud-Edge-Opt` all suffer from `Out-Of-Memory (OOM)` errors, since Llama2-70B requires about 280GB of memory, far exceeding individual device capacities and simple two-device splits. `EdgeShard`, however, successfully deploys and infers `Llama2-70B` (3086.43 ms latency, 1.25 tokens/s throughput) by splitting it into shards across multiple devices, showing that EdgeShard can break through the memory wall for very large LLMs.
*   **Latency Reduction for Smaller Models**: For `Llama2-7B`, `EdgeShard` achieves `75.88 ms` latency, approximately `1.85x` faster than `Edge-Solo` and `Cloud-Edge-Opt` (both `140.34 ms`) and about `3x` faster than `Cloud-Edge-Even` (`227.35 ms`).
*   **Throughput Improvement for Smaller Models**: For `Llama2-7B`, `EdgeShard` achieves `52.45 tokens/s` throughput, around `2.2x` higher than `Edge-Solo` and `Cloud-Edge-Opt` (`24.36 tokens/s`) and about `7x` higher than `Cloud-Edge-Even` (`7.56 tokens/s`).
*   **Similar Improvements for Llama2-13B**: `EdgeShard` achieves `45.7%` and `28.8%` lower latency, and `2.23x` and `2.2x` higher throughput, compared to `Cloud-Edge-Even` and `Cloud-Edge-Opt`, respectively.
*   **Impact of Limited Cloud Bandwidth**: For `Llama2-7B`, `Cloud-Edge-Opt` performs identically to `Edge-Solo`. The `1Mbps` bandwidth between the source node and the cloud server is so limited that the optimal strategy for `Cloud-Edge-Opt` is full local execution, effectively reducing it to `Edge-Solo` to avoid the high communication cost.

### 6.1.2. Effects of Bandwidth
This experiment varies the bandwidth between the source node (AGX Orin) and the cloud server from 1Mbps to 50Mbps. For `Llama2-13B` and `Llama2-70B`, the baselines `Edge-Solo`, `Cloud-Edge-Even`, and `Cloud-Edge-Opt` are not applicable, either because of `OOM` or because they perform the same as `Edge-Solo` at low bandwidths. `EdgeShard` is additionally compared with its variant `EdgeShard-Even` (evenly partitioned across devices).

The following figure (Figure 7 from the original paper) shows the impact of network bandwidth on the latency of collaborative LLM inference:

![Fig. 7. Impact of Network Bandwidth to Latency of Collaborative LLMs inference](/files/papers/6901877330e1bdf721191ce5/images/7.jpg)
*The figure shows the impact of network bandwidth on collaborative LLM inference latency for Llama2-7B, Llama2-13B, and Llama2-70B, comparing the performance of the deployment schemes at different bandwidths.*

Fig. 7. Impact of Network Bandwidth to Latency of Collaborative LLMs inference

**Observations on Latency (Fig. 7):**

*   **Bandwidth Sensitivity**: Except for `Edge-Solo` (which runs fully locally and is unaffected by the cloud link), the latency of the collaboration-based methods (`Cloud-Edge-Even`, `Cloud-Edge-Opt`, `EdgeShard`) decreases as bandwidth increases, owing to shorter data transmission time.
*   **Bandwidth Saturation**: Latency drops dramatically when bandwidth increases from `1Mbps` to `10Mbps`. Beyond `10Mbps` (up to `50Mbps`), the reduction becomes minor, suggesting that `computation time` becomes the `bottleneck` once communication costs are no longer dominant.
*   **Cloud-Edge vs. Edge-Solo**: When bandwidth exceeds `10Mbps`, the `cloud-edge collaboration methods` (`Cloud-Edge-Even`, `Cloud-Edge-Opt`) start outperforming `Edge-Solo`, because the powerful cloud server accelerates computation enough to outweigh the communication cost.
*   **EdgeShard vs. Cloud-Edge-Opt**: Interestingly, for `Llama2-7B` and `Llama2-13B`, when bandwidth exceeds `10Mbps`, `EdgeShard` shows nearly identical latency to `Cloud-Edge-Opt`. The authors found that under these conditions `EdgeShard`'s optimization algorithm converges to the same model partition and allocation policy as `Cloud-Edge-Opt`; in effect, `Cloud-Edge-Opt` is a special case of `EdgeShard` (using only two devices when that is optimal).
*   **EdgeShard-Even for Llama2-70B**: `EdgeShard` consistently outperforms `EdgeShard-Even` for `Llama2-70B`, demonstrating the benefit of adaptive partitioning even when most devices are homogeneous, particularly with the much more powerful `RTX 3090` cloud server in the mix. The improvement is less pronounced because 11 identical AGX Orin devices participate.

The following figure (Figure 8 from the original paper) shows the impact of bandwidth on the throughput of collaborative LLM inference:

![Fig. 8. Impact of Bandwidth to Throughput of Collaborative LLMs Inference](/files/papers/6901877330e1bdf721191ce5/images/8.jpg)
*The figure has three subplots for Llama2-7B, Llama2-13B, and Llama2-70B; the x-axis is bandwidth (Mbps) and the y-axis is throughput (tokens/s), showing how EdgeShard's throughput varies with bandwidth relative to the other methods.*

Fig. 8. Impact of Bandwidth to Throughput of Collaborative LLMs Inference

**Observations on Throughput (Fig. 8):**

*   **Similar Patterns**: Throughput generally follows the same pattern as latency, increasing with bandwidth for the collaborative methods.
*   **Memory-Driven Throughput for Llama2-13B**: For `Llama2-13B`, `EdgeShard` delivers `2x` higher throughput than `Cloud-Edge-Opt` at `10Mbps`, despite similar latency at higher bandwidths. `Cloud-Edge-Opt` (with only two devices) drives memory consumption on the `RTX 3090` and the `source node` to `95%-98%`, limiting it to a `maximum batch size of 4`. `EdgeShard`, by involving several edge devices, significantly reduces per-device memory consumption, allowing a `larger batch size (8)` and thus higher throughput.
*   **EdgeShard-Even for Llama2-70B**: `EdgeShard-Even` shows stable throughput across bandwidths because its fixed partitioning strategy does not change. `EdgeShard` offers a slight improvement over `EdgeShard-Even` for `Llama2-70B` thanks to adaptive resource utilization.

### 6.1.3. Effects of Source Node
This experiment investigates how the choice of the `source node` (where input tokens originate) impacts performance, comparing `AGX Orin` and `Orin NX` as the source node, with the cloud-source bandwidth fixed at `1Mbps`.

The following figure (Figure 9 from the original paper) shows the impact of the source node:

![Fig. 9. Impact of Source Node](/files/papers/6901877330e1bdf721191ce5/images/9.jpg)
*The figure compares the methods on Llama2-7B, with latency (ms) in the top panel and throughput (tokens/s) in the bottom panel; the two colors distinguish the two source devices (AGX Orin vs. Orin NX), and some methods run out of memory (OOM).*

Fig. 9. Impact of Source Node

**Observations (Fig. 9):**

*   **Memory Constraints with a Weaker Source Node**: When the source node is the `Orin NX` (16GB of memory versus the `AGX Orin`'s 32GB), `Edge-Solo` and `Cloud-Edge-Even` encounter `OOM` errors even for `Llama2-7B`, showing that a single, less capable device cannot host the full model or an evenly partitioned half.
*   **Performance Gap in Cloud-Edge-Opt**: The performance difference between `AGX Orin` and `Orin NX` as the source node is much more pronounced for `Cloud-Edge-Opt` (around a `60ms` latency gap) than for `EdgeShard` (around a `5ms` latency gap).
    *   This is because `Cloud-Edge-Opt`, being a two-device setup, tends to place more layers on the source node; if the source node is the weaker `Orin NX`, it becomes a significant bottleneck.
*   **EdgeShard's Adaptability**: `EdgeShard` involves more devices and strategically places fewer model layers on the source node, effectively filling the computation-capacity gap between heterogeneous source nodes. This yields stable performance regardless of the source node's individual capability.
*   **Throughput Advantage**: Similarly, with `AGX Orin` as the source node, `Cloud-Edge-Opt` achieves `6x` higher throughput than with `Orin NX`, whereas `EdgeShard` shows only a `2x` difference. This further highlights `EdgeShard`'s ability to use the broader pool of network resources to compensate for weaker individual devices.

### 6.1.4. Effects of Pipeline Execution Strategy
This evaluation compares `EdgeShard-No-bubble` against `EdgeShard-Bubble`, with the cloud-source bandwidth set to `1Mbps`.

![Throughput comparison of EdgeShard-No-bubble and EdgeShard-Bubble for Llama2-7B and Llama2-13B](/files/papers/6901877330e1bdf721191ce5/images/10.jpg)
*The figure compares the throughput of Llama2-7B and Llama2-13B under the different methods, with blue and orange bars distinguishing EdgeShard-No-bubble and EdgeShard-Bubble; the EdgeShard variants clearly outperform the Cloud-Edge methods in throughput.*

**Observations (Fig. 10):**

*   **EdgeShard-No-bubble Superiority**: `EdgeShard-No-bubble` consistently outperforms `EdgeShard-Bubble` across all methods for `Llama2-7B` and `Llama2-13B`.
    *   For `Llama2-7B`, `EdgeShard-No-bubble` improves throughput by `0.36 tokens/s` for `Cloud-Edge-Even` and `6.96 tokens/s` for `EdgeShard`. (For `Cloud-Edge-Opt`, local execution is selected at 1Mbps, so no pipeline is involved and the throughput is identical for both strategies.)
    *   For `Llama2-13B`, the improvements are `1.69 tokens/s` for `Cloud-Edge-Even`, `1.89 tokens/s` for `Cloud-Edge` (likely `Cloud-Edge-Opt`), and `5.21 tokens/s` for `EdgeShard`.
*   **Reduced Idle Time**: `EdgeShard-No-bubble` is superior because it mitigates `device idle time` by allowing immediate token generation without waiting for all `micro-batches` in an iteration to complete, which reduces `bubbles` in the pipeline and raises `throughput`.

## 6.2. Data Presentation (Tables)
Table IV has been transcribed in Section 6.1.1.

## 6.3. Ablation Studies / Parameter Analysis
The paper presents several implicit ablation and parameter analyses:

*   **Adaptive vs. Even Partitioning**: The comparison between `EdgeShard` and `EdgeShard-Even` for `Llama2-70B` (in "Effects of Bandwidth") acts as an ablation study of the adaptive partitioning strategy. `EdgeShard`'s superior performance demonstrates the value of its optimization algorithm over a simpler, fixed even split, especially under device heterogeneity.
*   **Impact of Network Bandwidth**: The "Effects of Bandwidth" section analyzes a crucial network parameter. `EdgeShard` adapts to bandwidth changes, converging to the `Cloud-Edge-Opt` strategy when bandwidth is abundant but outperforming it when bandwidth is constrained or when more devices are needed to handle memory and computation. This highlights the adaptability of the dynamic programming approach.
*   **Impact of Source Node Capabilities**: The "Effects of Source Node" section implicitly ablates the contribution of a powerful source node. When the source node is weak (`Orin NX`), `EdgeShard`'s ability to distribute the workload across more devices becomes more critical, demonstrating its robustness against individual device limitations.
*   **Pipeline Execution Strategy**: The comparison between `EdgeShard-Bubble` and `EdgeShard-No-bubble` directly evaluates the effectiveness of the proposed `no-bubbles` optimization. This is a clear ablation showing that reducing idle time in the pipeline significantly boosts throughput; an idealized schedule-length estimate follows this list.
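As a back-of-envelope illustration of why removing bubbles helps, the sketch below compares idealized schedule lengths for the two strategies. It is a simplification under assumed conditions (perfectly balanced stages, unit per-stage time, at least as many micro-batches as stages, communication ignored), not the paper's exact pipeline model.

```python
def bubble_schedule_length(num_stages: int, num_microbatches: int, num_iterations: int) -> int:
    """Bubble-style schedule: every token iteration waits for all micro-batches,
    so each iteration pays its own pipeline fill-and-drain cost."""
    return num_iterations * (num_microbatches + num_stages - 1)

def no_bubble_schedule_length(num_stages: int, num_microbatches: int, num_iterations: int) -> int:
    """No-bubble-style schedule: a micro-batch starts its next token iteration as
    soon as its previous one leaves the last stage, keeping the pipeline full."""
    assert num_microbatches >= num_stages, "pipeline stays full only if M >= S"
    return num_iterations * num_microbatches + num_stages - 1

if __name__ == "__main__":
    S, M, N = 4, 8, 96  # e.g., 4 shards, 8 micro-batches, 96 generated tokens
    print(bubble_schedule_length(S, M, N))     # 96 * (8 + 4 - 1) = 1056 stage-times
    print(no_bubble_schedule_length(S, M, N))  # 96 * 8 + 4 - 1   = 771 stage-times
```

In this idealized setting, the bubble variant pays the fill-and-drain cost once per token iteration, while the no-bubble variant pays it only once in total, which is consistent with the throughput gains reported above.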
These analyses collectively demonstrate that `EdgeShard`'s `adaptive device selection and model partition` (driven by the DP algorithms) and its `pipeline execution optimizations` are both crucial for achieving superior performance in heterogeneous collaborative edge environments.

# 7. Conclusion & Reflections

## 7.1. Conclusion Summary
This work introduces `EdgeShard`, a pioneering framework for the efficient deployment and distributed inference of `Large Language Models (LLMs)` within a `collaborative edge computing` environment. By intelligently partitioning LLMs into `shards` and adaptively allocating them across heterogeneous edge devices and cloud servers, `EdgeShard` effectively addresses the `high latency`, `bandwidth costs`, and `memory limitations` inherent in cloud-centric or single-device edge deployments. At its core are `dynamic programming algorithms` that solve a `joint device selection and model partition problem`, optimizing for either `inference latency` or `throughput`. Extensive experiments on a physical prototype with `Llama2 models` show that `EdgeShard` significantly outperforms the baseline methods, achieving up to `50% latency reduction` and `2x throughput improvement`. The findings highlight `EdgeShard`'s adaptability to diverse network conditions and its ability to run `LLM inference` for models that would otherwise be `out-of-memory` on traditional edge or simple cloud-edge setups.

## 7.2. Limitations & Future Work
The authors acknowledge several open issues and suggest future research directions:

*   **Incentive Mechanisms**: The current framework assumes a cooperative environment in which devices either belong to a single entity or are willing to share resources. For scenarios involving devices owned by different stakeholders, robust `incentive mechanisms` are needed to encourage resource sharing and participation in collaborative inference.
*   **Batch-Size-Aware Optimization**: While `EdgeShard` implicitly benefits from larger batch sizes because sharding frees memory, the current `dynamic programming algorithm` does not treat the `batch size` as an explicit optimization variable. Future work could integrate `batch size` into the optimization problem for further throughput gains.
*   **Dynamic Re-partitioning**: Profiling and scheduling are currently offline steps. In highly dynamic edge environments, network conditions and device availability can change rapidly; `dynamic re-partitioning` or `adaptive rescheduling` mechanisms could cope with such real-time changes.
*   **Energy Efficiency**: Since edge devices are often battery-powered, optimizing for `energy consumption` alongside latency and throughput is a critical future direction, especially for long-term deployments.
*   **Fault Tolerance**: In a distributed system with many heterogeneous devices, device failures and intermittent connectivity can occur. `Fault-tolerant mechanisms` that keep LLM inference running despite device unreliability would therefore be important.

## 7.3. Personal Insights & Critique
This paper presents a highly relevant and well-executed approach to a critical problem in the burgeoning field of `Edge AI`.
The shift from monolithic cloud LLMs to distributed edge deployments is inevitable given the demands for `low latency`, `privacy`, and `bandwidth efficiency`, and `EdgeShard` provides a solid foundational framework for this transition.

One of the paper's strengths is its rigorous experimental validation on a physical heterogeneous testbed, which adds significant credibility over simulation-based studies. The detailed analysis of `bandwidth effects`, `source node capabilities`, and `pipeline execution strategies` provides valuable insights into the practical challenges and opportunities of `collaborative edge computing`. The `no-bubbles` pipeline optimization is a clever detail that addresses a specific characteristic of `autoregressive LLM inference`, showcasing a deep understanding of the problem domain.

A minor critique, as noted in the methodology section, is the slight ambiguity in the pseudo-code of Algorithm 1 regarding the memory constraint and how $Mem_j$ is updated. While such bookkeeping is typically handled within the DP, the exact mechanics could have been spelled out more explicitly for someone implementing the algorithm solely from the paper.

The methods and conclusions of `EdgeShard` are highly transferable. The `dynamic programming approach` for partitioning sequential models across heterogeneous resources could be applied to other layered deep learning models beyond LLMs (e.g., large `vision models` such as `ViT`), or even to other sequential computational tasks in edge environments. The principle of adaptive joint device selection and workload distribution is fundamental to optimizing any complex distributed task.

Beyond the authors' suggested future work, I believe further exploration could include:

*   **Security and Trust**: In a multi-stakeholder scenario, beyond incentives, ensuring the security and trustworthiness of each participating device (especially when it processes sensitive intermediate activations) is paramount. `Homomorphic encryption` or `secure multi-party computation` could be explored, though they often introduce performance overheads.
*   **Quantization Integration**: While `EdgeShard` currently uses full-precision models to avoid accuracy loss, combining collaborative sharding with `quantization` (e.g., different quantization levels for different shards based on device capabilities or sensitivity requirements) could offer even greater flexibility and resource efficiency on extreme edge devices.
*   **Hierarchical Orchestration**: For very large-scale deployments, a hierarchical orchestration layer might be needed to manage clusters of collaborating edge devices, potentially drawing on `federated learning` principles for collaborative model updates or fine-tuning.

Overall, `EdgeShard` is a well-designed and thoroughly evaluated solution that lays important groundwork for the future of `LLM deployment` in increasingly distributed and heterogeneous computing landscapes.
