
EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Published: 05/23/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

EdgeShard uses collaborative edge computing to shard LLMs across distributed edge devices and cloud servers, jointly optimizing device selection and layer-wise model partitioning via dynamic programming; it reduces inference latency by up to 50% and roughly doubles throughput over baselines while reducing dependence on the cloud.

Abstract

Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs heavily rely on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns. Edge computing is promising to address such concerns by deploying LLMs on edge devices, closer to data sources. Some works try to leverage model quantization to reduce the model size to fit the resource-constraint edge devices, but they lead to accuracy loss. Other works use cloud-edge collaboration, suffering from unstable network connections. In this work, we leverage collaborative edge computing to facilitate the collaboration among edge devices and cloud servers for jointly performing efficient LLM inference. We propose a general framework to partition the LLM model into shards and deploy on distributed devices. To achieve efficient LLM inference, we formulate an adaptive joint device selection and model partition problem and design an efficient dynamic programming algorithm to optimize the inference latency and throughput, respectively. Experiments of Llama2 serial models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement over baseline methods.


In-depth Reading


1. Bibliographic Information

1.1. Title

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

1.2. Authors

Mingjin Zhang, Jiannong Cao, Fellow, IEEE, Xiaoming Shen, Zeyang Cui

1.3. Journal/Conference

The paper is published as a preprint on arXiv, indicated by arXiv preprint arXiv:2405.14371, 2024. As an arXiv preprint, it has not yet undergone formal peer review by a journal or conference, but arXiv is a highly respected open-access repository for preprints in fields like physics, mathematics, computer science, etc., serving as a primary means for rapid dissemination of research findings. Jiannong Cao is listed as a Fellow, IEEE, which indicates a high level of recognition in the Institute of Electrical and Electronics Engineers, suggesting expertise in the field.

1.4. Publication Year

2024 (specifically May 23, 2024).

1.5. Abstract

The paper addresses the challenges of deploying Large Language Models (LLMs), such as prolonged latency, high bandwidth cost, and privacy concerns, which arise from their heavy reliance on cloud computing. While edge computing offers a promising alternative by placing LLMs closer to data sources, existing solutions like model quantization often lead to accuracy loss, and cloud-edge collaboration suffers from unstable network connections. This work introduces EdgeShard, a framework that leverages collaborative edge computing among edge devices and cloud servers for efficient LLM inference. EdgeShard partitions an LLM into shards and deploys them across distributed devices. To optimize performance, it formulates an adaptive joint device selection and model partition problem and proposes an efficient dynamic programming algorithm to optimize for inference latency and throughput. Experimental evaluations using Llama2 models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement compared to baseline methods.

Official Source: https://arxiv.org/abs/2405.14371v1 PDF Link: https://arxiv.org/pdf/2405.14371v1.pdf Status: Preprint (Version 1).

2. Executive Summary

2.1. Background & Motivation

The explosion of Large Language Models (LLMs) has brought unprecedented capabilities in natural language processing (NLP) and content generation. However, their immense computational and memory requirements currently necessitate heavy reliance on cloud computing. This reliance presents several critical issues:

  • Prolonged Latency: Sending data to and from distant cloud servers introduces delays, making real-time applications (e.g., robotics, navigation) impractical.

  • High Bandwidth Cost: Transmitting large volumes of data (text, images, video, IoT sensor data) to the cloud consumes significant network bandwidth and incurs costs.

  • Privacy Concerns: Handling sensitive data (e.g., medical records, financial data, personal mobile data) in a centralized cloud raises substantial privacy and security risks.

    Edge computing is a promising paradigm to mitigate these issues by moving computation closer to the data sources, on edge devices (e.g., edge servers, mobile phones). However, LLMs are computation-intensive and resource-greedy, often exceeding the memory and computational capacity of single edge devices. Existing approaches to adapt LLMs for edge environments face their own limitations:

  • Model Quantization: While it reduces model size, it frequently leads to accuracy loss.

  • Cloud-Edge Collaboration: Partitioning models between edge and cloud can be effective but often suffers from high and unstable network latency between edge devices and distant cloud servers.

    The paper identifies a gap: previous edge computing research often focuses on vertical collaboration (cloud-edge-end devices) but neglects horizontal edge-to-edge collaborations. With the continuous growth of edge computing power and the deployment of numerous edge servers and edge clouds, there's an opportunity to leverage a broader pool of distributed resources. This motivates the authors to explore collaborative edge computing (CEC)—integrating geo-distributed edge devices and cloud servers—as a more robust solution for LLM inference.

2.2. Main Contributions / Findings

The primary contributions of EdgeShard are threefold:

  • General LLM Inference Framework: Introduction of EdgeShard, a novel framework for efficient collaborative LLM inference across heterogeneous edge devices and cloud servers. This framework enables the dynamic selection of participating devices and the layer-wise partitioning of LLMs to optimize performance in a collaborative edge computing environment.
  • Adaptive Optimization Problem Formulation and Algorithm: The authors mathematically formulate a joint device selection and model partition problem. This problem aims to optimize either inference latency or throughput by considering heterogeneous computation and networking resources, as well as memory constraints. To solve this complex problem, they design an efficient dynamic programming (DP) algorithm.
  • Experimental Validation on a Physical Testbed: Extensive experiments are conducted using Llama2 serial models (7B, 13B, 70B) on a heterogeneous physical prototype. The results demonstrate that EdgeShard significantly outperforms various baseline methods, achieving:
    • Up to 50% latency reduction compared to Edge-Solo and Cloud-Edge-Opt for Llama2-7B.

    • 2x throughput improvement over Edge-Solo and Cloud-Edge-Opt for Llama2-7B.

    • The ability to deploy larger models like Llama2-70B, which are out-of-memory (OOM) for baseline methods.

    • Adaptability to varying network conditions and device heterogeneity.

      These findings highlight that EdgeShard provides a flexible and adaptive method for serving LLMs by efficiently utilizing ubiquitous computing devices, particularly beneficial in scenarios where cloud bandwidth is limited or traditional edge deployments are memory-constrained.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Large Language Models (LLMs)

LLMs are a class of deep learning models, typically decoder-based transformer models, characterized by their massive scale—often billions or even hundreds of billions of parameters. They are trained on vast amounts of text data, enabling them to understand, generate, and process human language with remarkable fluency and coherence.

  • Transformer Architecture: The fundamental architecture underpinning most modern LLMs. Introduced in "Attention Is All You Need" (Vaswani et al., 2017), it relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence when processing each element.

  • Decoder-based Models: Unlike encoder-based models (e.g., BERT) which are good for understanding existing text, decoder-based models (e.g., GPT series, Llama) are primarily designed for generative tasks, producing new text sequentially.

  • Autoregressive Nature of Inference: LLM inference, especially for generation, is autoregressive. This means that each new token generated depends on all the previously generated tokens and the initial input. The process typically involves two phases:

    • Prompt Processing (Prefill) Phase: The model takes an initial input sequence (the prompt) and processes it to generate the first new token. During this phase, it computes and stores intermediate representations (like Key-Value caches) for all input tokens.
    • Autoregressive Generation Phase: After the prefill phase, the model generates tokens one by one. In each step, it takes the current input (initial prompt + all previously generated tokens) and predicts the next token. This process repeats until an end-of-sequence token is generated or a maximum length is reached.
  • Key-Value Caching (KV Caching): A crucial optimization for autoregressive generation. Since each new token calculation depends on all previous tokens, recomputing Key and Value matrices for the entire sequence at each step would be very inefficient. KV caching stores these matrices from previous computations, allowing new tokens to be generated by only computing the Key and Value for the new token and concatenating it with the cached ones, significantly reducing computational workload and improving response times.

    The following figure (Figure 2 from the original paper) illustrates the autoregressive nature of LLM inference:


Fig. 2. LLM inference has an autoregressive nature.
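To make the two-phase, KV-cached decoding loop concrete, here is a toy Python sketch of greedy autoregressive generation. It assumes a HuggingFace-style causal LM interface (`past_key_values`, `use_cache`, `.logits`) and is purely illustrative; the EOS id and greedy decoding are arbitrary choices, not details from the paper.

```python
import torch

def generate(model, input_ids, max_new_tokens=96, eos_id=2):
    """Toy autoregressive decoding loop with a KV cache (illustrative only;
    `model` is assumed to be a HuggingFace-style causal LM that accepts
    `past_key_values` and returns logits plus the updated cache)."""
    past_key_values = None
    tokens = input_ids                 # prompt: shape (1, prompt_len)
    cur = input_ids
    for _ in range(max_new_tokens):
        # The first (prefill) step processes the whole prompt; later steps feed
        # only one token and reuse the cached keys/values of everything before it.
        out = model(input_ids=cur, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
        tokens = torch.cat([tokens, next_token], dim=-1)
        if next_token.item() == eos_id:
            break
        cur = next_token               # only the new token is recomputed
    return tokens
```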

3.1.2. Edge Computing

Edge computing is a distributed computing paradigm that brings computation and data storage closer to the data sources (the "edge" of the network), rather than relying solely on a centralized cloud data center. This proximity offers several advantages:

  • Reduced Latency: Faster response times for applications by minimizing the distance data travels.
  • Lower Bandwidth Consumption: Less data needs to be sent to the cloud, reducing network congestion and costs.
  • Enhanced Privacy and Security: Sensitive data can be processed locally, reducing exposure to public networks and centralized repositories.
  • Improved Reliability: Services can operate even with intermittent or unreliable cloud connectivity. Edge devices can range from small IoT sensors and mobile phones to more powerful edge servers and edge gateways.

3.1.3. Collaborative Edge Computing (CEC)

Collaborative Edge Computing (CEC) extends the concept of edge computing by emphasizing the collaboration among various geo-distributed edge devices and cloud servers to jointly perform computational tasks. Instead of isolated edge devices or simple vertical offloading to a single cloud, CEC envisions a network of interconnected computing resources forming a shared resource pool. This paradigm aims to:

  • Enlarge Resource Pool: Combine the compute, memory, and storage capacities of multiple heterogeneous devices.
  • Optimize Resource Utilization: Distribute workloads efficiently across available resources.
  • Provide Low-Latency Services: Keep processing close to data sources while leveraging broader capabilities.
  • Expand Service Region: Enable wider coverage and availability of AI services. The authors highlight that CEC differs from traditional edge computing by focusing on horizontal (edge-to-edge) as well as vertical (cloud-edge) collaborations.

The following figure (Figure 1 from the original paper) illustrates the concept of collaborative edge computing:


Fig. 1. Collaborative edge computing integrates the computing resources of ubiquitous geo-distributed devices for jointly performing computational tasks, with great benefits of enlarged resource pool, low-latency data processing, flexible device access, and expanded service region.

3.2. Previous Works

The paper discusses several existing approaches for deploying LLMs, primarily focusing on addressing their resource demands.

3.2.1. Model Quantization

Model quantization is a technique to reduce the memory footprint and computational cost of neural networks by representing their weights and/or activations with lower precision (e.g., 8-bit integers or 4-bit integers instead of 32-bit floating points).

  • GPTQ [8]: An early work that quantizes LLMs with billions of parameters to 3-4 bits using approximate second-order information. It primarily focuses on weight-only quantization.
  • AWQ (Activation-aware Weight Quantization) [10]: Optimizes channel scaling to preserve important weights, aiming to reduce quantization error. Also primarily weight-only.
  • SmoothQuant [11] and Agile-Quant [7]: Take a more comprehensive approach by quantizing not only model weights but also activations. This can lead to greater memory and computational savings.
  • Limitation: While effective in reducing model size, quantization often comes with a trade-off: it may lead to accuracy loss compared to full-precision models. The paper also points out that even quantized LLMs might still exceed the capacity of resource-constrained edge devices (e.g., Llama2-70B 4-bit still requires 35GB).
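A back-of-the-envelope check of the memory figures quoted above illustrates why even aggressive quantization does not fully solve the memory wall (weights only, ignoring activations and the KV cache):

$$70 \times 10^9 \ \text{parameters} \times 4\,\text{B (FP32)} \approx 280\ \text{GB}, \qquad 70 \times 10^9 \times 0.5\,\text{B (4-bit)} \approx 35\ \text{GB},$$

both of which exceed the memory of a typical single edge device.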

3.2.2. Cloud-Edge Collaboration

These approaches partition an LLM and offload parts of its computation workload between an edge device and a powerful cloud server.

  • Wang et al. [13]: Focus on distributing computation between cloud servers and edge devices to increase throughput and reduce communication overhead by leveraging the low-rank property of residual activations.
  • Chen et al. [14]: Propose using location-based information of edge devices for personalized prompt completion in a collaborative edge-cloud LLM serving scenario.
  • Limitation: A major drawback of these vertical collaboration methods is that the latency between edge devices and cloud servers is usually high and unstable, which can negatively impact the overall inference or fine-tuning performance of LLMs. They also typically involve only two computing entities (one edge, one cloud).

3.2.3. Distributed Training/Inference in Cloud Data Centers

These works focus on distributing large neural networks across multiple GPUs within a high-performance cloud data center.

  • Gpipe [17]: A system for pipeline parallelism that splits a deep neural network into stages, assigning each stage to a different GPU. It focuses on efficient training of very large models.
  • PipeDream [18]: Another system for generalized pipeline parallelism designed for DNN training, aiming to maximize hardware utilization.
  • Differentiation from EdgeShard: The paper explicitly states that these cloud-centric solutions are not directly applicable to edge computing environments due to key differences:
    • Homogeneity vs. Heterogeneity: Cloud servers typically have homogeneous GPUs connected by high-bandwidth networks (e.g., InfiniBand, NVLink, up to 600 GB/s). Edge devices, in contrast, are inherently heterogeneous in their computation capabilities and are connected by heterogeneous and low-bandwidth networks (e.g., dozens of Kbps to 1000 Mbps).
    • Resource Constraints: Cloud data centers have abundant resources, whereas edge environments are resource-constrained. Solutions designed for the cloud often neglect these heterogeneous and resource-constrained characteristics of edge computing.

3.3. Technological Evolution

The evolution of deploying large AI models has generally followed a path from centralized, powerful cloud data centers towards distributed, localized edge environments. Initially, the sheer scale of deep learning models necessitated cloud infrastructure. As models grew larger into LLMs, techniques like quantization emerged to squeeze them onto smaller devices, albeit with potential accuracy compromises. Simultaneously, cloud-edge collaboration offered a hybrid approach, but its performance remained tethered to the stability and speed of cloud connectivity.

This paper's work on EdgeShard represents a significant step in this evolution by proposing collaborative edge computing. It moves beyond the limitations of single-device edge deployment and simple cloud-edge pairings, embracing a broader network of heterogeneous edge devices and cloud servers. This approach allows for more flexible and robust deployment by dynamically leveraging available resources, bringing the benefits of distributed processing (like those seen in cloud data centers via pipeline parallelism) to the more challenging and diverse edge computing landscape.

3.4. Differentiation Analysis

EdgeShard differentiates itself from previous works in several critical ways:

  • Collaborative Edge Computing Paradigm: Unlike quantization which reduces model size at the cost of potential accuracy, or cloud-edge collaboration which is limited by unstable cloud links and typically involves only two nodes, EdgeShard embraces a truly collaborative edge computing environment. It integrates multiple heterogeneous edge devices and cloud servers into a single resource pool. This allows for horizontal (edge-to-edge) collaboration in addition to vertical (cloud-edge) collaboration, creating a more resilient and powerful system.
  • Adaptive Joint Optimization: Instead of fixed partitions or relying on simple offloading, EdgeShard formulates an adaptive joint device selection and model partition problem. This allows it to intelligently decide which devices to use and how to partition the LLM layers among them, dynamically optimizing for specific goals like latency or throughput. This contrasts with fixed partitioning strategies (e.g., Cloud-Edge-Even) which may not account for heterogeneity.
  • Handling Heterogeneity: EdgeShard is explicitly designed for heterogeneous computing devices and heterogeneous network connections typical of edge environments. This is a key departure from cloud-based distributed systems (like Gpipe or PipeDream) that assume homogeneous GPUs and high-bandwidth interconnects. EdgeShard's dynamic programming algorithm inherently factors in these varying computational capabilities and bandwidths.
  • Flexibility in Objectives: The framework provides distinct optimization algorithms for both minimizing inference latency (suitable for single-user, real-time scenarios) and maximizing inference throughput (suitable for multi-user or batch processing scenarios), offering tailored performance solutions.
  • Memory Wall Solution without Accuracy Loss: By sharding the LLM across multiple devices, EdgeShard enables the deployment of models (like Llama2-70B) that would otherwise exceed the memory capacity of any single edge device or even simple cloud-edge setups, without resorting to quantization and its associated accuracy loss.

4. Methodology

The EdgeShard framework facilitates efficient collaborative LLM inference on distributed devices by intelligently partitioning the LLM and allocating its shards to a selected subset of available computing devices. The overall process consists of three main stages: Profiling, Scheduling Optimization, and Collaborative Inference.

The following figure (Figure 3 from the original paper) provides an overview of the EdgeShard framework:

Fig. 3. Overview of the EdgeShard framework: profiling, scheduling optimization, and collaborative inference.

4.1. Principles

The core idea behind EdgeShard is to leverage the collective computational and memory resources of geo-distributed edge devices and cloud servers to overcome the limitations of deploying large LLMs on individual, resource-constrained devices. It views an LLM as a stack of layers and aims to distribute these layers (or shards) across a network of heterogeneous devices. The principle is that by optimally selecting devices and partitioning the model, the system can minimize overall inference latency or maximize throughput, even with varying computational capabilities and network bandwidths among the devices. The theoretical basis lies in dynamic programming for solving sequential decision-making problems, which naturally fits the layered structure of LLMs.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Profiling Stage

Profiling is an offline, one-time step that gathers essential runtime characteristics of the LLM layers on different devices and network conditions. This information is crucial for the subsequent scheduling optimization. The profiled traces include:

  1. Execution time of each layer on different devices: This involves measuring the time it takes for a specific LLM layer to compute on each available device. For LLMs, this profiling differentiates between the prefill stage and the autoregressive token-generation stage. If a device cannot hold the full model for profiling, a dynamic model loading technique is used to load layers consecutively.
  2. Size of activations and memory consumption for each layer: This determines the data transfer volume between layers and the memory footprint of each layer on a device.
  3. Available memory of each device: The total memory capacity of each potential computing device.
  4. Bandwidth among devices: The network communication speed between any two devices in the collaborative network.
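A minimal Python sketch of how the first two traces (per-layer execution time and activation size) could be collected with PyTorch is shown below; the layer object and input tensor are placeholders, and this is not the paper's profiler.

```python
import time
import torch

def profile_layer(layer, hidden, device, warmup=3, iters=10):
    """Measure average execution time (s) and activation size (bytes) of one
    LLM layer on a given device. Illustrative only; shapes are placeholders."""
    layer = layer.to(device).eval()
    hidden = hidden.to(device)
    with torch.no_grad():
        for _ in range(warmup):                  # warm up kernels and caches
            out = layer(hidden)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
        start = time.perf_counter()
        for _ in range(iters):
            out = layer(hidden)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
    elapsed = (time.perf_counter() - start) / iters
    out_tensor = out[0] if isinstance(out, tuple) else out
    activation_bytes = out_tensor.numel() * out_tensor.element_size()
    return elapsed, activation_bytes
```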

4.2.2. Scheduling Optimization Stage

At this stage, an intelligent scheduler uses the profiled information to determine an optimal deployment strategy. This strategy involves:

  • Device Selection: Deciding which subset of available computing devices (edge devices and cloud servers) will participate in the inference task.

  • Model Partition: Dividing the LLM into multiple shards. The partitioning is layer-wise, meaning specific layers are grouped together to form a shard.

  • Shard Allocation: Assigning each model shard to a selected device.

    The scheduling optimization considers:

  • Heterogeneous resources: The varying computational capabilities and memory budgets of different devices.

  • Memory budget of devices: Ensuring that the assigned layers do not exceed a device's memory capacity.

  • Privacy constraint: If applicable, specific layers might need to remain on certain devices for privacy reasons.

    This stage formulates the problem as a joint device selection and model partition problem and solves it using a dynamic programming algorithm to optimize for either inference latency or throughput.

4.2.3. Collaborative Inference Stage

Once the LLM model partition and allocation strategy are determined, the selected devices execute the inference collaboratively.

  • KV Cache Pre-allocation: Memory space is pre-allocated on each participating device for Key-Value (KV) caches to optimize autoregressive generation.
  • Two Cases for Collaborative Inference:
    • Sequential Inference: Devices process their allocated model shards in sequence.
      • Description: If an LLM is partitioned into $S$ shards allocated to devices $D_1, D_2, \dots, D_S$, then $D_1$ processes the input and sends its intermediate activations to $D_2$, which processes them and transmits to $D_3$, and so on.
      • Suitability: Ideal for serving a single user in scenarios like smart homes, where minimizing the latency for one request is paramount.
      • Resource Utilization: Not resource-efficient, as devices other than the currently processing one remain idle.
    • Pipeline Parallel Inference: Input data is split into micro-batches and processed in a pipeline fashion to improve resource utilization.
      • Description: Device $D_1$ processes micro-batch B1, then immediately starts micro-batch B2 while $D_2$ processes B1 (sent from $D_1$). This overlapping execution keeps all devices busy.

      • Suitability: Maximizes throughput by keeping devices utilized concurrently.

      • Inspiration: Similar to techniques used in cloud servers like Gpipe [17] and PipeDream [18].

        The following figure (Figure 4 from the original paper) illustrates collaborative LLM inference in both sequential and pipeline parallel modes:


Fig. 4. Collaborative LLM inference
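A toy timing model makes the contrast between the two modes concrete. Assuming (hypothetically) that every shard takes the same unit time per micro-batch and that communication is negligible, sequential execution scales with shards × batches, while an ideal pipeline pays the fill cost only once:

```python
def sequential_time(num_shards, num_batches, shard_time=1.0):
    # Each micro-batch traverses all shards before the next one starts.
    return num_batches * num_shards * shard_time

def pipeline_time(num_shards, num_batches, shard_time=1.0):
    # After a fill phase of `num_shards` steps, one micro-batch completes per step.
    return (num_shards + num_batches - 1) * shard_time

print(sequential_time(4, 8))   # 32.0 time units
print(pipeline_time(4, 8))     # 11.0 time units
```

With 4 shards and 8 micro-batches the sequential schedule takes 32 time units versus 11 for the pipeline, which is why pipeline parallel inference is the throughput-oriented mode.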

4.2.4. System Model

The system considers an LLM with $N$ layers and a network of $M$ heterogeneous edge devices and cloud servers.

  • $N$: Total number of layers in the LLM.

  • $O_i$: Size of activations (output) of layer $i$, for $0 \le i \le N-1$.

  • $Req_i$: Memory consumption required by layer $i$.

  • $M$: Total number of computing devices.

  • $Mem_j$: Memory budget of device $j$, for $0 \le j \le M-1$.

  • $B_{k,j}$: Bandwidth between device $k$ and device $j$, for $0 \le k, j \le M-1$.

  • Source Node: Device 0 is designated as the source node where input tokens originate.

    The main notations used in the paper are listed in Table II. The following are the results from [Table II] of the original paper:

| Symbol | Description |
| --- | --- |
| $X_{i,j}$ | Binary variable: whether layer $i$ of the model is allocated to device $j$ |
| $t_{comp}^{i,j}$ | Computation time of layer $i$ on device $j$ |
| $t_{comp}^{i \to m,j}$ | Computation time of layers $i$ to $m$ on device $j$ |
| $t_{comm}^{i-1,k,j}$ | Communication time to transmit activations of layer $i-1$ from device $k$ to device $j$ |
| $DP(i, j)$ | Minimal total execution time of the first $i$ layers if layer $i$ is allocated to device $j$ |
| $g(i, S, k)$ | Processing time of the slowest node to process the first $i$ layers with device set $S$ and $k$ as the last device |

4.2.5. Optimize LLM inference latency

Problem Formulation

The goal is to minimize the total inference latency $T_{tol}$. The allocation strategy is defined by a binary variable $X_{i,j}$:

  • $X_{i,j} = 1$ if layer $i$ is allocated to device $j$.
  • $X_{i,j} = 0$ otherwise.

Each layer must be allocated to exactly one device:

$$\sum_{j=0}^{M-1} X_{i,j} = 1, \quad \forall i$$

The communication time to transmit the activations of layer $i-1$ from device $k$ to device $j$, denoted $t_{comm}^{i-1,k,j}$, is calculated as:

$$t_{comm}^{i-1,k,j} = \begin{cases} \dfrac{O_{i-1}}{B_{k,j}}, & \text{if } k \neq j \\ 0, & \text{otherwise}, \end{cases}$$

where $O_{i-1}$ is the size of the activations of layer $i-1$, and $B_{k,j}$ is the bandwidth between device $k$ and device $j$. If layers $i-1$ and $i$ are on the same node, the communication time is zero.
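As a hedged numerical illustration (the activation size and bandwidth below are hypothetical, not values from the paper): with $O_{i-1} = 16\,\text{MB} = 128\,\text{Mb}$ of activations and a link of $B_{k,j} = 50\,\text{Mbps}$,

$$t_{comm}^{i-1,k,j} = \frac{128\ \text{Mb}}{50\ \text{Mbps}} \approx 2.56\ \text{s},$$

whereas placing layers $i-1$ and $i$ on the same device drives this term to zero.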

The total inference time $T_{tol}$ is the sum of the computation times of all layers plus the communication times between consecutive layers placed on different devices:

$$T_{tol} = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} X_{i,j} \cdot t_{comp}^{i,j} + \sum_{i=1}^{N-1} \sum_{j=0}^{M-1} \sum_{k=0}^{M-1} X_{i-1,k} \cdot X_{i,j} \cdot t_{comm}^{i-1,k,j}$$

where $t_{comp}^{i,j}$ is the computation time of layer $i$ on device $j$.

The problem of minimizing LLM inference latency is formulated as $\min T_{tol}$, subject to the following constraints:

  1. Privacy Constraint: The first layer must be allocated to the source node (device 0) to avoid transmitting raw input data: $X_{0,0} = 1$.
  2. Memory Constraint: The total memory required by all layers allocated to device $j$ cannot exceed its memory budget:
$$\sum_{i=0}^{N-1} X_{i,j} \cdot Req_{i} \leq Mem_{j}$$
where $Req_i$ is the memory consumption of layer $i$, and $Mem_j$ is the memory budget of device $j$.

Solution: Dynamic Programming Algorithm

A dynamic programming (DP) algorithm is designed to solve this problem by leveraging its optimal sub-problem property. Let $DP(i, j)$ denote the minimal total execution time of the first $i$ layers, given that layer $i$ is allocated to device $j$. For all but the last layer, the state transition considers every device $k$ that could host layer $i-1$; for the last layer $N-1$, an additional communication term $t_{comm}^{N-1,j,0}$ accounts for sending the generated token back to the source node (device 0). Equation (6) of the paper captures both cases:

$$DP(i, j) = \begin{cases} \min\limits_{0 \leq k < M} \left( DP(i-1, k) + t_{comp}^{i,j} + t_{comm}^{i-1,k,j} \right), & \text{if } i < N-1 \\ \min\limits_{0 \leq k < M} \left( DP(i-1, k) + t_{comp}^{i,j} + t_{comm}^{i-1,k,j} + t_{comm}^{i,j,0} \right), & \text{if } i = N-1 \end{cases}$$

The initialization for the first layer (layer 0) follows from the privacy constraint ($X_{0,0}=1$): $DP(0, 0) = t_{comp}^{0,0}$. The minimal total execution time for the entire LLM is then

$$\min_{j=0, \dots, M-1} DP(N-1, j),$$

and the optimal allocation strategy is reconstructed by backtracking the choice made at each step of the DP algorithm.

The algorithm to find the optimal LLM partition and allocation strategy for minimizing inference latency (Algorithm 1 in the paper) is presented as follows:

Algorithm 1: Joint device selection and LLM partition for optimizing latency
Input: An LLM model; computing devices $M$; profiled traces; bandwidth $B_{k,j}$
Output: the device selection and LLM partition strategy $R$

// initialization
1 Initialize the DP table $DP(i, j) = INF$ and the choice table $choice(i, j) = NULL$ to record the strategy;
2 Enforce the first layer to be allocated to node 0 by $DP(0, 0) = t_{comp}^{0,0}$ and $choice(0, 0) = 0$;

// fill in the DP table
3 for $i = 1$ to $N - 1$ do
4   for $j = 0$ to $M - 1$ do
5     if $Mem_j \leq Req_i$ then   // device $j$ cannot hold layer $i$
6       Continue;
7     end
8     else
9       for $k = 0$ to $M - 1$ do
10        Calculate the total execution time by Eq. (6) and assign it to $t_{total}$;
11        if $t_{total} \leq DP(i, j)$ then
12          Update $DP(i, j) = t_{total}$;
13          Update memory $Mem_j$;
14          Record the allocation plan $choice(i, j) = k$;
15        end
16      end
17    end
18  end
19 end

// backtrace for the allocation strategy
20 Initialize the optimal strategy $R$;
21 Find the last selected node $N_{last} = \arg\min_{j} DP(N - 1, j)$;
22 Add $(N-1, N_{last})$ to $R$;
23 for $i = N - 1$ down to 1 do
24   $N_{prev} = choice(i, N_{last})$;
25   Add $(i - 1, N_{prev})$ to $R$;
26   Update $N_{last} = N_{prev}$;
27 end
28 Reverse $R$;
29 return $R$.

Two aspects of the paper's pseudocode are underspecified and worth flagging. First, memory handling: line 5 only checks whether device $j$ can hold layer $i$ on its own, whereas constraint (5) requires the cumulative memory of *all* layers assigned to device $j$ to stay within its budget $Mem_j$. The "Update memory $Mem_j$" in line 13 is most naturally read as tracking the remaining memory of device $j$ as layers are assigned to it, although the DP state $DP(i, j)$ itself carries no memory information, so the cumulative constraint is not strictly enforced by the recurrence as written. Second, the backtrace: $R$ is best read as a list of (layer, device) pairs, so line 22 adds the assignment of the last layer, each iteration recovers the device of the preceding layer from the choice table, and the loop stops at layer 1 because layer 0 is fixed on device 0 by the privacy constraint; the list is then reversed to run from the first to the last layer.

**Computational Complexity**: The complexity of Algorithm 1 is $O(N \times M \times M)$, where $N$ is the number of layers and $M$ is the number of devices.
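To make the recurrence concrete, below is a minimal Python sketch of the latency-oriented DP and its backtrace. It is an illustration under stated assumptions, not the authors' implementation: the profiled times are passed in as dense tables, only the simple per-layer memory check from line 5 of Algorithm 1 is applied (the cumulative constraint of Eq. (5) would need extra bookkeeping), and `t_comm[N-1][j][0]` is assumed to hold the cost of returning the generated token to the source node.

```python
import math

def optimize_latency(t_comp, t_comm, req, mem):
    """Sketch of the latency-oriented DP recurrence (Eq. 6). Illustrative only.

    t_comp[i][j]    -- profiled computation time of layer i on device j
    t_comm[i][k][j] -- time to send activations of layer i from device k to j
                       (0 when k == j); t_comm[N-1][j][0] is the cost of
                       returning the generated token to source node 0
    req[i], mem[j]  -- per-layer memory need and per-device budget
    """
    N, M = len(t_comp), len(t_comp[0])
    INF = math.inf
    dp = [[INF] * M for _ in range(N)]
    choice = [[None] * M for _ in range(N)]
    dp[0][0] = t_comp[0][0]                      # privacy constraint: layer 0 on device 0

    for i in range(1, N):
        for j in range(M):
            if mem[j] <= req[i]:                 # device j cannot even hold layer i
                continue
            for k in range(M):
                cand = dp[i - 1][k] + t_comp[i][j] + t_comm[i - 1][k][j]
                if i == N - 1:                   # return the last token to the source node
                    cand += t_comm[i][j][0]
                if cand < dp[i][j]:
                    dp[i][j], choice[i][j] = cand, k

    # backtrace: assignment[i] = device hosting layer i
    j = min(range(M), key=lambda d: dp[N - 1][d])
    best = dp[N - 1][j]
    assignment = [0] * N
    for i in range(N - 1, 0, -1):
        assignment[i] = j
        j = choice[i][j]
    assignment[0] = 0
    return best, assignment
```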

### 4.2.6. Optimize LLM inference throughput

#### Problem Formulation
For optimizing `throughput`, `pipeline parallelism` is adopted. The goal is to maximize throughput, which is equivalent to minimizing the latency of the slowest device, since communication and computation times can be overlapped.
For a selected set of devices $S$, the latency of a device $j \in S$ is determined by the maximum of its computation time and its incoming communication time for the block of layers it hosts:

$$T_{latency}^{j} = \max \left\{ t_{comp}^{i \to m,j},\; t_{comm}^{i-1,k,j} \right\}$$

where $t_{comp}^{i \to m,j}$ is the computation time of layers $i$ to $m$ on device $j$, and $t_{comm}^{i-1,k,j}$ is the communication time to transmit the activations of layer $i-1$ from device $k$ to device $j$.
The problem of maximizing inference throughput is then formulated as minimizing the maximum latency across all selected devices:

$$\min \, \max \{ T_{latency}^{j} \mid j \in S \}$$


#### Solution: Dynamic Programming Algorithm
Similar to latency optimization, this problem also exhibits the `optimal sub-problem property` and is solvable using `dynamic programming`.
Let $g(i, S, k)$ denote the minimum time of the slowest node to process the first $i$ layers, with the set of used devices $S$ and device $k \in S$ being the last node used.
The state transition equation for $g(m, S', j)$ (processing the first $m$ layers with device set $S' = S \cup \{j\}$, where $j$ is the last device) is:

$$g(m, S', j) = \min_{0 \leq i < m \leq N-1,\; k \in S} \max \left\{ g(i, S, k),\; t_{comm}^{i-1,k,j},\; t_{comp}^{i \to m,j} \right\}$$

This equation means that to reach state $(m, S', j)$, we consider all previous states $(i, S, k)$ and pick the one that minimizes the maximum of: (1) the slowest-stage time up to layer $i$ using set $S$ ending with $k$, (2) the communication time from $k$ to $j$, and (3) the computation time on $j$ for layers $i$ to $m$.

Constraints for the state transition:
1.  **Memory Constraint**: The memory required for layers $i$ to $m$ on device $j$ must not exceed its budget:
    $$Req_{i \to m} \leq Mem_{j}$$
    where $Req_{i \to m}$ is the sum of the memory requirements of layers $i$ through $m$.
2.  **Privacy Constraint**: As in the latency case, the first layer stays on device 0:
    $$g(1, \{0\}, 0) = t_{comp}^{0,0}$$
    This initializes the state for the first layer (layer 0) on device 0: $(1, \{0\}, 0)$ denotes processing one layer (layer 0) with device set $\{0\}$ and device 0 as the last device.

The final optimal solution $T_{throu}^{opt}$ is the minimum of $g(N-1, S', j)$ over all possible device sets $S'$ and last devices $j \in S'$.

The algorithm for optimizing throughput (Algorithm 2 in the paper) is provided as follows:

Algorithm 2: Joint device selection and LLM partition for optimizing throughput
Input: An LLM model; computing devices $M$; profiled traces; bandwidth $B_{k,j}$
Output: the device selection and LLM partition strategy $R$

// initialization
1 Initialize the DP table $g(i, S, k) = INF$ and the choice table $choice(m, S, j) = NULL$ to record the strategy;
2 Enforce the first layer to be allocated to node 0 by $g(1, \{0\}, 0) = t_{comp}^{0,0}$ and $choice(1, \{0\}, 0) = (0, 0, 0)$;

// fill in the DP table
3 for $i = 1$ to $N - 1$ do                         // upper layer index of the previous block
4   for each subset $S \subseteq M$ do              // devices used for the first $i$ layers
5     for each last node $k \in S$ do
6       for $m = i + 1$ to $N - 1$ do               // upper layer index of the current block
7         for each $j \in M \setminus S$ do         // new device to host the next block of layers
8           if $\sum_{l=i}^{m} Req_l > Mem_j$ then  // layers $i$ to $m$ exceed device $j$'s memory
9             Continue;
10          end
11          else
12            Get $S'$ by adding node $j$ to the selected device set $S$;
13            Calculate the current maximum stage time $T_{max} = \max \left( g(i, S, k),\; t_{comm}^{i-1,k,j},\; t_{comp}^{i \to m,j} \right)$ via Eq. (11);
14          end
15          if $T_{max} \leq g(m, S', j)$ then
16            $g(m, S', j) = T_{max}$;
17            Record the current strategy $choice(m, S', j) = (i, S, k)$;
18          end
19        end
20      end
21    end
22  end
23 end

// backtrace for the optimal allocation
24 Initialize the optimal strategy $R$;
25 Find the selected device set $S_{opt}$ and the last selected node $N_{last}$ by $S_{opt}, N_{last} = \arg\min_{S, k} \, g(N - 1, S, k)$;
26 Initialize $layer = N - 1$;
27 while $layer > 0$ do
28   $(prev\_layer, prev\_S, prev\_k) = choice(layer, S_{opt}, N_{last})$;
29   Add $(prev\_layer \to layer, N_{last})$ to $R$;   // layers $prev\_layer$ to $layer$ run on device $N_{last}$
30   Update $layer = prev\_layer$, $S_{opt} = prev\_S$, and $N_{last} = prev\_k$;
31 end
32 return $R$.

Note that line 17 records the predecessor state $(i, S, k)$ so that the backtrace in lines 27-31 can recover, for each block of layers, the device that hosts it and the state it was extended from.

**Computational Complexity**: The complexity of Algorithm 2 is $O(N^2 \times 2^M \times M^2)$, where $N$ is the number of layers and $M$ is the number of devices. This is significantly higher than that of Algorithm 1 because it iterates over all subsets of the devices $S$.
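For illustration, here is a minimal Python sketch of the throughput-oriented DP, with device subsets encoded as bitmasks. It is not the authors' code: the block-level compute times `t_comp_block[i][m][j]` and per-link transfer times `t_comm[i][k][j]` are assumed profiling inputs, and, because the paper's block boundaries are slightly ambiguous (layers $i$ to $m$ vs. $i+1$ to $m$), the sketch assigns the non-overlapping block $i{+}1..m$ to the newly added device.

```python
import math

def optimize_throughput(t_comp_block, t_comm, req, mem):
    """Sketch of the subset DP for maximizing throughput (minimizing the
    slowest pipeline stage). Illustrative only; not the authors' code.

    t_comp_block[i][m][j] -- time for device j to compute layers i..m (inclusive)
    t_comm[i][k][j]       -- time to send activations of layer i from device k to j
    req[i], mem[j]        -- per-layer memory need and per-device budget
    Layers are 0..N-1, devices 0..M-1, and device 0 is the source node.
    """
    N, M = len(req), len(mem)
    INF = math.inf
    g = {(0, 1 << 0, 0): t_comp_block[0][0][0]}   # privacy: layer 0 on device 0
    choice = {}

    # prefix sums of per-layer memory, so a block's footprint is O(1) to query
    pref = [0]
    for r in req:
        pref.append(pref[-1] + r)

    for i in range(N - 1):                        # last layer of the previous block
        for (last, mask, k), slowest in list(g.items()):
            if last != i:
                continue
            for j in range(M):
                if mask & (1 << j):               # device j already used
                    continue
                for m in range(i + 1, N):         # device j hosts layers i+1..m
                    if pref[m + 1] - pref[i + 1] > mem[j]:
                        break                     # larger blocks only need more memory
                    stage = max(slowest,
                                t_comm[i][k][j],
                                t_comp_block[i + 1][m][j])
                    key = (m, mask | (1 << j), j)
                    if stage < g.get(key, INF):
                        g[key] = stage
                        choice[key] = (i, mask, k)

    # the optimum is the smallest slowest-stage time among states covering all layers
    best_key = min((key for key in g if key[0] == N - 1), key=lambda key: g[key])
    return g[best_key], choice, best_key
```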

### 4.2.7. Pipeline Execution Optimization
The ideal `pipeline parallel inference` assumes no device idleness. However, `decoder-based LLM inference` has an `autoregressive nature`, meaning token generation is sequential and relies on previously generated tokens. This can lead to `bubbles` (idle time) in the pipeline if a device waits for a complete `micro-batch` to finish all its token generations before starting the next.

*   **EdgeShard-Bubbles**: The traditional pipeline approach where devices wait for full `micro-batches`. As shown in Fig. 5(a), there are periods where devices are idle, leading to resource underutilization.
*   **EdgeShard-No-bubbles**: This proposed optimization aims to reduce these bubbles by allowing `immediate token generation` without waiting for the completion of all `micro-batches` in an iteration. As illustrated in Fig. 5(b), after the `prefill stage` (`P1`) of the first `micro-batch` ends, Device 1 immediately starts the `token generation` (`G1A`) for that `micro-batch`. Similarly, upon completion of `G1A`, Device 1 proceeds to the next iteration of token generation (`G1B`). This approach mitigates device idle time, enhancing resource utilization and improving `throughput`.

    The following figure (Figure 5 from the original paper) contrasts `bubbles` and `no-bubbles` pipeline execution:


Fig. 5. This figure from the paper contrasts pipeline execution with `Bubbles` (top) and `No-bubbles` (bottom) for collaborative LLM inference. In the `Bubbles` approach, devices wait for the completion of entire micro-batches before proceeding, leading to idle times. In the `No-bubbles` approach, devices immediately move to the next token generation step of a micro-batch, reducing idle periods and improving resource utilization.

# 5. Experimental Setup

## 5.1. Datasets
*   **Benchmark Tasks**: The experiments focus on `text generation` tasks, which is a core application of LLMs.
*   **LLM Models**: `Llama2 serial models` are used: `Llama2-7B`, `Llama2-13B`, and `Llama2-70B`. These models, released by Meta in 2023, are popular and powerful open-source LLMs.
*   **Dataset**: The `WikiText-2 dataset [21]` from HuggingFace is utilized.
    *   **Characteristics**: A collection of high-quality articles from Wikipedia, primarily used for language modeling tasks.
    *   **Input/Output Configuration**: A subset of samples is extracted with an `input token length of 32` and `generation of 96 tokens`.
*   **Precision**: All experiments use `full-precision model inference`, avoiding `quantization` to isolate the performance benefits of `EdgeShard`'s collaboration strategy.
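For orientation, the evaluation inputs could be reproduced roughly as follows with HuggingFace `datasets` and `transformers`; the exact sample selection, decoding settings, and model checkpoint name are assumptions rather than details reported in the paper.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical reconstruction of the evaluation inputs: 32-token prompts,
# 96 newly generated tokens, full-precision weights (no quantization).
# Llama-2 checkpoints are gated on the Hugging Face Hub (access approval required).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
text = next(t for t in dataset["text"] if len(t.split()) > 64)  # pick a non-trivial snippet

inputs = tokenizer(text, return_tensors="pt",
                   truncation=True, max_length=32)              # 32-token input prompt
outputs = model.generate(**inputs, max_new_tokens=96, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```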

## 5.2. Evaluation Metrics
The performance of `EdgeShard` and baseline methods are evaluated using two key metrics:
1.  **Latency**:
    *   **Conceptual Definition**: Latency, in this context, refers to the average time taken to generate a single token during the LLM inference process. It measures the responsiveness of the system. Lower latency indicates faster responses, which is critical for real-time interactive applications.
    *   **Unit**: Milliseconds per token (`milliseconds/token`).
    *   **Mathematical Formula**: While not explicitly provided in the paper, average latency per token is typically calculated as:
        $$\text{Average Latency} = \frac{\text{Total Inference Time}}{\text{Total Number of Generated Tokens}}$$
        Where `Total Inference Time` includes both computation and communication time for all layers to generate all tokens, and `Total Number of Generated Tokens` is the number of new tokens produced by the model (e.g., 96 tokens in their setup).
2.  **Throughput**:
    *   **Conceptual Definition**: Throughput measures the rate at which the system can process tokens, specifically the number of tokens generated per second. It indicates the system's capacity to handle workloads, particularly useful for batch processing or serving multiple users concurrently. Higher throughput means more tokens can be generated in a given time period.
    *   **Unit**: Tokens per second (`tokens/second`).
    *   **Mathematical Formula**: While not explicitly provided, throughput is generally calculated as:
        $$\text{Throughput} = \frac{\text{Total Number of Generated Tokens}}{\text{Total Inference Time}}$$
        This is the inverse of latency when considering a single batch/request. For multiple batches/requests, it's the total tokens generated across all batches divided by the total time. The paper mentions `maximum batch size`, implying throughput is measured for the maximum load the system can handle.

## 5.3. Baselines
`EdgeShard`'s performance is compared against several baseline methods, designed to cover different deployment strategies in edge and cloud environments. `Cloud-only` is explicitly excluded as a baseline due to `privacy concerns` regarding raw input transmission.

1.  **Edge-Solo**:
    *   **Description**: In this baseline, the entire LLM is deployed and inferred locally on a `single edge device` without any model partitioning or collaboration.
    *   **Representativeness**: Represents the simplest, fully on-device deployment. It is limited by the memory and computational capacity of that single edge device.

2.  **Cloud-Edge-Even**:
    *   **Description**: The LLM is `evenly partitioned` into two parts. One part is deployed on an `edge device`, and the other on a `cloud server`. The partitioning is fixed and doesn't adapt to device capabilities.
    *   **Representativeness**: Represents a naive `cloud-edge collaboration` strategy, splitting the model equally without optimization.

3.  **Cloud-Edge-Opt**:
    *   **Description**: The LLM is partitioned into two shards, with one on an `edge device` and another on a `cloud server`. However, unlike `Cloud-Edge-Even`, the `partitioning strategy` is optimized using the `dynamic programming algorithm` proposed in the paper (Algorithms 1 or 2), but constrained to only two devices (one edge, one cloud).
    *   **Representativeness**: Represents an optimized `vertical cloud-edge collaboration` and serves as a direct comparison for how `EdgeShard`'s algorithm performs when restricted to a two-device scenario. A toy sketch contrasting the even and optimized two-device splits follows this list.
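The difference between the two cloud-edge baselines can be sketched as follows. This is a toy illustration under assumed inputs (hypothetical profiled per-layer times `edge_layer_ms`/`cloud_layer_ms` and a single activation-transfer cost `comm_ms`), not the paper's Algorithm 1 or 2.

```python
def even_split(num_layers: int) -> int:
    """Cloud-Edge-Even: place the first half of the layers on the edge device."""
    return num_layers // 2

def optimized_split(edge_layer_ms, cloud_layer_ms, comm_ms):
    """Cloud-Edge-Opt (two-device special case): try every cut point and keep
    the one with the lowest estimated per-token latency.

    Layers [0, cut) run on the edge device, layers [cut, n) on the cloud.
    cut == n means full local execution, so no activation crosses the link.
    """
    num_layers = len(edge_layer_ms)
    best_cut, best_latency = num_layers, float("inf")
    for cut in range(num_layers + 1):
        latency = sum(edge_layer_ms[:cut]) + sum(cloud_layer_ms[cut:])
        if cut < num_layers:          # activation must cross the edge-cloud link
            latency += comm_ms
        if latency < best_latency:
            best_cut, best_latency = cut, latency
    return best_cut, best_latency

if __name__ == "__main__":
    edge = [4.0] * 32                  # hypothetical per-layer ms on the edge device
    cloud = [0.5] * 32                 # hypothetical per-layer ms on the cloud GPU
    print(even_split(32))              # 16, regardless of device speeds or link
    print(optimized_split(edge, cloud, comm_ms=2.0))    # fast link: offload heavily
    print(optimized_split(edge, cloud, comm_ms=500.0))  # slow link: stay local
```

With a fast link, the best cut shifts layers toward the faster cloud GPU (the toy even allows a pure-cloud cut, which the paper excludes for privacy reasons); when `comm_ms` dominates, the best cut is `num_layers`, i.e., full local execution, matching the later observation that `Cloud-Edge-Opt` reduces to `Edge-Solo` at 1Mbps.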

## 5.4. Testbed
The experiments are conducted on a `heterogeneous physical testbed` designed to simulate a `collaborative edge computing` environment.

*   **Devices**:
    *   **Edge Devices**: 12 `Jetson AGX Orin` units and 2 `Jetson Orin NX` units.
    *   **Cloud Server**: 1 server equipped with an `RTX 3090 GPU`.
    *   **Total Devices**: 15 devices in total.

        The specifications of these devices are detailed in Table III:
The following are the specifications from [Table III] of the original paper:

| Category | Device | Memory | AI Performance |
| :--- | :--- | :--- | :--- |
| Edge Device | Jetson AGX Orin | 32GB | 3.33 TFLOPS |
| Edge Device | Jetson Orin NX | 16GB | 1.88 TFLOPS |
| Cloud Server | RTX 3090 | 24GB | 36 TFLOPS |
*   **Network Configuration**:
    *   All devices are interconnected via a `router` and a `switch`.
    *   The base bandwidth between any two devices is `1000Mbps`.
    *   The `Linux TC tool [20]` is used to programmatically vary `network bandwidth` and `communication latency` between devices, allowing for testing under diverse network conditions (an illustrative shaping sketch is given at the end of this subsection).
*   **Specific Network Settings for the Overall Evaluation**:
    *   Source node: `AGX Orin`.
    *   Bandwidth between the source node and the cloud server: `1Mbps`.
    *   Bandwidth between the other computing devices: `50Mbps` with a `20% variance`.
*   **Batch Size**: For throughput tests, the `batch size` is set to the `maximum batch size` that the participating devices can support.

Figure 6 of the original paper shows the physical testbed: heterogeneous edge devices and a cloud server, whose specifications are listed in Table III.
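For context on the network emulation step, the sketch below shows how per-link delay and bandwidth shaping with the Linux `tc` tool might be scripted. This is an illustrative assumption, not the authors' actual configuration: the interface name `eth0` and the shaping parameters are placeholders, and running `tc` requires root privileges.

```python
import subprocess

def shape_link(dev: str = "eth0", delay: str = "20ms", rate: str = "50mbit") -> None:
    """Emulate a constrained link on `dev`: netem adds delay, tbf caps the rate."""
    # Remove any existing root qdisc first (ignore the error if none exists).
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=False)
    # netem as the root qdisc for the added delay.
    subprocess.run(["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:",
                    "netem", "delay", delay], check=True)
    # tbf attached beneath netem for the bandwidth cap (burst in bytes).
    subprocess.run(["tc", "qdisc", "add", "dev", dev, "parent", "1:1", "handle", "10:",
                    "tbf", "rate", rate, "burst", "32768", "latency", "400ms"], check=True)

def reset_link(dev: str = "eth0") -> None:
    """Remove all shaping from `dev`."""
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=False)
```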
# 6. Results & Analysis

## 6.1. Core Results Analysis
The experiments evaluate `EdgeShard` against the baseline methods across the `Llama2 models` under varying network conditions and source node capabilities, focusing on `inference latency` and `throughput`.

### 6.1.1. Overall Evaluation (Fixed Network Conditions)
The initial evaluation sets the source node as an `AGX Orin`, with 1Mbps bandwidth between the source node and the cloud server, and 50Mbps (with 20% variance) between the other devices. The batch size is set to the maximum supported by the participating devices.

The following are the results from [Table IV] of the original paper:

| Method | Llama2-7B Latency (ms) | Llama2-7B Throughput (tokens/s) | Llama2-13B Latency (ms) | Llama2-13B Throughput (tokens/s) | Llama2-70B Latency (ms) | Llama2-70B Throughput (tokens/s) |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| Edge-Solo | 140.34 | 24.36 | OOM | OOM | OOM | OOM |
| Cloud-Edge-Even | 227.35 | 7.56 | 319.44 | 4.68 | OOM | OOM |
| Cloud-Edge-Opt | 140.34 | 24.36 | 243.45 | 4.74 | OOM | OOM |
| EdgeShard | 75.88 | 52.45 | 173.43 | 10.45 | 3086.43 | 1.25 |
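A quick back-of-envelope estimate helps explain the OOM entries. This is an approximation (roughly 4 bytes per parameter for full-precision weights, ignoring activations and the KV cache), consistent with the 280GB figure for Llama2-70B cited in the observations below and with the device memories in Table III:

$$
70 \times 10^{9}\ \text{params} \times 4\ \text{bytes/param} \approx 280\ \text{GB} \gg 32\ \text{GB (AGX Orin)} + 24\ \text{GB (RTX 3090)}.
$$

No single device, and no two-device edge-cloud split, can hold the model; the aggregate memory of the full testbed ($12 \times 32 + 2 \times 16 + 24 = 440$ GB), however, can accommodate it once the model is sharded across many devices.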
**Key Observations from Table IV:**

*   **Deployment Capability for Large Models**: For `Llama2-70B`, `Edge-Solo`, `Cloud-Edge-Even`, and `Cloud-Edge-Opt` all suffer from `Out-Of-Memory (OOM)` errors, since Llama2-70B requires about 280GB of memory, far exceeding individual device capacities and simple two-device splits. `EdgeShard`, however, successfully deploys and infers `Llama2-70B` (3086.43 ms latency, 1.25 tokens/s throughput) by splitting it into shards across multiple devices, showing that EdgeShard can break through the memory wall for very large LLMs.
*   **Latency Reduction for Smaller Models**: For `Llama2-7B`, `EdgeShard` achieves `75.88 ms` latency, approximately `1.85x` faster than `Edge-Solo` and `Cloud-Edge-Opt` (both `140.34 ms`) and about `3x` faster than `Cloud-Edge-Even` (`227.35 ms`).
*   **Throughput Improvement for Smaller Models**: For `Llama2-7B`, `EdgeShard` achieves `52.45 tokens/s` throughput, around `2.2x` higher than `Edge-Solo` and `Cloud-Edge-Opt` (`24.36 tokens/s`) and about `7x` higher than `Cloud-Edge-Even` (`7.56 tokens/s`).
*   **Similar Improvements for Llama2-13B**: `EdgeShard` achieves `45.7%` and `28.8%` lower latency, and `2.23x` and `2.2x` higher throughput, compared to `Cloud-Edge-Even` and `Cloud-Edge-Opt`, respectively.
*   **Impact of Limited Cloud Bandwidth**: For `Llama2-7B`, `Cloud-Edge-Opt` performs identically to `Edge-Solo`. The `1Mbps` bandwidth between the source node and the cloud server is so limited that the optimal strategy for `Cloud-Edge-Opt` is full local execution, effectively reducing it to `Edge-Solo` to avoid the high communication cost.

### 6.1.2. Effects of Bandwidth
This experiment varies the bandwidth between the source node (AGX Orin) and the cloud server from 1Mbps to 50Mbps. For `Llama2-13B` and `Llama2-70B`, the baselines `Edge-Solo`, `Cloud-Edge-Even`, and `Cloud-Edge-Opt` are not applicable, either because of `OOM` or because they perform the same as `Edge-Solo` at low bandwidths. `EdgeShard` is additionally compared with its variant `EdgeShard-Even` (evenly partitioned across devices).

The following figure (Figure 7 from the original paper) shows the impact of network bandwidth on the latency of collaborative LLM inference:

![Fig. 7. Impact of Network Bandwidth to Latency of Collaborative LLMs inference](/files/papers/6901877330e1bdf721191ce5/images/7.jpg)
*The figure shows the impact of network bandwidth on collaborative LLM inference latency for Llama2-7B, Llama2-13B, and Llama2-70B, comparing the performance of the deployment schemes at different bandwidths.*

Fig. 7. Impact of Network Bandwidth to Latency of Collaborative LLMs inference

**Observations on Latency (Fig. 7):**

*   **Bandwidth Sensitivity**: Except for `Edge-Solo` (which runs fully locally and is unaffected by the cloud link), the latency of the collaboration-based methods (`Cloud-Edge-Even`, `Cloud-Edge-Opt`, `EdgeShard`) decreases as bandwidth increases, owing to shorter data transmission time.
*   **Bandwidth Saturation**: Latency drops dramatically when bandwidth increases from `1Mbps` to `10Mbps`. Beyond `10Mbps` (up to `50Mbps`), the reduction becomes minor, suggesting that `computation time` becomes the `bottleneck` once communication costs are no longer dominant.
*   **Cloud-Edge vs. Edge-Solo**: When bandwidth exceeds `10Mbps`, the `cloud-edge collaboration methods` (`Cloud-Edge-Even`, `Cloud-Edge-Opt`) start outperforming `Edge-Solo`, because the powerful cloud server accelerates computation enough to outweigh the communication cost.
*   **EdgeShard vs. Cloud-Edge-Opt**: Interestingly, for `Llama2-7B` and `Llama2-13B`, when bandwidth exceeds `10Mbps`, `EdgeShard` shows nearly identical latency to `Cloud-Edge-Opt`. The authors found that under these conditions `EdgeShard`'s optimization algorithm converges to the same model partition and allocation policy as `Cloud-Edge-Opt`; in effect, `Cloud-Edge-Opt` is a special case of `EdgeShard` (using only two devices when that is optimal).
*   **EdgeShard-Even for Llama2-70B**: `EdgeShard` consistently outperforms `EdgeShard-Even` for `Llama2-70B`, demonstrating the benefit of adaptive partitioning even when most devices are homogeneous, particularly with the much more powerful `RTX 3090` cloud server in the mix. The improvement is less pronounced because 11 identical AGX Orin devices participate.

The following figure (Figure 8 from the original paper) shows the impact of bandwidth on the throughput of collaborative LLM inference:

![Fig. 8. Impact of Bandwidth to Throughput of Collaborative LLMs Inference](/files/papers/6901877330e1bdf721191ce5/images/8.jpg)
*The figure has three subplots for Llama2-7B, Llama2-13B, and Llama2-70B; the x-axis is bandwidth (Mbps) and the y-axis is throughput (tokens/s), showing how EdgeShard's throughput varies with bandwidth relative to the other methods.*

Fig. 8. Impact of Bandwidth to Throughput of Collaborative LLMs Inference

**Observations on Throughput (Fig. 8):**

*   **Similar Patterns**: Throughput generally follows the same pattern as latency, increasing with bandwidth for the collaborative methods.
*   **Memory-Driven Throughput for Llama2-13B**: For `Llama2-13B`, `EdgeShard` delivers `2x` higher throughput than `Cloud-Edge-Opt` at `10Mbps`, despite similar latency at higher bandwidths. `Cloud-Edge-Opt` (with only two devices) drives memory consumption on the `RTX 3090` and the `source node` to `95%-98%`, limiting it to a `maximum batch size of 4`. `EdgeShard`, by involving several edge devices, significantly reduces per-device memory consumption, allowing a `larger batch size (8)` and thus higher throughput.
*   **EdgeShard-Even for Llama2-70B**: `EdgeShard-Even` shows stable throughput across bandwidths because its fixed partitioning strategy does not change. `EdgeShard` offers a slight improvement over `EdgeShard-Even` for `Llama2-70B` thanks to adaptive resource utilization.

### 6.1.3. Effects of Source Node
This experiment investigates how the choice of the `source node` (where input tokens originate) impacts performance, comparing `AGX Orin` and `Orin NX` as the source node, with the cloud-source bandwidth fixed at `1Mbps`.

The following figure (Figure 9 from the original paper) shows the impact of the source node:

![Fig. 9. Impact of Source Node](/files/papers/6901877330e1bdf721191ce5/images/9.jpg)
*The figure compares the methods on Llama2-7B, with latency (ms) in the top panel and throughput (tokens/s) in the bottom panel; the two colors distinguish the two source devices (AGX Orin vs. Orin NX), and some methods run out of memory (OOM).*

Fig. 9. Impact of Source Node

**Observations (Fig. 9):**

*   **Memory Constraints with a Weaker Source Node**: When the source node is the `Orin NX` (16GB of memory versus the `AGX Orin`'s 32GB), `Edge-Solo` and `Cloud-Edge-Even` encounter `OOM` errors even for `Llama2-7B`, showing that a single, less capable device cannot host the full model or an evenly partitioned half.
*   **Performance Gap in Cloud-Edge-Opt**: The performance difference between `AGX Orin` and `Orin NX` as the source node is much more pronounced for `Cloud-Edge-Opt` (around a `60ms` latency gap) than for `EdgeShard` (around a `5ms` latency gap).
    *   This is because `Cloud-Edge-Opt`, being a two-device setup, tends to place more layers on the source node; if the source node is the weaker `Orin NX`, it becomes a significant bottleneck.
*   **EdgeShard's Adaptability**: `EdgeShard` involves more devices and strategically places fewer model layers on the source node, effectively filling the computation-capacity gap between heterogeneous source nodes. This yields stable performance regardless of the source node's individual capability.
*   **Throughput Advantage**: Similarly, with `AGX Orin` as the source node, `Cloud-Edge-Opt` achieves `6x` higher throughput than with `Orin NX`, whereas `EdgeShard` shows only a `2x` difference. This further highlights `EdgeShard`'s ability to use the broader pool of network resources to compensate for weaker individual devices.

### 6.1.4. Effects of Pipeline Execution Strategy
This evaluation compares `EdgeShard-No-bubble` against `EdgeShard-Bubble`, with the cloud-source bandwidth set to `1Mbps`.

![Throughput comparison of EdgeShard-No-bubble and EdgeShard-Bubble for Llama2-7B and Llama2-13B](/files/papers/6901877330e1bdf721191ce5/images/10.jpg)
*The figure compares the throughput of Llama2-7B and Llama2-13B under the different methods, with blue and orange bars distinguishing EdgeShard-No-bubble and EdgeShard-Bubble; the EdgeShard variants clearly outperform the Cloud-Edge methods in throughput.*

**Observations (Fig. 10):**

*   **EdgeShard-No-bubble Superiority**: `EdgeShard-No-bubble` consistently outperforms `EdgeShard-Bubble` across all methods for `Llama2-7B` and `Llama2-13B`.
    *   For `Llama2-7B`, `EdgeShard-No-bubble` improves throughput by `0.36 tokens/s` for `Cloud-Edge-Even` and `6.96 tokens/s` for `EdgeShard`. (For `Cloud-Edge-Opt`, local execution is selected at 1Mbps, so no pipeline is involved and the throughput is identical for both strategies.)
    *   For `Llama2-13B`, the improvements are `1.69 tokens/s` for `Cloud-Edge-Even`, `1.89 tokens/s` for `Cloud-Edge` (likely `Cloud-Edge-Opt`), and `5.21 tokens/s` for `EdgeShard`.
*   **Reduced Idle Time**: `EdgeShard-No-bubble` is superior because it mitigates `device idle time` by allowing immediate token generation without waiting for all `micro-batches` in an iteration to complete, which reduces `bubbles` in the pipeline and raises `throughput`.

## 6.2. Data Presentation (Tables)
Table IV has been transcribed in Section 6.1.1.

## 6.3. Ablation Studies / Parameter Analysis
The paper presents several implicit ablation and parameter analyses:

*   **Adaptive vs. Even Partitioning**: The comparison between `EdgeShard` and `EdgeShard-Even` for `Llama2-70B` (in "Effects of Bandwidth") acts as an ablation study of the adaptive partitioning strategy. `EdgeShard`'s superior performance demonstrates the value of its optimization algorithm over a simpler, fixed even split, especially under device heterogeneity.
*   **Impact of Network Bandwidth**: The "Effects of Bandwidth" section analyzes a crucial network parameter. `EdgeShard` adapts to bandwidth changes, converging to the `Cloud-Edge-Opt` strategy when bandwidth is abundant but outperforming it when bandwidth is constrained or when more devices are needed to handle memory and computation. This highlights the adaptability of the dynamic programming approach.
*   **Impact of Source Node Capabilities**: The "Effects of Source Node" section implicitly ablates the contribution of a powerful source node. When the source node is weak (`Orin NX`), `EdgeShard`'s ability to distribute the workload across more devices becomes more critical, demonstrating its robustness against individual device limitations.
*   **Pipeline Execution Strategy**: The comparison between `EdgeShard-Bubble` and `EdgeShard-No-bubble` directly evaluates the effectiveness of the proposed `no-bubbles` optimization. This is a clear ablation showing that reducing idle time in the pipeline significantly boosts throughput; an idealized schedule-length estimate follows this list.
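As a back-of-envelope illustration of why removing bubbles helps, the sketch below compares idealized schedule lengths for the two strategies. It is a simplification under assumed conditions (perfectly balanced stages, unit per-stage time, at least as many micro-batches as stages, communication ignored), not the paper's exact pipeline model.

```python
def bubble_schedule_length(num_stages: int, num_microbatches: int, num_iterations: int) -> int:
    """Bubble-style schedule: every token iteration waits for all micro-batches,
    so each iteration pays its own pipeline fill-and-drain cost."""
    return num_iterations * (num_microbatches + num_stages - 1)

def no_bubble_schedule_length(num_stages: int, num_microbatches: int, num_iterations: int) -> int:
    """No-bubble-style schedule: a micro-batch starts its next token iteration as
    soon as its previous one leaves the last stage, keeping the pipeline full."""
    assert num_microbatches >= num_stages, "pipeline stays full only if M >= S"
    return num_iterations * num_microbatches + num_stages - 1

if __name__ == "__main__":
    S, M, N = 4, 8, 96  # e.g., 4 shards, 8 micro-batches, 96 generated tokens
    print(bubble_schedule_length(S, M, N))     # 96 * (8 + 4 - 1) = 1056 stage-times
    print(no_bubble_schedule_length(S, M, N))  # 96 * 8 + 4 - 1   = 771 stage-times
```

In this idealized setting, the bubble variant pays the fill-and-drain cost once per token iteration, while the no-bubble variant pays it only once in total, which is consistent with the throughput gains reported above.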
These analyses collectively demonstrate that `EdgeShard`'s `adaptive device selection and model partition` (driven by the DP algorithms) and its `pipeline execution optimizations` are both crucial for achieving superior performance in heterogeneous collaborative edge environments.

# 7. Conclusion & Reflections

## 7.1. Conclusion Summary
This work introduces `EdgeShard`, a pioneering framework for the efficient deployment and distributed inference of `Large Language Models (LLMs)` within a `collaborative edge computing` environment. By intelligently partitioning LLMs into `shards` and adaptively allocating them across heterogeneous edge devices and cloud servers, `EdgeShard` effectively addresses the `high latency`, `bandwidth costs`, and `memory limitations` inherent in cloud-centric or single-device edge deployments. At its core are `dynamic programming algorithms` that solve a `joint device selection and model partition problem`, optimizing for either `inference latency` or `throughput`. Extensive experiments on a physical prototype with `Llama2 models` show that `EdgeShard` significantly outperforms the baseline methods, achieving up to `50% latency reduction` and `2x throughput improvement`. The findings highlight `EdgeShard`'s adaptability to diverse network conditions and its ability to run `LLM inference` for models that would otherwise be `out-of-memory` on traditional edge or simple cloud-edge setups.

## 7.2. Limitations & Future Work
The authors acknowledge several open issues and suggest future research directions:

*   **Incentive Mechanisms**: The current framework assumes a cooperative environment in which devices either belong to a single entity or are willing to share resources. For scenarios involving devices owned by different stakeholders, robust `incentive mechanisms` are needed to encourage resource sharing and participation in collaborative inference.
*   **Batch-Size-Aware Optimization**: While `EdgeShard` implicitly benefits from larger batch sizes because sharding frees memory, the current `dynamic programming algorithm` does not treat the `batch size` as an explicit optimization variable. Future work could integrate `batch size` into the optimization problem for further throughput gains.
*   **Dynamic Re-partitioning**: Profiling and scheduling are currently offline steps. In highly dynamic edge environments, network conditions and device availability can change rapidly; `dynamic re-partitioning` or `adaptive rescheduling` mechanisms could cope with such real-time changes.
*   **Energy Efficiency**: Since edge devices are often battery-powered, optimizing for `energy consumption` alongside latency and throughput is a critical future direction, especially for long-term deployments.
*   **Fault Tolerance**: In a distributed system with many heterogeneous devices, device failures and intermittent connectivity can occur. `Fault-tolerant mechanisms` that keep LLM inference running despite device unreliability would therefore be important.

## 7.3. Personal Insights & Critique
This paper presents a highly relevant and well-executed approach to a critical problem in the burgeoning field of `Edge AI`.
The shift from monolithic cloud LLMs to distributed edge deployments is inevitable given the demands for `low latency`, `privacy`, and `bandwidth efficiency`, and `EdgeShard` provides a solid foundational framework for this transition.

One of the paper's strengths is its rigorous experimental validation on a physical heterogeneous testbed, which adds significant credibility over simulation-based studies. The detailed analysis of `bandwidth effects`, `source node capabilities`, and `pipeline execution strategies` provides valuable insights into the practical challenges and opportunities of `collaborative edge computing`. The `no-bubbles` pipeline optimization is a clever detail that addresses a specific characteristic of `autoregressive LLM inference`, showcasing a deep understanding of the problem domain.

A minor critique, as noted in the methodology section, is the slight ambiguity in the pseudo-code of Algorithm 1 regarding the memory constraint and how $Mem_j$ is updated. While such bookkeeping is typically handled within the DP, the exact mechanics could have been spelled out more explicitly for someone implementing the algorithm solely from the paper.

The methods and conclusions of `EdgeShard` are highly transferable. The `dynamic programming approach` for partitioning sequential models across heterogeneous resources could be applied to other layered deep learning models beyond LLMs (e.g., large `vision models` such as `ViT`), or even to other sequential computational tasks in edge environments. The principle of adaptive joint device selection and workload distribution is fundamental to optimizing any complex distributed task.

Beyond the authors' suggested future work, I believe further exploration could include:

*   **Security and Trust**: In a multi-stakeholder scenario, beyond incentives, ensuring the security and trustworthiness of each participating device (especially when it processes sensitive intermediate activations) is paramount. `Homomorphic encryption` or `secure multi-party computation` could be explored, though they often introduce performance overheads.
*   **Quantization Integration**: While `EdgeShard` currently uses full-precision models to avoid accuracy loss, combining collaborative sharding with `quantization` (e.g., different quantization levels for different shards based on device capabilities or sensitivity requirements) could offer even greater flexibility and resource efficiency on extreme edge devices.
*   **Hierarchical Orchestration**: For very large-scale deployments, a hierarchical orchestration layer might be needed to manage clusters of collaborating edge devices, potentially drawing on `federated learning` principles for collaborative model updates or fine-tuning.

Overall, `EdgeShard` is a well-designed and thoroughly evaluated solution that lays important groundwork for the future of `LLM deployment` in increasingly distributed and heterogeneous computing landscapes.
