SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
TL;DR Summary
SP-MoE introduces an SD-aware expert offloading framework using speculative expert prefetching and cutoff-layer policy, pipelining computation and communication to reduce memory and bandwidth bottlenecks, achieving up to 3.5× inference speedup on MoE models.
Abstract
The Mixture-of-Experts (MoE) architecture has been widely adopted in large language models (LLMs) to reduce computation cost through model sparsity. Employing speculative decoding (SD) can further accelerate MoE inference by drafting multiple tokens per step and verifying them in parallel. However, combining MoE with SD inflates GPU memory and aggravates CPU-GPU bandwidth contention during multi-token verification. Existing MoE offloading systems are SD-agnostic and do not address this bottleneck. We present SP-MoE, the first SD-aware expert-offloading and compute-communication pipelining framework. SP-MoE introduces: (1) speculative expert prefetching that exploits structural correspondence between the draft and target models to prefetch likely experts ahead of verification; (2) a cutoff-layer policy that bounds per-layer prefetch depth based on empirical profiles and an analytical latency model, guaranteeing just-in-time availability without overfetch; and (3) a pipelined runtime with asynchronous prefetch threads and batched I/O to hide loading latency. Extensive experiments demonstrate that SP-MoE achieves a 1.07-3.5 times TPOT speedup over state-of-the-art methods across diverse datasets, environments, and MoE-based models.
In-depth Reading
1. Bibliographic Information
1.1. Title
SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
The title clearly states the paper's focus: accelerating the inference (generation process) of Mixture-of-Experts (MoE) based models. It achieves this by combining two key techniques: Speculative Decoding (SD) and a novel Prefetching strategy.
1.2. Authors
The authors are Liangkun Chen, Zijian Wen, Tian Wu, and Xiaoxi Zhang from Sun Yat-sen University, and Chuan Wu from The University of Hong Kong. Their affiliations with well-regarded universities in computer science and engineering suggest a strong background in systems and machine learning optimization.
1.3. Journal/Conference
The paper is available on arXiv, a preprint server, which means it has not yet undergone formal peer review for a conference or journal. The listed date of 2025-10-11 is the arXiv submission timestamp. The work appears targeted at a top-tier venue in machine learning systems (e.g., MLSys, OSDI, SOSP) or computer architecture (e.g., ISCA, ASPLOS).
1.4. Publication Year
The paper is a preprint posted to arXiv in October 2025 (arXiv:2510.10302).
1.5. Abstract
The abstract summarizes the core problem and solution. The Mixture-of-Experts (MoE) architecture reduces computation in Large Language Models (LLMs) but is very large. Speculative Decoding (SD) can speed up inference but, when combined with MoE, worsens memory usage and creates a bottleneck in the communication between the CPU and GPU. Existing systems for offloading (moving parts of the model to CPU memory) are not designed for SD (SD-agnostic).
To solve this, the authors propose SP-MoE, the first framework that is SD-aware. It introduces three main innovations:
- Speculative Expert Prefetching: It predicts which experts will be needed during the verification stage of SD by using information from the drafting stage.
- Cutoff-Layer Policy: It uses an analytical model to determine how many layers deep to prefetch experts, preventing the system from fetching too much data and slowing down.
- Pipelined Runtime: It uses asynchronous operations to hide the time it takes to load experts from the CPU.

Experiments show that SP-MoE provides a 1.07x to 3.5x speedup in Time Per Output Token (TPOT) compared to existing methods.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.10302
- PDF Link: https://arxiv.org/pdf/2510.10302v1.pdf
- Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
- Core Problem: Running extremely large language models (LLMs) is slow and memory-intensive. The standard autoregressive decoding process, where the model generates text one token at a time, is a major bottleneck because it is inherently sequential and underutilizes powerful GPUs.
- Existing Solutions & Gaps:
- Mixture-of-Experts (MoE): This architecture replaces large, dense layers in an LLM with many smaller "expert" layers. For each token, only a few experts are activated, reducing the total computation required. However, this dramatically increases the total model size (e.g., Mixtral 8x7B is much larger than a dense 7B model), making it impossible to fit on a single consumer GPU. This necessitates parameter offloading, where inactive experts are stored in slower CPU memory and loaded onto the GPU only when needed. Offloading, in turn, creates a new bottleneck: the limited communication bandwidth of the CPU-GPU PCIe bus.
- Speculative Decoding (SD): This technique accelerates inference by using a small, fast "draft" model to generate a sequence of candidate tokens. The large, powerful "target" model then verifies these tokens in a single parallel step. This reduces the number of slow, sequential steps required from the target model.
- The New Challenge (Entry Point): The paper identifies a critical new problem that arises when combining MoE and SD. During the SD verification stage, multiple draft tokens are processed in parallel. Each of these tokens may require a different set of experts to be loaded from the CPU. This simultaneous demand for many experts aggravates the CPU-GPU bandwidth contention, creating a severe I/O bottleneck. Existing MoE offloading systems are SD-agnostic: they are not designed to handle this multi-token verification workload and cannot exploit the unique structure of the SD process.
- Innovative Idea: The authors' core insight is to leverage the drafting stage of SD, which was previously an untapped resource. During this stage, the target model is idle, and so is the CPU-GPU bus. The paper proposes to use this idle time to predict and prefetch the experts that the target model will likely need for the upcoming verification stage. This turns a bottleneck into an optimization opportunity.
2.2. Main Contributions / Findings
The paper introduces SP-MoE, the first system designed specifically to optimize MoE inference in the context of speculative decoding.
- Primary Contributions:
- Drafting-Stage Speculative Prefetching: SP-MoE pioneers a novel prefetching mechanism that runs during the SD drafting stage. It uses the internal states (attention outputs) of the draft model to predict which experts the target model will need, exploiting the structural similarity between the two models.
- Analytical Cutoff-Layer Policy: To prevent over-prefetching (which can cause cache thrashing and I/O contention), the paper develops an analytical model that calculates a cutoff layer. This policy determines the optimal number of layers for which to prefetch experts, ensuring they are loaded just-in-time without overwhelming the system.
- Fully Pipelined Runtime: SP-MoE implements an efficient runtime system that uses an asynchronous worker thread and batched I/O operations. This decouples expert loading from model computation, effectively hiding the I/O latency and maximizing bandwidth utilization.
- Key Findings:
- SP-MoE significantly outperforms state-of-the-art MoE offloading systems when they are combined with SD. It achieves a 1.07x to 3.5x speedup in Time Per Output Token (TPOT).
- The effectiveness of the approach is demonstrated across a wide range of MoE models (Mixtral, Phi-MoE, Deepseek), datasets, and hardware environments (from consumer-grade RTX 3090 to datacenter-grade A100).
- The system is particularly effective in resource-constrained environments, showing the most significant gains on GPUs with less memory and bandwidth.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, one must be familiar with the following concepts:
- Autoregressive Decoding: This is the standard method for text generation in LLMs. The model produces output one token at a time. To generate the next token, the model takes all previously generated tokens as input. This process is repeated until a special "end-of-sequence" token is generated or a maximum length is reached. Its sequential nature makes it slow.
- Mixture-of-Experts (MoE): An MoE model is a type of sparse neural network architecture. In a standard LLM, a Transformer block contains a dense Feed-Forward Network (FFN) layer. In an MoE model, this FFN layer is replaced by a set of smaller FFNs called experts and a gating network.
- Gating Network: This is a small neural network that, for each input token, dynamically selects which experts to use. Typically, it chooses the top-$k$ experts (e.g., $k=2$ out of 8 in Mixtral).
- Experts: These are the parallel FFNs. A token is processed only by the experts selected by the gating network.
- Benefit: This reduces the amount of computation per token, as only a fraction of the model's parameters are used.
- Drawback: The total number of parameters is much larger, leading to huge memory requirements.
- Speculative Decoding (SD): A technique to speed up autoregressive decoding. It uses two models:
- Draft Model: A small, fast model (e.g., a distilled version of the target model).
- Target Model: The original, large, high-quality LLM.
The process works in two stages per iteration (a minimal sketch follows this list):
- Drafting: The draft model autoregressively generates a short sequence of candidate tokens (the "draft").
- Verification: The target model takes the original input plus the draft tokens and processes them all in a single, parallel forward pass. It then checks how many of the draft tokens it would have generated itself. The longest prefix of the draft that matches the target's predictions is accepted. This is faster because the slow target model performs fewer sequential steps.
- Parameter Offloading and Prefetching:
- Offloading: When a model is too large to fit in a GPU's VRAM, some of its parameters (e.g., MoE experts) are stored in the host system's main memory (CPU RAM) or even on an SSD. They are loaded onto the GPU only when needed. This is a trade-off: it enables running larger models but introduces significant latency from data transfer over the PCIe bus.
- Prefetching: To mitigate this latency, prefetching is used. The system tries to predict which parameters will be needed soon and starts loading them from CPU to GPU before they are explicitly required for computation. If the prediction is accurate, the loading time can be hidden behind other ongoing computations.
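To make the two-stage SD loop concrete, here is a minimal Python sketch of one speculative-decoding iteration. It is illustrative only: `draft_next_token`, `target_forward_parallel`, and `draft_len` are hypothetical stand-ins, not functions from the paper or any specific library.

```python
from typing import Callable, List

def speculative_step(
    context: List[int],
    draft_next_token: Callable[[List[int]], int],
    target_forward_parallel: Callable[[List[int]], List[int]],
    draft_len: int = 4,
) -> List[int]:
    """One drafting + verification iteration with greedy acceptance."""
    # Drafting: the small model proposes draft_len tokens autoregressively.
    draft: List[int] = []
    for _ in range(draft_len):
        draft.append(draft_next_token(context + draft))

    # Verification: one parallel pass of the target model. We assume it returns
    # draft_len + 1 predictions, where target_preds[i] is the token the target
    # would emit after seeing context + draft[:i].
    target_preds = target_forward_parallel(context + draft)

    # Accept the longest prefix of the draft that matches the target's choices.
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if target_preds[i] != tok:
            break
        accepted.append(tok)

    # The target's own token at the first mismatch (or the bonus token after a
    # full match) is appended, so every iteration yields at least one token.
    accepted.append(target_preds[len(accepted)])
    return context + accepted
```

The speedup comes from the target model running only once per iteration while potentially several drafted tokens are accepted.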
3.2. Previous Works
The paper builds upon and differentiates itself from three main categories of research:
- Expert Management for MoE Models:
- Mixtral-Offloading: A practical library that enables running large MoE models on consumer GPUs. It uses a simple Least Recently Used (LRU) caching policy to manage which experts are kept in GPU memory. When a needed expert is not on the GPU, it is loaded on-demand, and the least recently used one is evicted.
- MoE-Infinity: This system introduces prefetching based on historical activation patterns. It tracks which experts have been frequently used for a given sequence and prefetches them, assuming future tokens in the same sequence will follow a similar pattern.
- AdapMoE: A more advanced system that uses a gating predictor. After computing layer $l$, it uses that layer's output to predict which experts will be activated in the next layer, $l+1$. This allows prefetching for the next layer while the current layer's experts are being used.

Limitation of all three: They are SD-agnostic. They are not designed to handle the burst of expert loading requests from multi-token verification in SD, nor do they leverage the drafting stage as an optimization opportunity.
- Efficient Speculative Decoding:
SpecExec, Medusa, Eagle: These methods focus on improving the core SD algorithm. For example, Medusa adds extra "decoding heads" to the model to generate multiple draft tokens in parallel, while SpecExec builds a tree of candidate tokens. These works are orthogonal to SP-MoE. Their goal is to generate better drafts, while SP-MoE's goal is to optimize the system-level execution of an MoE model given a draft. The techniques are complementary and could be combined.
- General LLM Inference Systems:
vLLM, DeepSpeed-Inference: These are highly optimized systems for serving standard (dense) LLMs. They use techniques like paged attention and efficient memory management. However, they lack specialized support for the unique challenges of MoE models, such as dynamic expert loading and routing.
3.3. Technological Evolution
The field has evolved from optimizing dense models to tackling the specific challenges of sparse MoE models.
- Dense LLM Inference: Focus on parallelism and memory management (e.g., vLLM).
- MoE Models Emerge: Need to handle massive parameter counts, leading to offloading systems (Mixtral-Offloading).
- Optimizing Offloading: Simple on-demand loading is slow, leading to prefetching systems based on historical patterns (MoE-Infinity) or intra-model prediction (AdapMoE).
- Parallel Acceleration: At the same time, Speculative Decoding emerges as a powerful way to speed up generation for all LLMs.
- The Intersection: This paper, SP-MoE, operates at the intersection of these trends. It recognizes that naively combining SD and MoE offloading creates a new system bottleneck and proposes the first solution specifically designed to resolve it.
3.4. Differentiation Analysis
The core innovation of SP-MoE compared to its closest relatives (MoE-Infinity and AdapMoE) is its SD-aware design.
- SP-MoE vs. AdapMoE: AdapMoE prefetches experts for layer $l+1$ during the computation of layer $l$. This provides a very short window of time to hide the I/O latency. In contrast, SP-MoE prefetches experts for multiple layers (layer 0 up to the cutoff layer $L$) during the entire drafting stage. This provides a much larger time window to overlap computation and communication, allowing it to hide more latency.
- SP-MoE vs. MoE-Infinity: MoE-Infinity relies on coarse-grained historical data for an entire sequence to predict experts. This can be inaccurate, especially when the topic or style of the text shifts. SP-MoE uses a much more fine-grained and immediate signal: the attention output from the draft model for the current set of draft tokens. This leads to more accurate, just-in-time predictions.
- Key Differentiator: SP-MoE is the only system that exploits the unique two-stage structure of speculative decoding, using the idle I/O time during drafting to prepare for verification.
4. Methodology
The methodology of SP-MoE is centered around using the draft model's execution as a "crystal ball" to predict the target model's needs and pre-loading the necessary experts.
The overall system architecture is shown in the figure below.
This figure is a schematic of Figure 6 from the paper, showing the overall architecture of the SP-MoE framework. It depicts the input/output flow and layer organization of the draft and target models, highlighting how the system profiler, prefetcher, and gating mechanism cooperate to optimize MoE inference efficiency.
4.1. Principles
The core idea of SP-MoE is to transform the drafting stage of speculative decoding from an idle I/O period into a productive prefetching window. The system is built on three key observations:
- Observation I (Predictability): Neighboring tokens in a sequence often activate similar experts. Because the draft model and target model are architecturally similar, the internal states of the draft model can be used to accurately predict the expert activations of the target model.
- Observation II (Harm of Over-prefetching): Greedily prefetching experts for too many future layers is counterproductive. It can lead to GPU cache thrashing (evicting a useful expert just to load another one that might not be used) and I/O contention, which can delay critical on-demand loading tasks.
- Observation III (Opportunity in Drafting): The drafting stage takes a non-trivial amount of time, during which the CPU-GPU communication bus is largely idle. This idle time is a perfect opportunity to perform expert prefetching without interfering with the main model's computation.
4.2. Core Methodology In-depth
4.2.1. Expert Predictor
This module is responsible for predicting which experts will be needed in the verification stage. It consists of a cross-model predictor and a cutoff-layer design.
1. Cross-Model Predictor:
The key innovation here is to use components from both the draft and target models. During the drafting stage, for each layer $l$, SP-MoE performs the following (a minimal code sketch is given below):
- It intercepts the attention output from layer $l$ of the draft model.
- It feeds this output directly into the gating network of layer $l$ from the target model.
- The target model's gating network then produces scores for each expert, and SP-MoE identifies the top-$k$ experts as "critical" candidates for prefetching.

This works because the draft and target models are chosen to be structurally similar (as shown in Table 1 of the paper), and their internal representations (like attention outputs) are highly correlated (high cosine similarity, as shown in Figure 7a).
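A minimal PyTorch sketch of this cross-model prediction step, assuming the target model's per-layer routers are exposed as a list of `torch.nn.Linear` modules (`target_gates`) and `s` is the draft model's layer-$l$ attention output; the names are illustrative, not SP-MoE's actual API:

```python
import torch

@torch.no_grad()
def predict_critical_experts(s: torch.Tensor,
                             target_gates: list,
                             layer_idx: int,
                             top_k: int) -> list:
    """Feed the draft model's layer-l attention output into the target model's
    layer-l gate and return the indices of the predicted top-k experts."""
    # (num_draft_tokens, hidden) -> (num_draft_tokens, num_experts)
    expert_scores = target_gates[layer_idx](s)
    # Union of the top-k experts across all drafted tokens in this step.
    topk_idx = expert_scores.topk(top_k, dim=-1).indices
    return sorted(set(topk_idx.flatten().tolist()))
```

Taking the union over drafted tokens matters because verification processes all draft tokens in one parallel pass, so any expert chosen for any of them may be needed.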
2. Cutoff Layer Design:
To prevent over-prefetching (Observation II), SP-MoE only prefetches experts for the first $L$ layers, where $L$ is the cutoff layer. The goal is to choose the largest $L$ that does not violate memory or time constraints. The constraints are formalized as follows:
- Memory Constraint: The total memory used by the model and the prefetched experts must not exceed the GPU's memory capacity. $ M_{peak} + N_{expert} \cdot M_{expert} < M_{GPU} $ where:
  - $N_{expert}$ is the total number of experts to be prefetched up to layer $L$ (the sum of $k_l$ over the prefetched layers).
  - $k_l$ is the number of experts prefetched for layer $l$.
  - $M_{peak}$ is the peak memory usage of the model without the prefetched experts.
  - $M_{expert}$ is the memory size of a single expert.
  - $M_{GPU}$ is the total available GPU memory.
- Time Constraint: The time taken to prefetch experts up to layer $L$ must not exceed the total time available during the drafting stage. This ensures prefetching does not delay the start of the verification stage. $ \max \{ (L - 1) \cdot t_{comp} + k_{L} \cdot t_{I/O} ,\ N_{expert} \cdot t_{I/O} \} \le L_{all} \cdot t_{comp} $ where:
  - $t_{comp}$ is the per-layer computation time in the draft model.
  - $t_{I/O}$ is the time to load one expert from CPU to GPU.
  - $L_{all}$ is the total number of layers in the draft model.
  - The $\max$ term captures the prefetch completion latency, which could be limited by either the pipeline depth (a mix of compute and I/O) or the total I/O transfer time, whichever is longer.
  - The right side, $L_{all} \cdot t_{comp}$, represents the total time of the drafting stage.

In practice, SP-MoE approximates $k_l$ with a fixed value $k$ (e.g., the number of experts activated per token) and solves for the maximum $L$ that satisfies these constraints based on offline profiled system characteristics ($t_{comp}$, $t_{I/O}$).
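A small numerical sketch of how the cutoff layer could be chosen from these constraints, assuming offline-profiled constants and the fixed-$k$ approximation; the function name and example values are invented for illustration, not the paper's measured numbers:

```python
def max_cutoff_layer(t_comp: float, t_io: float, k: int, l_all: int,
                     m_peak: float, m_expert: float, m_gpu: float) -> int:
    """Largest cutoff layer L satisfying both constraints (0 if none is feasible)."""
    best = 0
    for L in range(1, l_all + 1):
        n_expert = k * L  # fixed-k approximation of the total prefetched experts
        mem_ok = m_peak + n_expert * m_expert < m_gpu
        # Pipeline-depth latency vs. pure I/O latency, whichever dominates.
        prefetch_latency = max((L - 1) * t_comp + k * t_io, n_expert * t_io)
        time_ok = prefetch_latency <= l_all * t_comp
        if mem_ok and time_ok:
            best = L
        else:
            break
    return best

# Example with invented numbers: 2 ms per draft layer, 1.5 ms per expert load,
# k = 2, a 32-layer draft model, 20 GB peak usage, 0.35 GB per expert, 24 GB GPU.
print(max_cutoff_layer(2e-3, 1.5e-3, 2, 32, 20.0, 0.35, 24.0))  # memory-bound here
```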
3. Prediction Algorithm: The overall prediction logic is summarized in Algorithm 1.
| Algorithm 1: Expert Prediction and Prefetching Task Queue Management | |
|---|---|
| | Input: prefetching task queue Qload, attention output s, cutoff layer L, current layer l, gating networks Gates, k, critical experts Ecritical, cached queue Qcache, cuda.Event, cuda_expert_stream |
| 1 | if MLP of the l-th layer drafting is triggered and l ≤ L then |
| 2 | expert_scores ← Gates[l](s); Ecritical ← TopK_Index(expert_scores, k) |
| 3 | for expert in Ecritical do |
| 4 | if expert in Qcache then |
| 5 | Ecritical.remove(expert) |
| 6 | cuda.Event.record(cuda_expert_stream) |
| 7 | Qload.push_back(Ecritical, cuda.Event) |
Explanation of Algorithm 1:
- Line 1: The process triggers for each MoE layer (MLP) during drafting, but only for layers up to the cutoff layer L.
- Line 2: It uses the draft model's attention output and the target model's gating network Gates[l] to predict critical experts Ecritical.
- Lines 3-5: It checks if any of the predicted experts are already in the GPU cache (Qcache). If so, they are removed from the list of experts to be prefetched.
- Lines 6-7: For the remaining experts that need to be loaded, it records a CUDA synchronization event and pushes the prefetching task (the list of experts and the event) onto a shared queue Qload. This queue is consumed by the Expert Prefetcher module.
4.2.2. Expert Prefetcher
This module is responsible for executing the prefetching tasks efficiently. Its design focuses on maximizing the overlap between I/O and computation.
The workflow of SP-MoE's worker-based prefetcher is contrasted with a simpler "vanilla" prefetcher in Figure 8 from the paper.

1. Continuous Expert Prefetching via Worker Thread:
Instead of blocking computation to perform I/O, SP-MoE uses a dedicated background worker thread, named the Prefetcher. This thread runs on a separate CUDA stream.
- The main thread (running the draft model) simply pushes prefetching tasks to the Qload queue and continues its computation.
- The Prefetcher thread continuously pulls tasks from Qload and executes the CPU-to-GPU data transfers.

This design decouples I/O from computation, allowing the total computation time of the entire drafting stage to hide the total I/O time, rather than only overlapping I/O with a single layer's computation.
2. Queue Synchronization for Reliability:
To ensure that the Prefetcher thread does not read incomplete data from the queue, a cuda.Event is used for synchronization.
- When the main thread pushes a task to Qload, it records an event on its CUDA stream after the push is complete.
- The Prefetcher thread, after popping a task, waits for the associated event to be signaled. This guarantees that the task information in the queue is fully written and ready to be processed.
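A minimal sketch of this producer/consumer handshake using PyTorch CUDA events and a Python queue; the structure (names, timeout, the `load_fn` callback) is illustrative, not the paper's runtime:

```python
import queue
import threading
import torch

q_load: "queue.Queue" = queue.Queue()        # shared prefetching task queue
expert_stream = torch.cuda.Stream()          # dedicated copy stream for the Prefetcher

def enqueue_prefetch(critical_experts: list) -> None:
    """Producer (main/draft thread): record an event, then publish the task."""
    ev = torch.cuda.Event()
    ev.record(torch.cuda.current_stream())   # marks the point at which the task is ready
    q_load.put((critical_experts, ev))

def prefetcher_loop(load_fn, stop: threading.Event) -> None:
    """Consumer (Prefetcher thread): wait on the event, then copy on its own stream."""
    while not stop.is_set():
        try:
            experts, ev = q_load.get(timeout=0.01)
        except queue.Empty:
            continue
        with torch.cuda.stream(expert_stream):
            ev.wait(expert_stream)           # copies start only after the recorded point
            load_fn(experts)                 # issues the non-blocking H2D transfers
```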
3. Batched I/O Operations:
To further reduce overhead, SP-MoE batches I/O operations. Instead of launching a separate data transfer for each expert, it groups all experts predicted for a given layer and transfers them in a single, larger operation. This minimizes the kernel launch overhead associated with many small cudaMemcpyAsync calls. When prefetching experts, experts from the GPU cache must be selected for eviction (using an LRU policy) to make space. This eviction-replacement is also done in a batched manner.
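One way to realize the batched transfer, sketched under the assumption that all experts for a layer share a shape and are staged through one contiguous pinned buffer so the replacement becomes a single host-to-device copy (names such as `gpu_buffer` are hypothetical):

```python
import torch

def batched_expert_copy(cpu_experts: list,
                        gpu_buffer: torch.Tensor,
                        copy_stream: torch.cuda.Stream) -> None:
    """Transfer all experts predicted for one layer in a single H2D copy.

    cpu_experts: per-expert weight tensors of identical shape on the host.
    gpu_buffer:  preallocated (num_slots, *expert_shape) device tensor whose
                 slots were just freed by the LRU eviction step.
    """
    n = len(cpu_experts)
    # One contiguous pinned staging tensor -> one underlying async memcpy,
    # instead of n separate small transfers. (A real system would store the
    # experts contiguously up front to avoid this extra host-side copy.)
    staging = torch.stack(cpu_experts).pin_memory()
    with torch.cuda.stream(copy_stream):
        gpu_buffer[:n].copy_(staging, non_blocking=True)
```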
4. Prefetching Execution Algorithm:
The logic of the Prefetcher worker thread is shown in Algorithm 2.
| Algorithm 2: Prefetching Execution Algorithm | |
|---|---|
| | Input: prefetching task queue Qload, cache queue Qcache, experts to load Eload, experts to evict Eevict |
| 1 | while LLM inference is not completed do |
| 2 | if Qload is not empty then |
| 3 | // Step 1: fetch the critical expert loading tasks from the queue. |
| 4 | Eload, cuda.Event ← Qload.pop() |
| 5 | cuda.Event.wait() |
| 6 | N ← len(Eload) |
| 7 | // Step 2: select an equal number of evicted experts to replace the prefetched experts. |
| 8 | for i = 1 to len(Qcache) do |
| 9 | Eevict.append(Qcache[i]) |
| 10 | if len(Eevict) == N then |
| 11 | break |
| 12 | // Step 3: batch-replace the prefetched experts. |
| 13 | copy_non_blocking(Eload, Eevict) |
| 14 | for i = 1 to N do |
| 15 | Qcache.move_to_end(Eload[i]) |
Explanation of Algorithm 2:
- Lines 1-2: The worker thread runs in a continuous loop.
- Lines 4-5: It pops a task from the queue and waits on the cuda.Event to ensure data integrity.
- Lines 7-11: It identifies experts to evict from the GPU cache based on an LRU policy (the ones at the front of the Qcache queue).
- Line 13: It executes a non-blocking, batched copy operation to load the new experts Eload from CPU to GPU, overwriting the memory of the evicted experts Eevict.
- Lines 14-15: It updates Qcache, moving the newly loaded experts to the most-recently-used end of the LRU order.
5. Experimental Setup
5.1. Datasets
The authors used four standard LLM benchmarks to evaluate SP-MoE, covering a range of tasks:
- HumanEval: A code generation benchmark consisting of 164 programming problems. It tests the model's ability to generate correct Python code from docstrings.
- BigBench: A broad benchmark with 204 diverse tasks designed to measure the general reasoning and understanding capabilities of LLMs.
- WikiText-103: A large-scale language modeling dataset derived from Wikipedia articles. It is used to test performance on long-context generation.
- MMLU-Pro: A more challenging version of the popular MMLU benchmark, testing expert-level knowledge across a broad range of subjects such as mathematics, history, and law.
These datasets were chosen to ensure the evaluation is comprehensive and not limited to a single domain.
5.2. Evaluation Metrics
The primary metric used is TPOT (Time Per Output Token).
- Conceptual Definition: TPOT measures the average wall-clock time required to generate a single token during the decoding phase of inference. It is a direct measure of generation speed or latency. A lower TPOT value is better, indicating faster performance.
- Mathematical Formula: $ \text{TPOT} = \frac{T_{\text{total\_decode}}}{N_{\text{generated\_tokens}}} $
- Symbol Explanation:
  - $T_{\text{total\_decode}}$: The total time elapsed from the end of the initial prompt processing (prefill) until the final token is generated.
  - $N_{\text{generated\_tokens}}$: The total number of tokens generated during the decoding phase.
Another key metric analyzed is the Expert Hit Rate.
- Conceptual Definition: The hit rate is the percentage of times that an expert required for computation is already present in the GPU memory (cache). A higher hit rate means less on-demand loading from the CPU, which should ideally lead to lower latency.
- Mathematical Formula: $ \text{Hit Rate} = \frac{\text{Number of expert activations found in GPU cache}}{\text{Total number of expert activations}} \times 100\% $
- Symbol Explanation:
  - An "expert activation" occurs every time the gating network routes a token to a specific expert.
5.3. Baselines
SP-MoE is compared against three state-of-the-art MoE offloading systems, which the authors adapted to work with speculative decoding to ensure a fair comparison:
- Mixtral-Offloading + SD: This baseline represents a simple offloading strategy with an LRU cache. It loads experts on-demand when a cache miss occurs.
- MoE-Infinity + SD: This baseline uses a prefetching strategy based on historical, sequence-level expert activation patterns.
- AdapMoE + SD: This baseline uses a more sophisticated prefetching strategy that predicts experts for the next layer based on the current layer's outputs.

These baselines represent the spectrum of existing techniques, from simple reactive offloading to proactive, intra-model prefetching.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. End-to-End Performance
The core results demonstrate that SP-MoE consistently outperforms all baselines across different datasets, models, and hardware environments.
The following chart (Figure 9 from the paper) compares the TPOT on four datasets.

Analysis of Figure 9:
- SP-MoE (green bar) consistently achieves the lowest TPOT (fastest performance) in all scenarios.
- The average speedup is significant, around 1.35x over the collection of baselines. The peak speedup is 1.75x against Mixtral-Offloading on the HumanEval dataset with an RTX 3090.
- The performance gains are most pronounced on the RTX 3090 (Env. 1), which is the most resource-constrained GPU among the testbeds. This highlights SP-MoE's effectiveness in environments where I/O bottlenecks are most severe.

The next chart (Figure 10 from the paper) shows performance across different MoE models.

Analysis of Figure 10:
- The performance advantage of SP-MoE holds across all three model types: Mixtral, Phi-MoE, and Deepseek.
- For the Deepseek-Lite model, SP-MoE achieves a remarkable 3.5x speedup over Mixtral-Offloading on an A100 GPU. This is likely due to the high prediction accuracy for the Deepseek model pair, combined with its smaller expert size, which allows for more aggressive and effective prefetching.
- Even against the strongest baseline, AdapMoE, SP-MoE provides consistent improvements, ranging from 9.3% to 31.6% depending on the model and hardware.
6.1.2. Hit Rate Evaluation
The paper analyzes expert hit rates to understand the mechanism behind the performance gains.
The following are the results from Table 3 of the original paper:
| Dataset | Mixtral 8×7B | | | | Phi-MoE | | | | Deepseek | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | MO | MI | AdapMoE | SP-MoE | MO | MI | AdapMoE | SP-MoE | MO | MI | AdapMoE | SP-MoE |
| HumanEval | 15.08% | 16.01% | 41.83% | 18.93% | 35.37% | 15.28% | 56.22% | 44.31% | 14.60% | 16.57% | 18.74% | 36.85% |
| Bigbench | 15.14% | 15.83% | 42.82% | 21.39% | 22.36% | 14.71% | 45.14% | 41.38% | 17.50% | 16.34% | 21.80% | 41.25% |
| Wikitext_103 | 15.14% | 15.76% | 42.55% | 21.19% | 28.37% | 14.62% | 50.39% | 43.22% | 21.53% | 17.27% | 25.15% | 42.20% |
| MMLU_Pro | 14.73% | 15.87% | 41.35% | 21.06% | 24.30% | 14.07% | 45.97% | 41.97% | 17.04% | 16.74% | 21.72% | 39.92% |
| Average | 15.02% | 15.87% | 42.14% | 20.89% | 27.60% | 14.67% | 49.43% | 42.72% | 17.67% | 16.73% | 21.85% | 40.06% |
Analysis of Table 3:
- For the Deepseek model, SP-MoE achieves the highest average hit rate (40.06%), which directly contributes to its excellent performance.
- Interestingly, for Mixtral and Phi-MoE, AdapMoE achieves a higher hit rate than SP-MoE (e.g., 42.14% vs. 20.89% for Mixtral). Despite this, SP-MoE still has a lower TPOT (is faster). This is a crucial finding: a higher hit rate does not guarantee better performance if the overhead of prefetching is too high. AdapMoE's synchronous, blocking prefetch mechanism incurs significant overhead that negates the benefit of its higher hit rate. In contrast, SP-MoE's fully pipelined, asynchronous runtime allows it to achieve better end-to-end performance even with a lower hit rate, because its prefetching is more efficient.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Impact of SP-MoE Components
Figure 12 shows an ablation study that breaks down the contribution of each optimization in SP-MoE.

Analysis of Figure 12:
- Baseline (s): This is vanilla offloading with SD, which is very slow.
- + vp (vanilla prefetch): Adding a simple, blocking prefetch during the drafting stage already provides a substantial speedup (e.g., 1.68x for Mixtral). This confirms the core idea of using the drafting stage is effective.
- + wp (worker prefetch): Replacing the blocking prefetch with SP-MoE's asynchronous worker thread provides another significant performance boost. This demonstrates the value of decoupling I/O and computation.
- + b (batched I/O): Finally, adding batched I/O gives a small additional improvement by reducing kernel launch overhead.

The largest gain comes from the drafting-stage prefetch concept and the worker thread implementation.
6.2.2. Impact of the Cutoff Layer
Figure 14 investigates how the choice of cutoff layer affects performance.

Analysis of Figure 14:
- For Mixtral and Phi-MoE, which have large experts, the TPOT exhibits a U-shaped curve.
  - When the cutoff layer is too small (e.g., < 5), not enough experts are prefetched, and performance is limited by on-demand loading.
  - As the cutoff layer increases, TPOT decreases (performance improves) up to an optimal point (around 20).
  - Beyond this point, increasing the cutoff layer makes performance worse. This is because the system is over-prefetching: the I/O time exceeds the drafting stage duration, and the GPU cache begins to thrash, validating the need for the cutoff layer policy.
- For DeepSeek, which has small experts, the TPOT consistently decreases as the cutoff layer increases. The U-shape is not observed because the loading time for its experts is so small that the drafting stage window is always large enough to hide the I/O, even when prefetching for all layers.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper presents SP-MoE, a novel and effective system for accelerating the inference of MoE-based LLMs that use speculative decoding. The core problem identified is the severe I/O bottleneck caused by multi-token verification in SD when combined with expert offloading.
SP-MoE addresses this by being the first SD-aware framework. Its main contributions are:
- A drafting-stage prefetching mechanism that accurately predicts expert usage.
- A cutoff-layer policy to prevent harmful over-prefetching.
- A highly efficient pipelined runtime to hide I/O latency.

Through extensive experiments, the authors demonstrate that SP-MoE achieves state-of-the-art performance, delivering speedups of up to 3.5x over existing methods. The work provides a robust solution for running large sparse models efficiently on resource-constrained hardware.
7.2. Limitations & Future Work
The authors acknowledge several limitations and areas for future research:
- Large Batch Sizes: The current work focuses on a batch size of 1, which prioritizes single-request latency. The behavior with large batch sizes, a common scenario in throughput-oriented serving, is more complex, as expert activation patterns across different requests in a batch are uncorrelated. This makes caching and prefetching more challenging.
- Sequential Drafting: SP-MoE was evaluated with greedy, sequential drafting. Extending it to support more advanced tree-based drafting (e.g., SpecExec) or sampling-based decoding would broaden its applicability.
- Prefetching Accuracy: While the current predictor is effective, there is room to improve accuracy, perhaps by considering cross-layer dependencies or adaptive gating mechanisms.
- System Integration: SP-MoE could be integrated with other system optimizations like request batching, scheduling, and advanced memory management to build a comprehensive, large-scale MoE serving system.
7.3. Personal Insights & Critique
- Strengths:
- The paper's core insight—using the SD drafting stage for prefetching—is both elegant and highly effective. It identifies a previously overlooked optimization opportunity at the intersection of two major acceleration techniques.
- The system design is very thoughtful. It not only introduces a powerful idea but also includes crucial safeguards like the cutoff-layer policy to prevent the optimization from backfiring.
- The experimental evaluation is exceptionally thorough, covering multiple models, hardware platforms, and datasets, and includes insightful ablation studies that clearly justify each design choice. The analysis of hit rate vs. TPOT is particularly revealing.
- Potential Issues and Areas for Improvement:
- Offline Profiling Dependency: The cutoff-layer calculation relies on offline profiling of $t_{comp}$ and $t_{I/O}$. In a real-world serving environment where system load can vary, these values might change. A dynamic, online mechanism that adjusts the cutoff layer based on real-time system monitoring could make the system more robust.
- Focus on Latency: The evaluation is heavily skewed towards latency (batch size = 1). While important, a discussion on the implications for throughput would have made the paper more complete. The challenges of large batches mentioned in the future work section are non-trivial and represent a significant hurdle for practical deployment.
- Generality of the Predictor: The cross-model predictor relies on a strong structural similarity between the draft and target models. While this holds for the model pairs tested, the approach might be less effective if the draft model is architecturally very different (e.g., an RNN drafting for a Transformer).
- Inspirations and Transferability: The central principle of SP-MoE, identifying and exploiting idle periods in a computational pipeline to pre-prepare resources for a future stage, is a classic systems design pattern. This idea is highly transferable to other domains beyond LLM inference, such as multi-stage data processing pipelines, compilers, and operating systems. The use of a decoupled, asynchronous worker thread to hide I/O latency is another fundamental technique that serves as a great example for any system dealing with I/O-bound tasks. This paper is an excellent case study in system-level co-design, where an understanding of both the application (LLM inference) and the underlying hardware (GPU/CPU architecture) leads to significant performance breakthroughs.