RServe: Overlapping Encoding and Prefill for Efficient LMM Inference
TL;DR Summary
RServe optimizes LMM inference by overlapping multimodal encoding and language model prefill within requests, while balancing inter-request loads via schedulable tokens in a disaggregated architecture. This fine-grained scheduling reduces latency by up to 66% and boosts throughput by up to 109%.
Abstract
Large multimodal models (LMMs) typically employ an encoding module to transform multimodal data inputs into embeddings, which are then fed to language models for further processing. However, efficiently serving LMMs remains highly challenging due to the inherent complexity of their inference pipelines. Traditional serving engines co-locate the encoding module and the language model, leading to significant resource interference and tight data dependency. Recent studies have alleviated this issue by disaggregating the encoding module from the model, following a design style of prefill-decode disaggregation. Nevertheless, these approaches fail to fully exploit parallelism both within individual requests (intra-request) and across multiple requests (inter-request). To overcome this limitation, we propose REDServe, an LMM inference system that efficiently orchestrates intra- and inter-request pipelines. REDServe is designed to achieve low latency and maximize parallelism at both intra- and inter-request granularities. Built on the disaggregated architecture of the encoding module and language model, REDServe adopts a fine-grained scheduling method that overlaps multimodal encoding with the forward computation of the language model within a single request. For the inter-request pipeline, REDServe leverages schedulable tokens and token budgets to balance computational loads across micro-batches. Combined with chunked prefill, this enables a novel scheduling strategy that coordinates the execution of intra- and inter-request pipelines. Experimental evaluations on representative LMMs show that REDServe achieves substantial latency reduction of up to 66% while improving throughput by up to 109%, significantly outperforming existing serving approaches.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: RServe: Overlapping Encoding and Prefill for Efficient LMM Inference
- Authors: Tianyu Guo, Tianming Xu, Xianjie Chen, Junru Chen, Nong Xiao, Xianwei Zhang
- Affiliations: Sun Yat-sen University and Rednote.
- Journal/Conference: The paper is presented as a preprint on arXiv.
- Publication Year: Likely 2025, consistent with the arXiv identifier (2509.24381, i.e., a September 2025 submission) and the references within the paper.
- Abstract: The paper addresses the inefficiency of serving Large Multimodal Models (LMMs), where the sequential processing of multimodal encoding and language model prefill creates significant latency. Traditional methods either co-locate these modules, causing resource interference, or disaggregate them without fully exploiting parallelism. The authors propose RServe, an LMM inference system that orchestrates intra-request (within a single request) and inter-request (across multiple requests) parallelism. RServe overlaps the encoding of multimodal data with the language model's prefill computation within a single request. For multiple requests, it uses "schedulable tokens" and chunked prefill to balance computational loads and coordinate execution. Experimental results show RServe can reduce latency by up to 66% and increase throughput by up to 109% compared to existing systems.
- Original Source Link:
- Source: https://arxiv.org/abs/2509.24381
- PDF: http://arxiv.org/pdf/2509.24381v1
- Publication Status: Preprint. Note: The paper exhibits a minor inconsistency in naming, with the title and main text using "RServe" while some figures and a sentence in the abstract use "REDServe" or "RedServe". This analysis will use "RServe" as the primary name, following the paper's title.
2. Executive Summary
-
Background & Motivation (Why):
- Core Problem: Serving Large Multimodal Models (LMMs) is inefficient. LMMs first use an encoder to convert inputs like images into embeddings, and then a Large Language Model (LLM) processes these embeddings. This process has two main stages: prefill (processing the input prompt) and decode (generating the output). The encoding step, especially for high-resolution images, is time-consuming and traditionally must be completed before the prefill stage can even begin. This sequential dependency creates a major performance bottleneck, increasing the time it takes to get the first output token.
- Gaps in Prior Work: Existing solutions have tried to separate (disaggregate) the encoder from the LLM to reduce resource conflict. However, they still treat encoding and prefill as sequential steps, failing to overlap these operations. This leaves significant potential for parallelism untapped, both within a single complex request (intra-request) and across a batch of multiple requests (inter-request).
- Innovation: RServe introduces a novel scheduling mechanism that allows the prefill stage of an LMM to start processing parts of a prompt (e.g., text or already-encoded images) while other parts (e.g., subsequent images) are still being encoded. This creates a fine-grained pipeline that overlaps encoding and prefill, drastically reducing idle time and overall latency.
-
Main Contributions / Findings (What):
- Identified Unexploited Parallelism: The paper highlights that the parallelism between multimodal encoding and the LLM's forward pass (prefill) is a critical, yet underutilized, opportunity for optimization in LMM serving.
- Proposed RServe System: An efficient LMM inference system is designed and implemented to orchestrate both intra- and inter-request pipelines.
- Intra-request Pipeline: Overlaps encoding and prefill within a single request using an Embedding Tracker and chunked encoding.
- Inter-request Pipeline: Manages multiple requests simultaneously by introducing the concepts of schedulable tokens and a token budget, which fill processing batches efficiently and eliminate pipeline bubbles.
- Significant Performance Gains: RServe is shown to substantially outperform existing state-of-the-art serving systems. Experimental results demonstrate a latency reduction of up to 66% and a throughput improvement of up to 109%.
3. Prerequisite Knowledge & Related Work
This section explains the core concepts needed to understand the paper's contributions.
-
Foundational Concepts:
-
Large Multimodal Models (LMMs): These are AI models, like GPT-4V or Qwen-VL, that can understand and process information from multiple data types (modalities), such as text, images, audio, and video. As shown in Image 1, a prompt can be combined with various multimodal data from a database before being served to an LMM.
This figure is a schematic of the LMM serving flow. It starts with a prompt, which retrieves multimodal data (such as images, documents, video, and audio) from a database. The original prompt and the retrieved multimodal data are concatenated and then served by the RedServe system; finally, the processed inputs are sent to the LMM for inference, illustrating how a prompt is enriched with diverse multimodal data.
Inference Pipeline (Prefill & Decode): LLM/LMM inference occurs in two phases:
- Prefill: The model processes the entire input prompt (text and multimodal embeddings) in a single, large computation to generate the very first output token. This phase is compute-intensive.
- Decode: The model generates subsequent output tokens one by one, in an autoregressive manner. Each new token depends on all previous tokens. This phase is memory-bandwidth intensive.
-
KV Cache: During prefill, the model calculates intermediate results (attention keys and values) for the input prompt and stores them in GPU memory. This "KV cache" is reused during the decode phase to avoid recomputing the same information, making token generation much faster (see the toy sketch at the end of this list).
Model Parallelism: A set of techniques to run a single large model across multiple GPUs.
- Tensor Parallelism (TP): Splits individual layers of a model across GPUs. It requires high-speed connections (like NVLink) and is excellent for reducing latency but can suffer from high communication overhead.
- Pipeline Parallelism (PP): Splits the entire model layer-wise, assigning different blocks of layers to different GPUs, forming a pipeline. It is better for increasing throughput (processing more requests at once) but traditionally has higher latency for a single request.
-
Chunked Pipeline Parallelism (CPP): An enhancement to PP where a long input prompt is broken into smaller "chunks." These chunks are fed into the pipeline one after another without waiting for the previous chunk to finish all stages. This allows PP to achieve latency comparable to TP while maintaining high throughput.
-
Disaggregated Architecture: To prevent the compute-heavy prefill stage from blocking the lighter decode stage, modern systems physically separate them onto different groups of GPUs. An Encoder-Prefill-Decode (EPD) Disaggregated Architecture takes this further by dedicating separate hardware to the multimodal encoder, the prefill stage, and the decode stage.
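To make the prefill/decode split and the role of the KV cache concrete, here is a tiny, model-free Python sketch (the tuple "projections" and token strings are placeholders, not a real attention implementation): prefill fills the cache for the whole prompt in one pass, and each decode step appends a single entry while reusing everything already cached.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    keys: list = field(default_factory=list)      # one entry per processed token
    values: list = field(default_factory=list)

def toy_prefill(prompt_tokens, cache):
    """Process the whole prompt at once and populate the KV cache."""
    for tok in prompt_tokens:
        cache.keys.append(("K", tok))             # stand-in for the key projection
        cache.values.append(("V", tok))           # stand-in for the value projection
    # The first output token is produced with the full prompt as context.
    return f"<first-token|ctx={len(cache.keys)}>"

def toy_decode_step(last_token, cache):
    """Generate one more token, reusing (not recomputing) the cached keys/values."""
    cache.keys.append(("K", last_token))
    cache.values.append(("V", last_token))
    return f"<token|ctx={len(cache.keys)}>"       # attention spans all cached tokens

cache = KVCache()
t1 = toy_prefill(["What", "is", "in", "the", "image", "?"], cache)
t2 = toy_decode_step(t1, cache)
print(t1, t2, "cached entries:", len(cache.keys))
```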
-
-
Previous Works & Differentiation:
- LLM Serving Systems (vLLM, Orca): Systems like vLLM optimized memory management with PagedAttention, while Orca introduced iteration-level scheduling. However, they were designed for text-only LLMs and do not address the unique challenges of multimodal encoding.
- Prefill-Decode Disaggregation (Splitwise, DistServe, Mooncake): These systems separate prefill and decode onto different hardware, a key architectural principle that RServe builds upon. Mooncake introduced CPP, which is also a foundational technique for RServe. However, none of these systems focused on the initial multimodal encoding stage.
- LMM Serving Systems (ModServe): Recent systems like ModServe also use disaggregation for LMMs but focus on high-level scheduling and auto-scaling for different modalities. Another work introduced the EPD architecture. RServe complements these by proposing a much more fine-grained scheduling mechanism within the inference pipeline to overlap computation, directly targeting the latency bottleneck.
4. Methodology (Core Technology & Implementation)
RServe's core innovation is its ability to intelligently schedule and overlap the encoding and prefill stages. It is built on an EPD disaggregated architecture, where the encoder and the LLM (prefill workers) run on separate GPUs.
Image 3 provides a fantastic overview of how RServe's components progressively improve the LMM inference pipeline, from a simple sequential Vanilla approach to the fully parallelized Token Scheduling method.
This figure is a flow diagram comparing the different scheduling strategies for LMM inference under RServe. It shows the progression from traditional sequential Vanilla execution to step-by-step optimizations via encode-prefill disaggregation, chunked prefill (CPP), and token scheduling. The final token-scheduling approach uses micro-batching to achieve both intra-request and inter-request parallelism, substantially overlapping multimodal encoding with language model prefill to improve processing efficiency.
4.1 Intra-request Pipeline: Overlapping Within a Single Request
For a single request containing multiple multimodal items (e.g., "What is in Image1? Compare it to Image2."), RServe avoids waiting for both images to be encoded before starting the prefill.
-
Embedding Tracker:
- Principle: RServe maintains a tracker for each request that keeps a record of all its input embeddings (text and multimodal). This tracker marks each embedding as either "ready" or "not-ready."
- Procedure:
- When a request arrives, text embeddings are immediately marked as "ready" (as this is a trivial lookup). Placeholders are created for multimodal embeddings, marked as "not-ready."
- As the encoder processes multimodal data (e.g., Image1), it sends the resulting embeddings to the LLM worker.
The Embedding Tracker receives these embeddings, places them in the correct position, and updates their status to "ready."
The scheduler can now immediately start the prefill computation for all contiguous "ready" embeddings.
Once an embedding has been used for prefill, the tracker releases its memory to prevent leaks.
- This allows a flow like: Prefill(Image1 + Text2) runs while Image2 is still being encoded, followed by Prefill(Image2) once its embeddings arrive (a minimal sketch of the tracker follows below).
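Conceptually, the tracker only needs per-slot ready/consumed flags plus a way to extract the longest contiguous run of ready embeddings. Below is a minimal Python sketch under that assumption; the class and method names (EmbeddingTracker, fill, ready_prefix, mark_prefilled) are illustrative and not RServe's actual interfaces.

```python
class EmbeddingTracker:
    """A minimal sketch of per-request embedding tracking (illustrative only)."""

    def __init__(self, segments):
        # segments: list of ("text", embedding) or ("image", None) placeholders.
        self.slots = [{"kind": k, "emb": e, "ready": e is not None, "consumed": False}
                      for k, e in segments]

    def fill(self, index, embedding):
        """Called when the encoder finishes one multimodal item."""
        self.slots[index]["emb"] = embedding
        self.slots[index]["ready"] = True

    def ready_prefix(self):
        """Return the contiguous run of ready, not-yet-prefetched embeddings."""
        batch = []
        for slot in self.slots:
            if slot["consumed"]:
                continue
            if not slot["ready"]:
                break                      # stop at the first not-ready slot
            batch.append(slot)
        return batch

    def mark_prefilled(self, slots):
        """Release embeddings once prefill has consumed them."""
        for slot in slots:
            slot["consumed"] = True
            slot["emb"] = None             # free memory to avoid leaks

# Usage: text is ready immediately, images become ready as the encoder streams them in.
tracker = EmbeddingTracker([("text", "E_text1"), ("image", None),
                            ("text", "E_text2"), ("image", None)])
print(len(tracker.ready_prefix()))         # 1 -> only the leading text is schedulable
tracker.fill(1, "E_image1")
chunk = tracker.ready_prefix()             # text1 + image1 + text2 are now contiguous
tracker.mark_prefilled(chunk)
```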
-
Encoder Scheduling (Algorithm 1):
- Principle: Encoding multimodal items one-by-one is inefficient as it underutilizes the GPU. Encoding all items at once introduces a long initial delay. RServe finds a middle ground by batching.
- Procedure: Multimodal items are collected into a buffer. Once the buffer contains a minimum number of tokens (a configurable embedding batch size), the encoder processes this batch. This ensures the encoder runs efficiently while still producing embeddings in a stream-like fashion, enabling the overlap with prefill.
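The batching idea of Algorithm 1 can be sketched as a simple producer loop: buffer incoming multimodal items until their combined token count reaches a configurable threshold, encode the batch in one forward pass, and stream the embeddings out. The code below is an illustrative approximation; the queue layout, field names, and the encode_batch placeholder are assumptions, not the paper's implementation.

```python
import queue

def encode_batch(items):
    # Placeholder for the real vision/audio encoder forward pass.
    return [{"request": it["request"], "index": it["index"],
             "embedding": f"emb<{it['tokens']} toks>"} for it in items]

def encoder_loop(mm_items, embeddings_out, min_batch_tokens=1024):
    buffer, buffered_tokens = [], 0
    while True:
        item = mm_items.get()              # e.g. {"request": r, "index": i, "tokens": 256, "data": ...}
        if item is None:                   # sentinel: flush whatever is left and stop
            if buffer:
                for emb in encode_batch(buffer):
                    embeddings_out.put(emb)
            break
        buffer.append(item)
        buffered_tokens += item["tokens"]
        if buffered_tokens >= min_batch_tokens:
            for emb in encode_batch(buffer):   # one efficient forward pass over the batch
                embeddings_out.put(emb)        # embeddings stream to the Embedding Tracker
            buffer, buffered_tokens = [], 0

# Usage: pre-load a few items plus a sentinel, then run the loop synchronously.
items, out = queue.Queue(), queue.Queue()
for i, toks in enumerate([256, 512, 384]):
    items.put({"request": "req-0", "index": i, "tokens": toks})
items.put(None)
encoder_loop(items, out, min_batch_tokens=600)
while not out.empty():
    print(out.get())
```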
4.2 Inter-request Pipeline: Orchestrating Multiple Requests
When serving multiple requests, RServe ensures the pipeline stages are always full, maximizing hardware utilization and throughput.
- Token Scheduling (Algorithm 2):
-
Principle: Instead of scheduling entire requests, RServe schedules at the granularity of tokens. It introduces the concept of schedulable tokens: tokens whose embeddings are ready and whose preceding tokens have already been processed.
-
Procedure:
- RServe maintains a global token budget for each processing step (micro-batch).
- In each scheduling cycle, it iterates through waiting requests and calculates the number of schedulable tokens for each, using the Embedding Tracker.
- It fills the micro-batch with schedulable tokens from one or more requests until the token budget is exhausted.
- Requests that are only partially processed are placed back at the front of the queue for the next cycle (a minimal sketch of this loop appears after the figure below).
-
As shown in Image 4, this combined intra- and inter-request pipeline fills the processing micro-batches more effectively than an intra-request pipeline alone, eliminating idle "bubbles" and improving overall efficiency.
This figure (Figure 10) is a schematic comparing RServe's intra-request pipeline alone with the combined intra- and inter-request pipeline. It shows how the latter overlaps multimodal encoding with language model prefill across micro-batches, significantly increasing parallelism and shortening processing time.
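A minimal sketch of the token-budget packing loop is shown below, assuming each waiting request can report its current number of schedulable tokens (e.g., via its Embedding Tracker). Names and data structures are illustrative rather than RServe's actual code.

```python
from collections import deque

def build_micro_batch(wait_queue, token_budget):
    """Fill one micro-batch up to token_budget, mixing tokens from several requests."""
    micro_batch, used, unfinished = [], 0, []
    while wait_queue and used < token_budget:
        req = wait_queue.popleft()
        available = req["schedulable_tokens"]          # contiguous ready tokens not yet prefetched
        take = min(available, token_budget - used)
        if take > 0:
            micro_batch.append((req["id"], take))
            used += take
            req["schedulable_tokens"] -= take
        if req["schedulable_tokens"] > 0 or not req["fully_encoded"]:
            unfinished.append(req)                     # still has work: schedule again later
    for req in reversed(unfinished):                   # partially processed requests go back first
        wait_queue.appendleft(req)
    return micro_batch

# Example: a 4k-token budget filled from two requests; request B is only partly consumed.
queue_ = deque([{"id": "A", "schedulable_tokens": 3000, "fully_encoded": True},
                {"id": "B", "schedulable_tokens": 2500, "fully_encoded": False}])
print(build_micro_batch(queue_, token_budget=4096))    # [('A', 3000), ('B', 1096)]
```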
-
4.3 Implementation
-
RServe is implemented on top of gLLM, a lightweight LLM/LMM serving framework.
-
The overall workflow is shown in Image 2. User requests are handled by an Encoder Scheduler, which feeds multimodal data to parallel encoders. The resulting embeddings are managed by the Embedding Tracker and scheduled by the Token Scheduler into parallel Prefill workers (a rough wiring sketch follows at the end of this subsection).
This figure (Figure 11) shows the overall workflow of the RServe prototype: user-submitted multimodal inputs (text, audio, images, video) pass through encoder scheduling and parallel encoding to produce multimodal embeddings, which then flow through the Embedding Tracker and Token Scheduler into the parallel Prefill stage, reflecting the overlap and parallelism of encoding and prefill in LMM inference.
-
The intra-request pipeline optimization is versatile and can be used with Tensor Parallelism, Pipeline Parallelism, or even on a single GPU. The inter-request pipeline is designed to work specifically with Pipeline Parallelism.
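For intuition, the following is a rough, in-process sketch of how such a workflow could be wired together with queues and threads: an encoder worker streams embeddings toward a scheduler, which packs schedulable tokens into micro-batches for prefill. All component names and the single-request bookkeeping are simplifications, not the gLLM/RServe implementation.

```python
import queue
import threading

mm_items = queue.Queue()        # raw multimodal items awaiting encoding
embeddings = queue.Queue()      # encoded embeddings -> Embedding Tracker
micro_batches = queue.Queue()   # token micro-batches -> prefill workers

def encoder_worker():
    while True:
        item = mm_items.get()
        if item is None:
            embeddings.put(None)
            return
        # Placeholder for a batched encoder forward pass (see the Algorithm 1 sketch).
        embeddings.put({"request": item["request"], "index": item["index"],
                        "embedding": f"emb<{item['tokens']}>", "tokens": item["tokens"]})

def scheduler_worker(token_budget=2048):
    ready_tokens = {}                      # stand-in for per-request Embedding Trackers
    while True:
        emb = embeddings.get()
        if emb is None:
            micro_batches.put(None)
            return
        ready_tokens[emb["request"]] = ready_tokens.get(emb["request"], 0) + emb["tokens"]
        # Pack whatever is schedulable into one micro-batch, respecting the budget.
        batch, used = [], 0
        for req, toks in list(ready_tokens.items()):
            take = min(toks, token_budget - used)
            if take:
                batch.append((req, take))
                used += take
                ready_tokens[req] -= take
        micro_batches.put(batch)

threading.Thread(target=encoder_worker, daemon=True).start()
threading.Thread(target=scheduler_worker, daemon=True).start()
for i, toks in enumerate([1024, 1500, 800]):
    mm_items.put({"request": "req-0", "index": i, "tokens": toks})
mm_items.put(None)
while (b := micro_batches.get()) is not None:
    print("prefill micro-batch:", b)
```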
5. Experimental Setup
- Datasets:
-
MMMU: A large-scale benchmark with diverse multimodal inputs (images, charts, diagrams) that require expert-level reasoning. Image 8 shows the input length distribution for this dataset.
This figure (Figure 15) is a histogram of input lengths in the MMMU dataset at 1K and 2K resolution. At 1K resolution (about 5k multimodal tokens), input lengths cluster around 8k-9k with a mean near 8.5k; at 2K resolution (about 9k multimodal tokens), inputs are longer, mostly 11k-13k with a mean near 12.5k, showing that higher resolution yields longer input token sequences.
SGLang Benchmark: An open-source benchmark used to simulate a cloud service environment with request arrivals following a Poisson distribution.
-
- Evaluation Metrics:
- Time to First Token (TTFT):
- Conceptual Definition: Measures the latency from the moment a request is sent until the first output token is generated. It is a critical metric for user-perceived responsiveness in interactive applications. Lower is better.
- Throughput:
- Conceptual Definition: Measures the rate at which the system can process input tokens, typically expressed in tokens per second (toks/s). It reflects the system's overall processing capacity. Higher is better.
- SLO Attainment:
- Conceptual Definition: Measures the percentage of requests that are completed within a predefined Service Level Objective (SLO), such as a maximum allowed TTFT (e.g., 10 seconds). This metric is crucial for evaluating a system's reliability and performance under real-world service constraints. Higher is better.
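For reference, the three metrics can be computed from simple per-request logs as sketched below; the record field names (arrival_time, first_token_time, finish_time, input_tokens) are assumptions for illustration.

```python
def ttft_seconds(req):
    return req["first_token_time"] - req["arrival_time"]

def summarize(requests, slo_ttft_s=10.0):
    ttfts = [ttft_seconds(r) for r in requests]
    wall_clock = max(r["finish_time"] for r in requests) - min(r["arrival_time"] for r in requests)
    return {
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        "throughput_toks_per_s": sum(r["input_tokens"] for r in requests) / wall_clock,
        "slo_attainment_pct": 100.0 * sum(t <= slo_ttft_s for t in ttfts) / len(ttfts),
    }

requests = [
    {"arrival_time": 0.0, "first_token_time": 4.2, "finish_time": 9.0, "input_tokens": 8500},
    {"arrival_time": 1.0, "first_token_time": 12.5, "finish_time": 18.0, "input_tokens": 12500},
]
print(summarize(requests))   # one request misses the 10 s TTFT SLO -> 50% attainment
```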
- Baselines:
- vLLM: A widely used, high-performance LLM inference engine. Used with both Tensor Parallelism (TP4) and Pipeline Parallelism (PP4).
- gLLM: The lightweight framework on which RServe is built. Used as a performance baseline.
- gLLM-epd: A version of gLLM modified to use the Encoder-Prefill-Decode (EPD) disaggregated architecture, but without RServe's overlapping scheduler.
- RServe-intra: An ablation of RServe that only includes the intra-request pipeline optimization, to isolate the benefits of the inter-request pipeline.
6. Results & Analysis
6.1 Core Results (with Pipeline Parallelism)
The main experiments compare RServe against baselines using a 4-GPU pipeline parallelism setup (PP4) with one additional GPU for encoding (E1).
-
Latency (TTFT): As seen in Image 5, RServe consistently achieves the lowest TTFT across all request rates.
-
Compared to vLLM using Tensor Parallelism (TP4), which suffers from high communication overhead, pipeline-based systems like gLLM (PP4) and RServe are much faster.
The EPD architecture (gLLM-epd) provides a solid improvement over a standard pipeline (gLLM).
RServe builds on this, achieving a further 18-19% TTFT reduction over gLLM-epd by overlapping encoding and prefill. This advantage is most pronounced at lower request rates.
This figure (Figure 12) compares the latency of the vLLM, gLLM, and REDServe LMM inference systems. On a log scale, it plots TTFT against request rate for Qwen2.5-VL-72B at 1K and 2K resolutions. REDServe shows the lowest TTFT in all tested configurations, clearly outperforming vLLM and gLLM and confirming its latency advantage.
-
-
Throughput: Image 6 shows that RServe also achieves the highest throughput.
-
It sustains a higher number of tokens per second than all baselines, reaching over 11,000 toks/s in the 2K resolution test. This demonstrates that its scheduling not only reduces latency but also improves overall system capacity.
This figure (Figure 13) compares the throughput of vLLM, gLLM, and RServe. Two subplots show throughput versus request rate for Qwen2.5-VL-72B at (a) 1K and (b) 2K resolution. In both, REDServe (green) achieves the highest throughput, clearly surpassing gLLM-epd (blue), gLLM (red dashed), and vLLM (black and purple) once the request rate passes a certain threshold, confirming its efficiency and degree of parallelism in LMM serving.
-
-
SLO Attainment: In Image 7, RServe maintains a much higher percentage of requests within the 10-second TTFT SLO, especially as the request rate increases. The area under its curve is 23% larger than that of
gLLM-epd, indicating superior and more reliable performance under load.
This figure (Figure 14) compares the SLO attainment of gLLM and REDServe across request rates, with separate plots for 1K and 2K resolution. REDServe achieves higher SLO attainment at both resolutions, with the advantage growing at high request rates, indicating superior performance.
6.2 Ablations / Parameter Sensitivity
-
Impact of Embedding Batch Size: Image 9 explores the trade-off in the encoder scheduling.
-
For high-quality multimodal data (many tokens per item), a smaller batch size is better, as it allows for more fine-grained overlap between encoding and
prefill. -
For low-quality data (few tokens per item), there is a sweet spot. A very small batch size makes encoding inefficient, while a very large one reduces the opportunity for overlap. This shows the batch size is a tunable parameter depending on the workload.
This figure (Figure 16) shows the impact of different embedding batch sizes on RServe's throughput and TTFT: (a) high-quality multimodal data (1024 toks per MM item) and (b) low-quality data (32 toks per MM item). In both cases RServe maintains high throughput and low TTFT across most embedding batch sizes. The rightmost bar, "gLLM-epd", represents running prefill only after all multimodal data has been encoded; its TTFT is markedly higher than RServe's, especially for high-quality data.
-
-
Impact of Inter-request Pipeline: Image 10 compares the full RServe system with
RServe-intra. The results are stark: the full RServe system delivers 32% higher throughput and substantially lower latency (a 172% TTFT gap, i.e., RServe-intra's latency is roughly 2.7x that of the full system). This proves that the inter-request token scheduling is critical for eliminating pipeline bubbles and achieving high performance.
This figure (Figure 17) is an ablation study comparing RedServe and RedServe-intra in throughput (toks/s) and TTFT (s) across request rates. RedServe shows higher throughput (dark blue bars) and lower TTFT (red line) at every request rate, clearly outperforming RedServe-intra (light blue bars and orange dashed line), demonstrating its advantage in both efficiency and latency.
Functional Study: The authors verified that RServe's optimizations do not harm the model's output quality. The score on the MMMU benchmark was virtually identical across all systems.
This is a manual transcription of Table 1 from the paper.
| Framework  | vLLM | gLLM | gLLM-epd | RServe |
|------------|------|------|----------|--------|
| MMMU Score | 62.7 | 62.6 | 62.4     | 62.6   |
7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully identifies and addresses a key bottleneck in LMM serving: the sequential execution of multimodal encoding and LLM prefill. By proposing RServe, a system that facilitates fine-grained overlapping of these two stages, the authors demonstrate a powerful method for reducing latency and increasing throughput. RServe's design, combining an intra-request
Embedding Tracker with an inter-request Token Scheduler, provides a comprehensive solution that significantly outperforms existing approaches.
-
Limitations & Future Work:
- The paper does not explicitly state limitations. However, potential ones could include:
- Overhead: The Embedding Tracker and fine-grained token scheduling might introduce minor computational overhead, although the results suggest the benefits far outweigh any cost.
- Complexity: The scheduling logic is more complex than traditional batching, which could make implementation and debugging more challenging.
- Generalization: While tested on images, its performance on other modalities with different encoding characteristics (e.g., long videos or audio streams) is not explored.
-
Personal Insights & Critique:
- Novelty and Impact: The core idea of overlapping encoding and prefill is both intuitive and highly effective. It represents a significant step forward in optimizing LMM inference and is likely to be adopted by future serving systems.
- Clarity and Presentation: The paper is well-written, and the use of diagrams (especially Image 3 and 4) is excellent for illustrating the core concepts. The methodical evaluation, including ablations, convincingly supports the authors' claims. The minor inconsistency in the system's name (RServe vs. REDServe) is a small distraction but does not detract from the quality of the work.
- Future Directions: This work opens up several interesting avenues. RServe's scheduling could be combined with other optimization techniques like speculative decoding or model quantization. Furthermore, the scheduling logic could be made adaptive, dynamically adjusting parameters like the embedding batch size based on real-time workload characteristics to achieve an even better balance between latency and throughput.