
Olympian: Scheduling GPU Usage in a Deep Neural Network Model Serving System


TL;DR Summary

The paper presents Olympian, which extends TensorFlow-Serving to enable fair sharing of a single GPU among multiple large DNNs with low overhead, achieving interleaved execution in 1-2 ms and greatly reducing latency unpredictability.

Abstract

Deep neural networks (DNNs) are emerging as important drivers for GPU (Graphical Processing Unit) usage. Routinely, now, cloud offerings include GPU-capable VMs, and GPUs are used for training and testing DNNs. A popular way to run inference (or testing) tasks with DNNs is to use middleware called a serving system. Tensorflow-Serving (TF-Serving) is an example of a DNN serving system. In this paper, we consider the problem of carefully scheduling multiple concurrent DNNs in a serving system on a single GPU to achieve fairness or service differentiation objectives, a capability crucial to cloud-based TF-Serving offerings. In scheduling DNNs, we face two challenges: how to schedule, and switch between, different DNN jobs at low overhead; and, how to account for their usage. Our system, Olympian, extends TF-Serving to enable fair sharing of a GPU across multiple concurrent large DNNs at low overhead, a capability TF-Serving by itself is not able to achieve. Specifically, Olympian can run concurrent instances of several large DNN models such as Inception, ResNet, GoogLeNet, AlexNet, and VGG, provide each with an equal share of the GPU, while interleaving them at timescales of 1-2 ms, and incurring an overhead of less than 2%. It achieves this by leveraging the predictability of GPU computations to profile GPU resource usage models offline, then using these to achieve low overhead switching between DNNs.


In-depth Reading


1. Bibliographic Information

1.1. Title

Olympian: Scheduling GPU Usage in a Deep Neural Network Model Serving System

1.2. Authors

  • Yitao Hu (University of Southern California)
  • Swati Rallapalli (IBM Research)
  • Bongjun Ko (IBM Research)
  • Ramesh Govindan (University of Southern California)

1.3. Journal/Conference

The specific venue is not named in the provided text. The paper's structure and content are consistent with a conference publication in computer systems, networking, or machine learning systems; venues in these areas are typically highly reputable and influential.

1.4. Publication Year

The publication year is not explicitly stated in the provided text.

1.5. Abstract

Deep neural networks (DNNs) are significant drivers of Graphics Processing Unit (GPU) usage, with cloud platforms increasingly offering GPU-enabled virtual machines (VMs). A common method for performing DNN inference (testing) is through middleware known as a serving system, such as Tensorflow-Serving (TF-Serving). This paper addresses the challenge of scheduling multiple concurrent DNNs on a single GPU within a serving system to achieve objectives like fairness or service differentiation, which are critical for cloud-based TF-Serving offerings.

The authors identify two main challenges in DNN scheduling:

  1. Efficiently scheduling and switching between different DNN jobs with low overhead.

  2. Accurately accounting for GPU usage by these jobs.

The proposed system, Olympian, extends TF-Serving to enable fair sharing of a GPU among multiple concurrent large DNNs while maintaining low overhead, a capability that TF-Serving alone lacks. Olympian achieves this by leveraging the predictable nature of GPU computations. It profiles GPU resource usage models offline, then uses these models to facilitate low-overhead switching between DNNs.

Specifically, Olympian is demonstrated to run concurrent instances of large DNN models (e.g., Inception, ResNet, GoogLeNet, AlexNet, VGG), provide each with an equal share of the GPU, interleave them at timescales of 1-2 milliseconds, and incur an overhead of less than 2%.

/files/papers/691d6e408106199151d7e8c0/paper.pdf

This appears to be a link to a PDF hosted on a local file system or an internal server rather than a public, official publication page. The paper's formal publication status (e.g., published at a conference or workshop, or released as a preprint) is not stated in the provided information.

2. Executive Summary

2.1. Background & Motivation

The proliferation of deep neural networks (DNNs) has made Graphics Processing Units (GPUs) indispensable computational resources. Cloud providers now routinely offer GPU-capable Virtual Machines (VMs), and GPUs are central to both training and inference (also known as testing or prediction) of DNNs. For inference tasks, model serving systems act as middleware, allowing multiple inference requests to share GPU resources. Tensorflow-Serving (TF-Serving) is a prominent example of such a system.

The core problem the paper aims to solve is the unpredictable execution times for concurrent DNN inference tasks when run on a single GPU within a serving system like TF-Serving.

  • Why this problem is important:
    • Low GPU Utilization: Practical DNN applications often exhibit intermittent and bursty GPU usage, leading to underutilization. Serving systems multiplex these tasks to improve utilization.
    • Unpredictable Latency: In TF-Serving, multiple concurrent jobs can submit GPU kernels to the underlying GPU driver. Since the driver cannot distinguish between kernels belonging to different DNNs or client requests, jobs with identical computational needs can finish at vastly different times (e.g., up to 1.7x difference observed in experiments). This variability makes it extremely difficult for developers to engineer latency-sensitive user-facing applications that rely on model servers.
    • Lack of Service Differentiation: The inability to control DNN execution precisely prevents model server operators from implementing crucial features like priority scheduling or weighted scheduling. These capabilities are essential for offering differentiated services and potentially new revenue streams in cloud-based TF-Serving offerings.
  • Specific challenges/gaps:
    • GPU Driver Limitations: The underlying GPU driver schedules kernels submitted by different jobs without knowledge of which DNN they belong to.

    • High Overhead of Traditional GPU Preemption: While GPU preemption could enforce scheduling, it is known to be expensive, especially for massively parallel GPU contexts. Instruction-level preemption (e.g., in NVIDIA Pascal architecture) can take hundreds of microseconds for a context switch.

    • Middleware-level Control: The TF-Serving middleware is the only point in the stack that understands the high-level structure and requirements of DNNs (e.g., Tensorflow graph and node abstractions). This knowledge is critical for effective scheduling and isolation.

      The paper's entry point is to explore a novel strategy: time-slicing GPU usage at the middleware level rather than relying on lower-level hardware preemption. This allows TF-Serving to gain control over DNN execution and address the predictability and differentiation challenges.

2.2. Main Contributions / Findings

The paper introduces Olympian, a system that extends TF-Serving to provide fine-grained, predictable GPU scheduling for concurrent DNN inference tasks. Its primary contributions and findings are:

  • Middleware-level Time-Slicing: Olympian demonstrates that carefully time-slicing GPU usage at the TF-Serving middleware layer (specifically at the Tensorflow node boundary) is a viable and effective strategy for achieving isolation and predictability without requiring specialized hardware support or changes to GPU drivers.
  • Low-Overhead Scheduling: It achieves efficient scheduling and switching between DNN jobs at very low overhead. Experiments show Olympian can interleave DNNs at timescales of 1-2 milliseconds with an overhead of less than 2%.
  • Offline Profiling for Predictability: Olympian leverages the predictable nature of GPU computations in DNNs to profile GPU resource usage models offline. This offline profiling eliminates the high overhead of online monitoring (which can inflate execution times by 21-29%) and enables accurate estimation of quantum duration and cost accumulation rates.
  • Cooperative Co-scheduling: Olympian employs cooperative co-scheduling, where the entire "gang" of CPU threads associated with a DNN job is suspended and resumed. This approach avoids expensive hardware preemption while maintaining control.
  • Achieving Fairness and Service Differentiation: Olympian successfully enables:
    • Fair Sharing: Providing nearly identical finish times for homogeneous workloads and fair distribution of GPU duration for diverse, heterogeneous workloads, which TF-Serving alone cannot achieve (Figure 11, Figure 14, Figure 16).
    • Weighted Fair Sharing: Implementing policies where jobs receive GPU access proportional to assigned weights (Figure 17).
    • Priority Scheduling: Allowing higher-priority jobs to execute before lower-priority ones (Figure 18).
  • Portability and Compatibility: The implementation preserves the portability of TF-Serving to different GPUs and supports models originally written in other frameworks (e.g., Caffe converted to Tensorflow), demonstrated by experiments on different hardware (Figure 21).
  • Minimal Utilization Sacrifice: Despite introducing temporal multiplexing, Olympian minimally sacrifices GPU utilization (around 6-8% reduction compared to default TF-Serving).
  • Scalability: Olympian exhibits comparable scalability to TF-Serving in terms of the number of concurrent DNNs it can support, limited by GPU memory in most cases, though for some DNNs it can be limited by operating system thread pool size.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Deep Neural Networks (DNNs)

Deep neural networks (DNNs) are a class of machine learning algorithms inspired by the structure and function of the human brain. They consist of multiple layers of interconnected "neurons" that process data in a hierarchical manner. Each layer learns to recognize different features of the input, passing its output to the next layer. DNNs have achieved state-of-the-art results in various complex tasks, including image recognition, natural language processing, and speech recognition. Their computational structure involves numerous matrix multiplications and other mathematical operations, making them highly suitable for parallel processing.

Graphics Processing Units (GPUs)

Graphics Processing Units (GPUs) are specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. However, their highly parallel architecture, consisting of thousands of cores (processing units), makes them exceptionally well-suited for a broad range of general-purpose computations, particularly those involving large-scale data parallelism. This characteristic makes GPUs ideal for accelerating DNN operations, significantly reducing the time required for both training (learning the model parameters) and inference (using the trained model to make predictions).

DNN Serving Systems

A DNN serving system is a piece of middleware software designed to deploy and manage DNN models for inference in production environments. It acts as an intermediary between client applications and the DNN models themselves. Key functions include:

  • Model Loading: Loading and managing multiple DNN models efficiently.
  • Request Handling: Receiving inference requests from clients.
  • Batching: Grouping multiple incoming requests into larger batches to improve GPU utilization (as processing a batch of inputs together is often more efficient than processing them one by one).
  • Execution: Orchestrating the execution of DNN models on available hardware (e.g., GPUs or CPUs).
  • Response Generation: Returning inference results to clients. Tensorflow-Serving (TF-Serving) is a prominent open-source example, built specifically for Tensorflow models.

Tensorflow

Tensorflow is an open-source software library for dataflow and differentiable programming across a range of tasks. It is primarily used for machine learning applications such as neural networks. In Tensorflow, computations are represented as data flow graphs, where:

  • Nodes: Represent mathematical operations (e.g., addition, matrix multiplication, convolution).
  • Edges: Represent the tensors (multi-dimensional arrays) that flow between operations and also encode dependencies between operations. Tensorflow's runtime system executes these graphs. Some nodes might run on a CPU, while others, particularly data-parallel operations, are offloaded to a GPU. A Tensorflow node that runs on a GPU typically invokes one or more GPU kernels.

GPU Kernels

A GPU kernel is a program or function that runs on a GPU. It represents an elemental data-parallel computation. Programmers define a kernel computation in terms of threads, which are grouped into thread blocks. These thread blocks are then assigned to stream multiprocessors (SMs) on the GPU. Further, thread blocks are subdivided into warps (groups of threads) that are scheduled concurrently. The GPU driver is responsible for managing the execution of these kernels, including queuing them and scheduling them on the available GPU hardware.

Batching

Batching in DNN inference refers to the practice of grouping multiple individual input data samples (e.g., images) into a single, larger input batch before feeding them to the DNN model for processing.

  • Purpose: Batching significantly improves GPU utilization because GPUs are highly efficient at processing large amounts of data in parallel. Processing a batch of 100 images together can be much faster than processing 100 images individually due to reduced overhead from kernel launches, data transfers between CPU and GPU, and better exploitation of GPU parallelism.
  • Mechanism in Tensorflow: Tensorflow implements batching by adding specific nodes to the DNN graph that decode inputs and arrange them into appropriate multi-dimensional matrices for subsequent processing.
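As a concrete illustration of batching, the short sketch below stacks individual inputs into one batch tensor before a single inference call. The image shape and batch size are illustrative assumptions, not values mandated by TF-Serving.

    import numpy as np

    # Hypothetical inputs: 100 RGB images, each 224x224 pixels (illustrative shapes).
    images = [np.random.rand(224, 224, 3).astype(np.float32) for _ in range(100)]

    # Batching: stack them into a single [batch, height, width, channels] tensor,
    # so the serving system issues one large GPU inference call instead of 100 small ones.
    batch = np.stack(images, axis=0)
    print(batch.shape)  # (100, 224, 224, 3)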

Scheduling and Time-Slicing

Scheduling is the process of deciding which tasks get access to resources (e.g., CPU, GPU) and for how long. Time-slicing is a common scheduling technique where a shared resource is allocated to different tasks for short, fixed durations (called quanta or time slices) in a round-robin or similar fashion. After a quantum expires, the current task is preempted and another task gets access to the resource. This creates an illusion of parallel execution and ensures fairness or responsiveness. In the context of GPUs, time-slicing aims to allow multiple DNNs to share the GPU by rapidly switching between them.

GPU Preemption

GPU preemption is the ability of a GPU or its driver to interrupt an ongoing GPU computation (a kernel or a group of kernels), save its state, and switch to another computation. This is analogous to CPU process preemption in operating systems.

  • Levels of Preemption:
    • Kernel-level preemption: Interrupting execution only after a GPU kernel completes.
    • Thread-block level preemption: Interrupting execution after a thread block completes.
    • Instruction-level preemption: Interrupting execution at almost any point, down to individual instructions.
  • Cost: While desirable for responsiveness and fairness, GPU preemption can be very expensive. Saving and restoring the state of a massively parallel GPU context (which includes many threads, registers, and memory states) incurs significant overhead, often hundreds of microseconds. This high overhead limits the practical frequency of preemption and thus the granularity of scheduling.

3.2. Previous Works

The paper contextualizes its work by referencing prior research in GPU multiplexing and scheduling, identifying gaps that Olympian aims to fill.

  • GPU Multiplexing via Kernel Slicing:

    • Several works ([2, 4, 19, 23, 31, 33]) have explored novel kernel slicing approaches to overcome the high cost of hardware preemption. These methods aim to break down GPU kernels into smaller, more manageable units that can be interleaved.
    • Limitation: These often incur significant preemption overheads. Furthermore, they frequently require programmers to explicitly specify slicing points, or demand rewriting/recompiling code against a new framework, reducing portability and ease of adoption for existing DNN models. Olympian distinguishes itself by working with pre-trained models and requiring no changes to application code or frameworks.
  • Hardware Extensions for Multiprogrammed GPU Workloads:

    • Prior work ([27]) proposed hardware extensions to facilitate multiprogrammed GPU workloads, including preemption mechanisms for GPU scheduling policies.
    • Limitation: Relying on hardware preemption, as discussed, typically results in high overhead. Olympian avoids this by using offline profiling for resource usage metering, which does not depend on specialized hardware preemption features.
  • Timegraph ([15]):

    • Timegraph is mentioned as a system that handles scheduling multiple tasks using priority and reservation policies to ensure responsiveness for high-priority tasks. It achieves this through modifications to the GPU device driver.
    • Limitation: Its reliance on device driver modifications implies that running Timegraph on different GPU devices would require additional implementation effort for each new device type. Olympian aims for GPU-independent operation.
  • GPU Virtualization:

    • A body of work ([24, 30, 32]) has explored full and para-virtualization of GPUs. This involves creating virtual GPUs that can be assigned to Virtual Machines (VMs), allowing multiple VMs to share a physical GPU.
    • Limitation: This line of work often suffers from poor performance due to high GPU context switching overheads associated with virtualization. Virtualization is generally considered too heavyweight for model-serving systems. Olympian provides lightweight time-slicing to achieve isolation objectives, which is more appropriate for a model-serving context.
  • GPU Call Interposition:

    • Other research ([7, 8, 10, 11, 17, 20, 21]) has explored GPU call interposition for purposes like remote execution or GPU multiplexing in virtualized environments. Some of these also investigate scheduling GPU calls from different VMs ([20]) or co-scheduling a VM with its associated GPU ([11]).
    • Limitation: While these approaches address GPU sharing, Olympian specifically focuses on model serving systems, uses the DNN as the fundamental unit of scheduling, and explicitly allocates GPU time based on offline profiling of DNN-specific characteristics.

3.3. Technological Evolution

The evolution of GPU usage for DNNs has been rapid:

  1. Early GPUs for Graphics: Initially designed for rendering graphics, GPUs possessed inherent parallelism suitable for certain scientific computations.

  2. GPGPU (General-Purpose GPU) Computing: Researchers and developers began leveraging GPUs for general-purpose parallel computing, leading to frameworks like CUDA (NVIDIA) and OpenCL.

  3. DNN Acceleration: The DNN revolution found GPUs to be the perfect accelerators due to their ability to perform massive matrix multiplications and other parallel operations efficiently. This led to significant performance gains in training and inference.

  4. DNN Serving Systems: As DNNs moved from research labs to production, the need arose for efficient deployment and management. DNN serving systems like TF-Serving emerged to manage model loading, batching, and concurrent inference requests.

  5. Challenges in Shared Environments: With GPUs becoming shared resources in cloud environments and serving systems, challenges like resource contention, unpredictable latency, and the need for service differentiation became prominent. Existing GPU scheduling techniques (e.g., low-level preemption, virtualization) often introduce high overheads or lack DNN-specific awareness.

    Olympian fits into this evolution by addressing the critical need for predictable and controllable GPU sharing within DNN serving systems. It moves beyond generic GPU scheduling by exploiting the unique characteristics of DNNs (predictable structure, node-level granularity) at the middleware level, where DNN context is available.

3.4. Differentiation Analysis

Olympian's core innovations and differences from related work stem from its choice of middleware-level scheduling and its reliance on offline profiling of DNN characteristics.

The comparison below contrasts related work (e.g., kernel slicing, hardware preemption, GPU virtualization) with Olympian along several dimensions:

  • Scheduling level. Related work: often operates at the GPU driver, hardware, or hypervisor level. Olympian: operates at the middleware level (TF-Serving). This is crucial because only TF-Serving understands the DNN graph and node abstractions; that high-level context enables scheduling decisions that are impossible at lower levels.
  • Unit of scheduling. Related work: GPU instructions, GPU kernels, thread blocks, or entire VMs. Olympian: the Tensorflow node. Olympian interleaves DNNs at Tensorflow node boundaries, a coarser granularity than GPU kernel or instruction preemption but still fine-grained enough for responsiveness; crucially, it is GPU-independent and aligns with DNN computation units (Figure 4 shows that most nodes have very small durations).
  • Preemption/switching. Related work: relies on hardware preemption (expensive; context switches take hundreds of μs) or requires programmer annotations for slicing. Olympian: cooperative co-scheduling of CPU thread gangs. It suspends all CPU threads of one DNN job and resumes those of another, avoiding hardware preemption and its GPU context-switching overhead. Isolation is not "perfect" (Figure 10), but the overflow is minimal because node durations are short.
  • Overhead. Related work: hardware preemption (high overhead), online profiling (21-29% overhead), virtualization (heavyweight). Olympian: low overhead (<2%), achieved through offline profiling and cooperative co-scheduling; the overhead is incurred mainly on the CPU for thread suspension and resumption.
  • Resource usage metering. Related work: typically not available at a fine-grained, DNN-aware level, or requires online instrumentation. Olympian: offline profiling. It leverages the predictability of DNN execution to build offline profiles of GPU resource usage (costs and durations), avoiding online profiling overhead (Figure 6) and letting Olympian accurately estimate when a quantum has elapsed. The cost accumulation threshold T_j = Q × (C_j / D_j) is a key innovation here.
  • Portability. Related work: often tied to specific GPU hardware, drivers, or frameworks. Olympian: highly portable. It inherits framework and device independence from Tensorflow and TF-Serving, abstracts GPU details, and can run models originally written in other frameworks (e.g., Caffe converted to Tensorflow); it is demonstrated to work on different GPU platforms without modification (Figure 21).
  • Scheduling objectives. Related work: focuses on responsiveness or general multiplexing and lacks DNN-specific QoS. Olympian: DNN-aware QoS, enabling fair sharing, weighted fair sharing, and priority scheduling for DNNs, and thus the predictable latency and service differentiation that cloud-based serving systems need.
  • Application changes. Related work: may require programmer intervention, code modification, or recompilation. Olympian: none; it works with pre-trained models and requires no modification to the DNN models or client applications.
  • GPU utilization. Related work: a primary goal, but sometimes at the cost of predictability or with high overheads. Olympian: maintains high utilization, sacrificing only about 6-8% relative to default TF-Serving while significantly improving predictability.
  • "Quantum" definition. Related work: often defined purely by elapsed wall-clock time (for CPUs) or by kernel counts. Olympian: GPU duration (cost-based). A quantum is defined not by elapsed time but by the total duration for which a DNN runs at least one Tensorflow node on the GPU (Figure 5), estimated via the cost accumulation rate, which makes the quantum meaningful in terms of GPU resource consumption. The paper shows that simple CPU-timer-based time-slicing fails to provide isolation (Figure 19).

4. Methodology

4.1. Principles

The core principle behind Olympian is to achieve predictable and differentiated GPU scheduling for concurrent Deep Neural Networks (DNNs) in a model serving system like Tensorflow-Serving (TF-Serving) by implementing middleware-level time-slicing. This strategy is built on two fundamental ideas:

  1. Leveraging DNN Predictability with Offline Profiling: DNNs exhibit predictable GPU resource usage for a given model and batch size. Olympian exploits this by performing offline profiling to characterize the GPU resource consumption of DNNs. This allows it to estimate GPU usage without incurring online monitoring overheads.

  2. Cooperative Co-scheduling of CPU and GPU Resources: Instead of relying on expensive hardware GPU preemption, Olympian implements cooperative co-scheduling. It allocates exclusive access to the GPU (and implicitly the associated CPU threads) to a single DNN job for a short duration (a quantum). At the end of this quantum, it cooperatively suspends the current job's CPU threads and resumes another job's CPU threads, thereby switching GPU access.

    By combining these principles, Olympian gains fine-grained control over GPU allocation while maintaining low overhead and high portability.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. The TF-Serving Execution Model and its Drawbacks

TF-Serving is a model serving system that manages DNN inference tasks. Internally, when a client submits a job (an inference request for a specific DNN model with input data):

  1. A separate CPU thread from a thread pool is assigned to each job.

  2. This thread traverses the DNN's data flow graph in a breadth-first search (BFS) manner, starting from the root node.

  3. Nodes are inserted into a FIFO queue.

  4. The session's thread dequeues a node, executes it, and then examines its children.

  5. If a child node needs to be executed asynchronously (typically GPU-intensive nodes), another CPU thread from the thread pool is fetched to manage its execution. If no threads are available, execution may be delayed. Other child nodes are re-inserted into the queue.

    This process is summarized in Algorithm 1 from the paper:

Algorithm 1: TF-Serving's processing loop
 1: function Session::Run(modelName, input)
 2:   srInfo = new SessRunInfo(sessId, runId)
 3:   root = getRoot(modelName, input)
 4:   Process(root, srInfo)
 5: function Process(root, srInfo)
 6:   bfsQueue.push(root)
 7:   while !bfsQueue.empty() do
 8:     curNode = bfsQueue.pop()
 9:     compute(curNode)
10:     for childNode in curNode.children() do
11:       if !childNode.kernelIsAsync() then
12:         bfsQueue.push(childNode)
13:       else
14:         thread = threadPool.fetch()
15:         thread(Process(childNode, srInfo))

In this model, a single DNN job is executed by a collection of CPU threads, which the authors refer to as a "gang." Each thread in this gang is generally responsible for executing a GPU node. This design allows TF-Serving to queue multiple GPU kernels concurrently. When multiple DNN instances run concurrently, each can queue its kernels to the GPU, leading to interleaving of execution by default.

The main drawback of TF-Serving's execution model: The GPU driver, which ultimately schedules the GPU kernels, lacks knowledge about which kernels belong to which DNN job. It cannot differentiate between kernels from different DNNs invoked by different clients. This leads to:

  • Unpredictable Finish Times: DNN jobs with identical resource requirements can finish at significantly different times (e.g., Figure 3 shows finish times varying by up to 1.7x for 10 concurrent Inception clients). This variability makes it hard to engineer latency-sensitive applications.

  • Lack of Control: Without DNN-aware scheduling, TF-Serving cannot implement priority scheduling or weighted sharing, preventing service differentiation.

    The following are the results from Figure 3 of the original paper:

    Figure 3: Finish time for ten concurrent clients in TF-Serving, where each client has 10 batches of input data, for two different runs. (Bar chart of per-client finish time in seconds for Run-1 and Run-2; finish times vary noticeably across clients and between the two runs.)

4.2.2. Olympian's Design Choices and Rationale

To address TF-Serving's limitations, Olympian introduces a time-slicing approach at the middleware level, based on several key design choices:

4.2.2.1. The Unit of Interleaving

Olympian interleaves DNNs at the granularity of a Tensorflow node.

  • Rationale:
    • Low-level preemption (GPU instruction/kernel): Requires hardware preemption, which is known to be expensive (hundreds of μs for a context switch). While instruction-level preemption exists in newer GPUs, its overhead is too high for fine-grained control.

    • Tensorflow node vs. GPU kernel: Tensorflow nodes often invoke one or a small number of GPU kernels, so the difference in granularity is minimal.

    • Simplicity and Portability: Switching at the Tensorflow node level can be implemented in a GPU-independent manner within TF-Serving (after a node finishes execution in Algorithm 1, Line 9).

    • Small Node Durations: Most nodes in modern DNNs have very small execution times. As illustrated in Figure 4, for Inception, over 80% of nodes take less than 20 μs, and over 90% take less than 1 ms. This allows for fine-grained interleaving without significantly inflating the quantum.

    • Data-unit level interleaving: Not considered because processing a single image (a data unit) can take tens of milliseconds, which is too coarse for responsive scheduling.

      The following are the results from Figure 4 of the original paper:

      Figure 4: Node duration for one Inception job for two different batch sizes. (CDF of node duration in microseconds, with one curve per batch size.)

4.2.2.2. Defining the Quantum

Unlike traditional CPU scheduling where a quantum is simply elapsed wall-clock time, Olympian defines a quantum as the total duration of time that a DNN runs at least one Tensorflow node on the GPU.

  • Rationale:
    • A fixed CPU time quantum is insufficient because DNNs execute some nodes on the CPU and others on the GPU. A DNN's GPU usage can vary widely within a fixed CPU time slice.

    • Defining the quantum based on GPU duration (Figure 5) provides a more meaningful and controllable unit for allocating GPU resources, allowing Olympian to allocate GPUs to each DNN for predictable durations of actual GPU usage.

    • Other definitions, like actual GPU resources consumed (e.g., number of cores), are not exposed by current GPUs or would require expensive online profiling.

      The following are the results from Figure 5 of the original paper:

      Figure 5: Definition of GPU duration. In this example, GPU duration is given by t_1 + t_2 + t_3. (Timeline diagram of Nodes 1-4 showing the intervals t_1, t_2, and t_3 during which at least one node of the job is running on the GPU.)
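To make the definition above concrete, the following minimal sketch computes GPU duration as the length of the union of per-node GPU execution intervals; the timestamps are hypothetical and purely for illustration.

    def gpu_duration(intervals):
        # Total time during which at least one node is running on the GPU,
        # i.e. the length of the union of the (start, end) node intervals.
        total = 0.0
        cur_start, cur_end = None, None
        for start, end in sorted(intervals):
            if cur_end is None or start > cur_end:   # gap: close the previous busy period
                if cur_end is not None:
                    total += cur_end - cur_start
                cur_start, cur_end = start, end
            else:                                     # overlap: extend the current busy period
                cur_end = max(cur_end, end)
        if cur_end is not None:
            total += cur_end - cur_start
        return total

    # Hypothetical node intervals in microseconds; overlapping nodes are counted once,
    # mirroring the t_1 + t_2 + t_3 decomposition in Figure 5.
    print(gpu_duration([(0, 300), (100, 450), (900, 1200), (1500, 1700)]))  # 450 + 300 + 200 = 950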

4.2.2.3. How to Estimate Quantum Expiration

To determine when a quantum expires based on GPU duration, Olympian relies on offline profiling rather than online instrumentation.

  • Rationale for Offline Profiling:
    • Tensorflow provides a cost profiler (using CUDA Profiling Tools Interface) that can track DNN execution details. However, running this cost profiler online incurs high overhead, inflating execution times by 21% to 29% (Figure 6). This is unacceptable for a production serving system.

    • DNN execution is largely predictable for a given model and batch size. Olympian exploits this by profiling GPU resource usage offline (when the GPU is idle). This profile is computed once per model (for a few common batch sizes, using linear regression for others) and then reused. This adds no overhead to inference tasks.

      The following are the results from Figure 6 of the original paper:

      Figure 6: Overhead of Tensorflow's cost profiler, for the 7 DNNs shown in Figure 2. (Bar chart of completion time in seconds for each DNN, e.g. Inception, GoogLeNet, AlexNet, with and without the online cost profiler enabled.)

4.2.2.4. How to Switch Between Jobs

Olympian employs cooperative co-scheduling of CPU thread gangs.

  • Mechanism: At the end of a quantum, Olympian suspends all CPU threads belonging to the current DNN job's gang and resumes all CPU threads of the next job's gang.
  • Rationale:
    • Each TF-Serving job can involve a thousand or more CPU threads. Suspending and resuming only GPU-related threads is complex.

    • This cooperative approach avoids preempting CPU threads (and their corresponding GPU operations), a design choice rejected due to high overhead.

    • Implication (Partial Isolation): Not all threads of a job are suspended simultaneously. If a CPU thread is managing a GPU kernel that has already been launched, that kernel will continue to execute on the GPU even after the scheduling decision and thread suspension. This means Olympian cannot guarantee perfect exclusive GPU access at all times. However, since most nodes execute for very small durations (Figure 4), this "overflow" is minimal and its impact is managed by charging the cost to the original job (Figure 10).

      The following are the results from Figure 10 of the original paper:

      Figure 10: Not all threads in a gang are suspended together. (Timeline diagram of several threads around a scheduling decision at time t_1, showing the switch and the brief concurrent execution that can result.)

4.2.3. Olympian Architecture

Olympian consists of two main components: a Profiler and a Scheduler.

The following are the results from Figure 7 of the original paper:

Figure 7: Olympian's profiler and scheduler. (Architecture diagram: clients submit jobs to TF-Serving, which runs above the operating system and GPU; the administrator configures the scheduling policy used by the Scheduler, and the Profiler supplies offline profiles for new models.)

4.2.3.1. The Profiler

The Profiler is responsible for offline profiling DNNs to determine their GPU resource usage characteristics.

  • Tensorflow Cost Model API: Tensorflow provides an API to obtain a cost model for a data flow graph. This cost model includes the time taken to execute each node (referred to as node cost), which is an approximate measure of GPU resources needed. While the sum of node costs isn't directly the total GPU execution time (due to concurrency), the individual node costs and the total GPU duration for a DNN are stable across runs when it has exclusive GPU access.

  • Cost Accumulation Rate: This is a key concept for Olympian to estimate quantum completion. Let C_j be the sum of all node costs in DNN j, obtained from Tensorflow's cost model API. Let D_j be the total duration of DNN j on the GPU, defined as the total time that at least one node belonging to j runs on the GPU (as illustrated in Figure 5). D_j is measured by instrumenting Tensorflow offline. The average cost accumulation rate for DNN j is then C_j / D_j.

    If the desired quantum duration is Q, then a job is considered to have completed one quantum when it has accumulated a cost equal to a cost accumulation threshold T_j, given by:

    T_j = Q × (C_j / D_j)

    where:

    • T_j: The cost accumulation threshold for DNN j, indicating when one quantum for DNN j has expired.
    • Q: The desired quantum duration (in time units, e.g., microseconds).
    • C_j: The sum of all node costs in DNN j, as estimated by Tensorflow's cost model API (in arbitrary cost units).
    • D_j: The total GPU duration for DNN j (in time units, e.g., microseconds), measured offline when DNN j has exclusive access to the GPU. The ratio C_j / D_j effectively represents the cost accumulated per unit of GPU time. (A short sketch of this threshold computation appears at the end of this subsection.)
  • Overhead-Q Curves: To determine the optimal Q, Olympian profiles the overhead as a function of Q.

    1. For a given DNN j, the Profiler first computes D_j and C_j.
    2. It then runs two instances of DNN j on unmodified TF-Serving (case a) and two instances on Olympian (case b) for various values of Q.
    3. Overhead is calculated as the difference in finish times between case (b) and case (a), divided by the finish time in case (a).
    4. These experiments generate Overhead-Q curves (Figure 8), showing how overhead decreases as Q increases. Overhead-Q curves are precomputed for common batch sizes and can be estimated for others using linear regression.

The following are the results from Figure 8 of the original paper:

Figure 8: The Overhead-Q curves for the 7 DNNs shown in Figure 2. (Overhead as a function of the quantum Q for each DNN; overhead decreases as Q increases.)

  • Determining Q: The final step is to select an appropriate Q. This is driven by an operator-specified overhead tolerance (e.g., 2.5%). Olympian finds the Q value on the Overhead-Q curve that corresponds to this tolerance. If multiple DNNs are involved, it takes the largest Q among them to ensure the overhead never exceeds the tolerance for any DNN. Once Q is determined, the Profiler computes T_j for each DNN.

  • Intuition for Predictability: The approach relies on the observed stability of total node cost (C_j) and total GPU duration (D_j) for a given DNN and batch size across different executions when the DNN has exclusive GPU access. This stability arises because individual nodes have deterministic executions, and the DNN's duration is determined by its data flow graph structure. Olympian aims to ensure each job gets exclusive GPU access during its quantum, effectively spreading out its execution timeline (Figure 9).

    The following are the results from Figure 9 of the original paper:

    Figure 9: Time-slicing, to a first approximation, simply spreads out the execution of a DNN. (Timeline diagram of Nodes 1-4: the left side shows their concurrent execution, the right side the same execution spread out in time, with possible switching points at t_1 and t_2.)
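The following is a minimal sketch, under assumed profiled values, of how an offline profile is turned into the per-model threshold T_j = Q × (C_j / D_j) described above. The numbers and model names are illustrative, not values reported in the paper.

    def cost_threshold(quantum_us, total_cost, gpu_duration_us):
        # T_j = Q * C_j / D_j: the cost a job must accumulate before it has
        # consumed roughly quantum_us microseconds of GPU time.
        return quantum_us * (total_cost / gpu_duration_us)

    # Hypothetical offline profiles: model -> (C_j in cost units, D_j in microseconds).
    profiles = {
        "inception_batch100": (1_200_000, 810_000),
        "resnet152_batch100": (1_000_000, 800_000),
    }

    Q = 1190  # quantum in microseconds, e.g. read off the Overhead-Q curve for a 2.5% tolerance
    thresholds = {m: cost_threshold(Q, C, D) for m, (C, D) in profiles.items()}
    print(thresholds)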

4.2.3.2. The Scheduler

Olympian's scheduler is integrated into TF-Serving's processing loop with minimal modifications.

The modified scheduling logic is shown in Algorithm 2:

Algorithm 2: Olympian scheduling logic
 1: function Session::Run(modelName, input)
 2:   srInfo = new SessRunInfo(sessId, runId)
 3:   root = getRoot(modelName, input)
 4:   scheduler.register(srInfo)
 5:   cumulatedCost = 0
 6:   Process(root, srInfo, cumulatedCost)
 7:   scheduler.deregister(srInfo)
 8: function Process(root, srInfo, cumulatedCost)
 9:   bfsQueue.push(root)
10:   while !bfsQueue.empty() do
11:     curNode = bfsQueue.pop()
12:     scheduler.yield(srInfo)
13:     compute(curNode)
14:     if curNode.isGPUNode() then
15:       cumulatedCost += curNode.cost()
16:       if cumulatedCost ≥ threshold then
17:         cumulatedCost -= threshold
18:         scheduler.updateTokenInfo(srInfo)
19:     for childNode in curNode.children() do
20:       if !childNode.kernelIsAsync() then
21:         bfsQueue.push(childNode)
22:       else
23:         thread = threadPool.fetch()
24:         thread(Process(childNode, srInfo, cumulatedCost))

  • Job Registration and Cost Tracking (Lines 2-6, 8, 15):
    • Each job's session (srInfo) is registered with the scheduler (Line 4).
    • A cumulatedCost variable (Line 5), shared by all threads in the job's gang, tracks the cost consumed by the job.
    • When a GPU node completes execution (compute(curNode) in Line 13), the CPU thread managing it updates cumulatedCost by adding curNode.cost() (Line 15).
  • Quantum Expiration and Scheduling Decision (Lines 16-18):
    • If cumulatedCost exceeds the threshold (T_j) (Line 16), it signifies that the job has completed its quantum.
    • cumulatedCost is decremented by threshold (Line 17) to account for the just-completed quantum.
    • The scheduler.updateTokenInfo(srInfo) function is called (Line 18) to trigger a scheduling decision. This function applies the active scheduling policy (e.g., round-robin) to determine which job gets exclusive GPU access next and updates an internal "token" accordingly.
  • Cooperative Yielding (Line 12):
    • Before compute(curNode) (Line 13), scheduler.yield(srInfo) is called (Line 12).
    • During this call, if the current job no longer holds the "token" (i.e., it doesn't have exclusive GPU access according to the scheduler's decision), the CPU thread will suspend itself.
    • The scheduler uses condition variables: CPU threads of a job wait on their condition variable if their job loses GPU access. When a job is granted GPU access, its condition variable is notified, waking up its CPU threads.
  • Thread Management (Lines 19-24):
    • Similar to TF-Serving, child nodes are processed: if not async (e.g., CPU nodes), they are pushed to bfsQueue; if async (e.g., GPU nodes), a new CPU thread is fetched to Process them.
  • Job Deregistration (Line 7): When a job completes, it deregisters from the scheduler.

Cooperative Nature and "Overflow" (Revisited): As noted in Figure 10 and discussed in the paper, Olympian's cooperative co-scheduling means that not all CPU threads of a job are suspended simultaneously. If a CPU thread has already launched a GPU kernel (compute(curNode) in Algorithm 2, Line 13) before scheduler.yield(srInfo) (Line 12) leads to its suspension, that GPU kernel will continue running on the GPU even after the scheduling decision has shifted GPU access to another job. This can lead to a slight "overflow" where kernels from the previous job briefly run concurrently with kernels from the new job. However, given that most Tensorflow nodes have very short durations, this overlap is minimal. Olympian accounts for this by charging any cost accumulated by these "overflow" nodes to the quantum of the original job.
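Below is a minimal sketch, in Python rather than TF-Serving's C++, of the token-and-condition-variable mechanism described above: threads call yield_() before executing a node and block while their job does not hold the token, and update_token() implements a simple round-robin hand-off once a job's accumulated cost crosses its threshold. Names and structure are illustrative, not Olympian's actual API.

    import threading

    class Scheduler:
        def __init__(self, job_ids):
            self.lock = threading.Lock()
            self.cond = threading.Condition(self.lock)
            self.jobs = list(job_ids)      # registered jobs, in round-robin order
            self.token = self.jobs[0]      # job currently granted exclusive GPU access

        def yield_(self, job_id):
            # Called by every thread in a job's gang before it executes a node;
            # the thread suspends itself while its job does not hold the token.
            with self.cond:
                while self.token != job_id:
                    self.cond.wait()

        def update_token(self, job_id):
            # Called when job_id has accumulated one quantum's worth of cost (T_j):
            # pass the token to the next registered job and wake its waiting threads.
            with self.cond:
                nxt = (self.jobs.index(job_id) + 1) % len(self.jobs)
                self.token = self.jobs[nxt]
                self.cond.notify_all()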

4.2.4. Scheduling Policies

Olympian provides a framework for implementing various scheduling policies on top of its time-slicing mechanism. The paper mentions three implemented policies:

  • Fair Sharing: Jobs are scheduled in a round-robin fashion, with each job receiving one quantum in its turn.
  • Weighted Fair Sharing: A variant of fair sharing where jobs receive a number of quanta proportional to an assigned integer weight during their turn.
  • Priority Scheduling: At any given time, the job with the highest priority receives the next quantum.
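A minimal sketch of how the three policies listed above could be expressed as "who gets the next quantum, and how many" rules; the function names and data structures are illustrative, not Olympian's implementation.

    def next_job_fair(jobs, current_index):
        # Fair sharing: plain round-robin, one quantum per job per turn.
        return (current_index + 1) % len(jobs)

    def quanta_for_turn(weights, job_id):
        # Weighted fair sharing: a job with integer weight w receives w quanta in its turn.
        return weights[job_id]

    def next_job_priority(priorities, runnable_jobs):
        # Priority scheduling: the highest-priority runnable job always gets the next quantum.
        return max(runnable_jobs, key=lambda job_id: priorities[job_id])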

4.3. GPU Multiplexing Strategy

The paper notes that TF-Serving can efficiently multiplex GPU kernels both spatially (assigning different kernels' thread blocks to different stream multiprocessors) and temporally (switching between kernels within the same stream multiprocessor). This often leads to high GPU utilization. However, for DNNs with batching, which perform pixel-level data parallel operations on large batches of images, the total number of pixels can exceed the GPU's parallelism. This leaves little room for spatial multiplexing between different client requests, as exemplified by two concurrent Inception jobs taking twice as long as one, indicating lack of spatial multiplexing.

For this reason, Olympian consciously chooses to focus only on temporal multiplexing. It aims to give one DNN exclusive temporal access to the GPU for a quantum, rather than trying to spatially interleave kernels from different DNNs, which is less effective for large batch sizes and hinders predictability.

5. Experimental Setup

The evaluation of Olympian was conducted using an implementation built on TF-Serving (version 1.2). The Olympian codebase, encompassing the profiler, scheduler, and scheduling policies, consists of 5234 lines of code.

5.1. Datasets

The paper uses several well-known Deep Neural Network (DNN) models for image classification. These models serve as the "datasets" in the sense that their computational graphs and resource demands are the subject of the scheduling experiments. The DNN models used are:

  • Inception-v4 ([25], abbreviated as Inception)

  • ResNet-152 ([13], abbreviated as ResNet)

  • GoogLeNet ([26])

  • AlexNet ([16])

  • VGG ([22])

  • ResNet-50 ([13])

  • ResNet-101 ([13])

    These models vary in terms of their complexity, number of nodes, and GPU usage characteristics, making them suitable for evaluating Olympian under diverse workloads. For instance, Table 2 provides details on some of these models with varying batch sizes:

The following are the results from Table 2 of the original paper:

Model        Batch Size   Nodes   GPU Nodes   Runtime (s)
Inception    150          15599   13309       0.81
GoogLeNet    200          18980   15948       1.09
AlexNet      256          23774   19902       1.13
VGG          120          11297   9965        0.83
ResNet-50    144          14472   12280       0.79
ResNet-101   128          14034   12082       0.85
ResNet-152   100          12495   10963       0.80
  • Why these datasets (models) were chosen: They represent a range of large, computationally intensive DNNs commonly used for image classification, allowing for a robust evaluation of Olympian's ability to manage diverse GPU workloads. Many of these models (all except Inception) were originally developed on Caffe and converted to Tensorflow, demonstrating Olympian's portability across model origins.
  • Data samples: The models are typically used for tasks like image classification, meaning input data would be images. For instance, an Inception model would take an image as input and output a classification label. No specific example image is provided in the text.

5.2. Evaluation Metrics

The paper uses several metrics to evaluate Olympian's performance:

5.2.1. Overhead

  • Conceptual Definition: Overhead measures the additional time or resource consumption introduced by Olympian's scheduling mechanism compared to the baseline TF-Serving. A lower overhead indicates a more efficient scheduling system.
  • Mathematical Formula: The overhead is defined as the difference between the finish times of jobs run on Olympian and on TF-Serving, divided by the finish time on TF-Serving:

    Overhead = (FinishTime_Olympian − FinishTime_TF-Serving) / FinishTime_TF-Serving

  • Symbol Explanation:
    • Overhead: The measured overhead, expressed as a fraction or percentage.
    • FinishTime_Olympian: The time taken for jobs to complete when scheduled by Olympian.
    • FinishTime_TF-Serving: The time taken for jobs to complete when scheduled by the default TF-Serving. Finish times are measured from when a client sends a request until it receives a response.

5.2.2. GPU Duration (per quantum)

  • Conceptual Definition: This metric quantifies the actual amount of time a DNN job spends actively utilizing the GPU during a single scheduling quantum. It is a direct measure of how effectively Olympian can enforce its quantum definition based on GPU usage.
  • Mathematical Formula: The paper defines GPU duration as the total time that at least one Tensorflow node belonging to a DNN runs on the GPU (as illustrated in Figure 5). This is measured using offline instrumentation of Tensorflow. For a single quantum, it is the duration of GPU activity between two scheduling decisions for that job:

    D_q = Σ_{k=1}^{N_q} d_k

    where d_k is the GPU duration of node k within the quantum and N_q is the number of GPU nodes executed in that quantum. The average GPU duration per quantum is obtained by averaging D_q over many quanta during a job's execution.

  • Symbol Explanation:
    • D_q: The GPU duration for a single quantum.
    • d_k: The GPU duration of an individual Tensorflow node k.
    • N_q: The number of GPU nodes executed within a specific quantum. The average GPU duration per quantum is crucial for validating whether the actual quanta achieved by Olympian match the predicted Q from offline profiling.

5.2.3. Job Finish Times

  • Conceptual Definition: This metric measures the total elapsed time from when a client sends an inference request until it receives the complete response. It directly reflects the latency experienced by the user. Predictability and fairness in job finish times are primary goals of Olympian.
  • Measurement: Measured as wall-clock time from request submission to response reception.

5.2.4. GPU Utilization

  • Conceptual Definition: GPU utilization measures the percentage of time the GPU's computing units are actively performing computations. Higher utilization generally indicates more efficient use of GPU resources. Olympian aims to maintain high utilization while introducing time-slicing.
  • Measurement: Measured using the NVIDIA System Management Interface (NVIDIA SMI), which periodically samples GPU utilization. The average utilization is computed over the duration of the experiment.

5.2.5. Scalability

  • Conceptual Definition: Scalability refers to the maximum number of concurrent DNN clients or jobs that the serving system can support before encountering resource limitations (e.g., memory, CPU threads, GPU capacity).
  • Measurement: Determined by observing the maximum number of concurrent clients that can be supported by Olympian compared to TF-Serving.

5.3. Baselines

The primary baseline model used for comparison is the default, unmodified Tensorflow-Serving (TF-Serving) (version 1.2).

  • Why it's representative: TF-Serving is a widely adopted and representative DNN model serving system. Comparing Olympian against TF-Serving directly highlights the improvements Olympian brings in terms of predictability, fairness, and differentiated services, which are inherently lacking in TF-Serving's default GPU scheduling behavior. TF-Serving relies on the underlying GPU driver for kernel scheduling, which, as the paper argues, is unaware of DNN-specific job boundaries.

5.4. Hardware and Workload Configuration

  • Primary Hardware:
    • CPU: Intel i7-8700 CPU
    • RAM: 32 GB
    • Operating System: Ubuntu 16.04 on Linux 4.4.0-112-generic
    • GPU: GeForce GTX 1080 Ti
  • Secondary Hardware (for portability validation):
    • CPU: Intel Xeon E5-2603 v4 CPU
    • RAM: 16 GB
    • Operating System: Ubuntu 14.04 on Linux 4.4.0-72-generic
    • GPU: NVIDIA Titan X
  • Default Workload Configuration (unless specified):
    • Clients: Multiple concurrent clients.
    • Input Batches: Each client submits 10 batches sequentially.
    • Batch Size: 100.
    • DNN Models: Inception-v4 or ResNet-152. This configuration is chosen to stress concurrent execution and simplify result interpretation under a uniform client load.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Demonstrating GPU Isolation

6.1.1.1. Homogeneous Workload

Olympian aims to provide strong GPU isolation, ensuring predictable execution times. When 10 concurrent Inception clients (with identical resource demands) are run:

  • Olympian with Fair Scheduling: Achieves nearly identical finish times for all jobs, completing between 48 and 50 seconds. This demonstrates its ability to enforce fairness and predictability.

  • Default TF-Serving: Shows significant variability in finish times, ranging from 42 to 50 seconds, highlighting its lack of GPU isolation and predictability.

    The following are the results from Figure 11 of the original paper:

    Figure 11: Fair-sharing: Finish times on a homogeneous workload. (Bar chart of per-client finish times in seconds under TF-Serving and under Olympian with fair sharing.)

The granular details of Olympian's operation are revealed by examining the duration of successive scheduling intervals (quanta). As shown in Figure 12:

  • The average scheduling interval duration is 1.8 ms. This demonstrates Olympian's capability for fine-timescale interleaving.

  • Individual scheduling intervals vary widely in duration. This is expected because quantum completion is based on cost accumulation (a job finishes a quantum when it accumulates a certain cost), and DNNs do not accumulate cost at a perfectly even rate throughout their execution. At different stages, a DNN might execute more or fewer GPU nodes concurrently.

  • The overhead tolerance chosen in the profiling phase for this experiment was 2.5%, consistent with the low overhead actually observed. Olympian's fair scheduler effectively provides fair GPU access when averaged over longer timescales (tens or hundreds of intervals).

    The following are the results from Figure 12 of the original paper:

    Figure 12: Duration of scheduling intervals, with an average of 1.8 ms. (Plot of the duration of each successive scheduling interval; individual intervals vary considerably around the average.)

6.1.1.2. Heterogeneous Workload

To evaluate Olympian's performance with diverse workloads, an experiment was conducted with 10 concurrent clients: 5 Inception jobs and 5 ResNet jobs.

  • Finish Times (Figure 13): Inception jobs finish at similar times, and ResNet jobs finish at similar times, but the finish times between the two model types are slightly different. This difference arises because models like Inception and ResNet have different total execution times even for the same batch size.
  • Equalizing Workloads: The authors attempted to equalize total execution times by choosing an Inception batch size of 150 (estimated to have a total run time similar to ResNet with a batch size of 100). Even then, finish times were not identical, because Olympian fair-shares the GPU, not the CPU, and DNNs can have varying proportions of CPU and GPU work.
  • Verification of Fair GPU Sharing: To confirm Olympian's intended function, the average GPU duration (total time a job's nodes run on GPU) per quantum was measured while all jobs were active (Figure 14).
    • Each job received a nearly identical share of GPU time. The average quantum duration ranged from 1084 μs to 1257 μs, with a standard deviation of 4.9% to 10.1%.

    • This closely matches the Q value (1190 μs) predicted by the offline profiler using the Overhead-Q curves (Figure 8) for these models. This validates the accuracy of Olympian's profiling and its ability to enforce predicted quanta.

      The following are the results from Figure 13 of the original paper:

      Figure 13: Finish times for two different heterogeneous workloads. The first 5 clients run Inception and the rest ResNet. (Bar chart of per-client finish times in seconds for the Inception-batch-100 / ResNet152-batch-100 workload and the Inception-batch-150 / ResNet152-batch-100 workload.)

The following are the results from Figure 14 of the original paper:

Figure 14: Average GPU duration per quantum for heterogeneous workload, clients 0-4 run Inception jobs and 5-9 run ResNet jobs. (Bar chart with error bars of the average GPU duration per quantum, in microseconds, for each client; all clients receive roughly the same share, in the 1000-1400 μs range.)

  • Impact of "Overflow" (Figure 15): The variance in actual quanta received can be attributed to Olympian's cooperative co-scheduling. When Olympian switches to a new job, some GPU nodes from the old job might still be executing. These nodes (typically 2-3 per context switch) interfere with the new job's nodes. Their cost is still charged to the old job's cumulated cost. This explains why some jobs might have a slightly higher average quantum (if the overflowed cost is significant) or lower quantum (if the next quantum is reduced because of accumulated overflow cost from a previous turn).

    The following are the results from Figure 15 of the original paper:

    Figure 15: Inflated GPU duration $(t_1 + t_2 < t_3)$ when a node from the previous job finishes after that job has been switched out. The schematic's upper portion shows per-node execution times $t_1$ and $t_2$; the lower portion shows an additional node (Node_x) with execution time $t_3$, illustrating how a lingering node inflates the measured GPU duration.

  • Observed Overhead: For this heterogeneous workload, the actual observed overhead was 2.4%, very close to the target overhead of 2.5% used for Q determination. This further confirms the profiler's accuracy.
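
As a rough illustration of the accounting described above (field and function names are hypothetical, not Olympian's code), the sketch below keeps a per-job ledger: overflowed node cost stays charged to the job that launched those nodes, and that job's next quantum is shrunk by the overflow it accumulated on its previous turn.

```python
# Illustrative per-job accounting for overflow under cooperative co-scheduling.

from dataclasses import dataclass

@dataclass
class JobAccount:
    charged_ns: int = 0    # total profiled GPU cost charged to this job
    overflow_ns: int = 0   # cost that ran past the end of its last quantum

def next_quantum(acct: JobAccount, base_quantum_ns: int) -> int:
    """Shrink the next quantum by overflow carried over from the last turn."""
    q = max(0, base_quantum_ns - acct.overflow_ns)
    acct.overflow_ns = 0
    return q

def charge_quantum(acct: JobAccount, dispatched_cost_ns: int, quantum_ns: int) -> None:
    """Charge everything this job dispatched, including nodes that finish
    after the switch (Figure 15); the excess becomes next turn's overflow."""
    acct.charged_ns += dispatched_cost_ns
    acct.overflow_ns = max(0, dispatched_cost_ns - quantum_ns)

# Toy usage with a 1.19 ms base quantum:
acct = JobAccount()
q = next_quantum(acct, 1_190_000)      # full quantum on the first turn
charge_quantum(acct, 1_300_000, q)     # job overshot by 110_000 ns
print(next_quantum(acct, 1_190_000))   # 1_080_000 ns on the next turn
```

This bookkeeping mirrors why some clients in Figure 14 average slightly above Q (significant overflow in their quanta) and others slightly below (next quanta reduced by prior overflow).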

6.1.1.3. Complex Workload

A more complex scenario involved 14 clients running 7 different DNNs (Inception, GoogLeNet, AlexNet, VGG, ResNet-50/101/152), each with a different batch size. As shown in Table 2, these DNNs vary significantly in terms of GPU/CPU node counts and runtime.

  • Average GPU Duration (Figure 16): Even under this diverse workload, all DNNs achieved comparable average GPU durations. The Q value chosen by the profiler was 1620 µs. The observed average GPU durations for clients ranged from 1438 µs to 1662 µs, with standard deviations between 4.1% and 12.0%. This demonstrates Olympian's effectiveness in fair GPU sharing across highly varied tasks.

  • Overhead Match: The predicted overhead for the chosen Q was 2%, and the actual observed overhead was 1.8%, again confirming the profiler's predictive power.

    The following are the results from Figure 16 of the original paper:

    Figure 16: Average GPU duration per quantum for seven different DNNs (e.g. Inception, GoogLeNet, AlexNet, VGG, ResNet50/101/152) with different batch sizes. The bar chart shows average GPU duration (µs) per client under Olympian; most bars cluster around 1500 µs, with error bars indicating standard deviation.

6.1.2. Other Scheduling Policies

Olympian's flexibility allows for implementing various scheduling policies:

6.1.2.1. Weighted Fair Sharing

  • Experiment: 10 Inception clients were run. The first 5 clients were assigned a weight of 2, and the remaining 5 clients a weight of 1.
  • Results (Figure 17): The first set of clients (weight 2) finished between 36 and 38 seconds, while the second set (weight 1) finished around 50 seconds.
  • Validation: This outcome aligns with theoretical expectations. For a workload with equal numbers of jobs of weight $k$ and weight 1, the ratio of finish times is expected to be $\frac{k+1}{2k}$ (a short derivation sketch follows this list).
    • For $k=2$ (weights 2 and 1), the ratio is $\frac{2+1}{2 \times 2} = \frac{3}{4} = 0.75$. The observed finish-time ratio (approx. 37 s / 50 s = 0.74) closely matches this.
    • For $k=10$ (a 10:1 weight ratio), the observed ratio similarly matched the expected 55%. This confirms Olympian's ability to provide weighted fair sharing, a capability not available in default TF-Serving.
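
Here is a short derivation sketch of that ratio, assuming $N$ clients of weight $k$ and $N$ clients of weight 1 (as in the 5-versus-5 experiment), each with the same total GPU work $W$, under idealized fluid sharing:

```latex
% Total weight is N(k+1), so each weight-k client gets a k/(N(k+1)) share
% of the GPU and finishes at
T_{\mathrm{high}} = \frac{W\,N(k+1)}{k}
% By then each weight-1 client has completed W/k of its work; the N
% remaining clients split the GPU evenly for the rest and finish at
T_{\mathrm{low}} = T_{\mathrm{high}} + N\,\frac{W(k-1)}{k} = 2NW
% Hence the finish-time ratio is
\frac{T_{\mathrm{high}}}{T_{\mathrm{low}}} = \frac{k+1}{2k},
\qquad k=2 \Rightarrow 0.75, \qquad k=10 \Rightarrow 0.55
```

The observed ratios (0.74 for 2:1 and roughly 0.55 for 10:1) track these predictions closely.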

The following are the results from Figure 17 of the original paper:

Figure 17: Finish times on a homogeneous workload with weighted fair sharing, for two different weight assignments. The bar chart plots finish time (seconds) per client id for three configurations: TF-Serving, Olympian with 2:1 weighted fair sharing, and Olympian with 10:1 weighted fair sharing.

6.1.2.2. Priority Scheduling

  • Experiment 1 (Strict Priority): 10 Inception clients with strictly decreasing priorities (client 0 highest, client 9 lowest).
  • Results (Figure 18, 10-level priority): Clients are effectively serialized, with higher priority jobs completing before lower priority ones can begin significant execution.
  • Experiment 2 (Two-Level Priority): The first 5 clients have the same higher priority, and the next 5 have a lower priority.
  • Results (Figure 18, 2-level priority): The first five clients fair-share the GPU among themselves and finish at approximately 25 seconds. After they complete, the lower-priority clients are assigned the GPU and, in turn, finish at roughly the same time as one another. This demonstrates Olympian's capability to implement priority scheduling, a crucial feature for service differentiation; a minimal policy-selection sketch follows this list.
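
A minimal, hypothetical sketch of such a policy layer is shown below: it selects the highest-priority level that still has unfinished clients, and within that level picks the client whose weight-normalized GPU time is furthest behind. Names and structure are illustrative assumptions, not Olympian's implementation.

```python
# Illustrative policy layer: strict priority between levels, weighted fair
# sharing within a level (covers fair, weighted-fair, and priority modes).

from dataclasses import dataclass

@dataclass
class Client:
    name: str
    priority: int             # higher value = scheduled first
    weight: float = 1.0       # relative share within a priority level
    gpu_ns_received: int = 0  # GPU time charged to this client so far
    done: bool = False

def pick_next(clients):
    """Return the next client to receive a quantum, or None if all are done."""
    active = [c for c in clients if not c.done]
    if not active:
        return None
    top = max(c.priority for c in active)
    level = [c for c in active if c.priority == top]
    # The most "behind" client, after normalizing GPU time by weight.
    return min(level, key=lambda c: c.gpu_ns_received / c.weight)

# Toy usage: two priority-1 clients with weights 2 and 1, one priority-0 client.
clients = [Client("a", priority=1, weight=2),
           Client("b", priority=1, weight=1),
           Client("c", priority=0)]
print(pick_next(clients).name)  # "a"; client "c" only runs once a and b finish
```

Setting all priorities and weights equal reduces this selection rule to plain fair sharing.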

The following are the results from Figure 18 of the original paper:

Figure 18: Finish times on a homogeneous workload with priority scheduling, for two different priority assignments. The chart plots finish time (seconds) per client id for three configurations: TF-Serving, Olympian with 10-level priority, and Olympian with 2-level priority.

6.1.3. Utilization and Scaling

6.1.3.1. Utilization

  • Comparison: Olympian generally has slightly lower GPU utilization compared to default TF-Serving.
  • Measurements: For 10 Inception clients:
    • Default TF-Serving: 84.74%
    • Olympian (Fair Sharing): 78.62%
    • Olympian (Weighted Fair Sharing): 78.10%
    • Olympian (Priority Scheduling): 76.35%
  • Loss: Olympian reduces utilization by 6-8%. This is because Olympian provides exclusive GPU access to a single job during a quantum, potentially losing opportunities for spatial multiplexing (which is already limited for large DNN batches) or when a job briefly idles the GPU.
  • Priority Scheduling's Lower Utilization: Priority scheduling shows the lowest utilization because, with strict priorities, jobs are effectively serialized, minimizing the brief "overlapped job execution" (Figure 10) that can occur in fair or weighted fair sharing, which slightly boosts utilization.

6.1.3.2. Scalability

  • Memory Limit: Both TF-Serving and Olympian are limited by the GPU memory of the GeForce GTX 1080 Ti, supporting about 45 concurrent clients.
  • Olympian's Thread Pool Limit: For some DNNs (e.g., Inception), Olympian's scalability can be limited by operating system thread pool limits. It supports 40-60 concurrent Inception clients, whereas TF-Serving can support 100.
    • Reason: In Olympian, CPU thread gangs are alternately suspended and resumed, meaning threads are allocated for longer durations. In TF-Serving, threads are always runnable and returned to the thread pool as soon as they finish their work, allowing it to serve more concurrent clients with a smaller thread pool. Olympian reaches the thread pool limit faster.

6.1.4. Other Validations

6.1.4.1. The Importance of Profiling

  • Experiment: Olympian's profiler-based time-slicing was replaced with a simple CPU timer (quantum expires after fixed wall-clock time).

  • Results (Figure 19):

    • Homogeneous Workload: CPU timer-based scheduling resulted in unequal finish times for Inception jobs, similar to default TF-Serving's unpredictability.
    • Heterogeneous Workload: It led to widely varying GPU durations for Inception and ResNet jobs.
  • Conclusion: This strongly validates Olympian's choice of careful GPU profiling and time-slicing based on the GPU cost accumulation rate, as opposed to simplistic CPU timer-based approaches; a sketch of the naive wall-clock variant follows Figure 19 below.

    The following are the results from Figure 19 of the original paper:

    Figure 19: Using a CPU clock instead of profiled GPU usage can result in unequal finish times for homogeneous workloads (left) and widely varying GPU durations for heterogeneous ones (right). The left panel plots finish time (seconds) per client id; the right panel plots average GPU duration (µs) per client id, with client 0 far above the rest (about 1872 µs).
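
For contrast with the cost-based sketch given after Figure 12, the naive baseline evaluated here might look like the hypothetical sketch below: the quantum ends when a CPU timer expires, so a job that happens to dispatch cheap nodes during its slice is charged the same as one that dispatched expensive nodes, and GPU usage is not what gets equalized.

```python
# Hypothetical wall-clock time-slicing baseline (what this subsection argues
# against): the quantum ends on a CPU timer regardless of profiled GPU cost.

import time
from collections import deque

def run_wallclock_quantum(ready_nodes, quantum_s):
    deadline = time.monotonic() + quantum_s
    dispatched = []
    while ready_nodes and time.monotonic() < deadline:
        dispatched.append(ready_nodes.popleft())  # stand-in for a GPU launch
    return dispatched

nodes = deque(f"node{i}" for i in range(1000))
# How much GPU work lands in a 2 ms slice depends on CPU timing, not on the
# profiled cost of the dispatched nodes.
print(len(run_wallclock_quantum(nodes, quantum_s=0.002)))
```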

6.1.4.2. Portability

  • Experiment: The fair sharing experiment (10 Inception clients) was re-run on a different hardware platform (Intel Xeon E5-2603 v4 CPU, 16 GB RAM, NVIDIA Titan X GPU).

  • Results (Figure 21): Olympian achieved similar performance to the original fair sharing experiment. The total finish times were different due to the distinct hardware, but the fairness characteristic was preserved.

  • Conclusion: Olympian demonstrated high portability, requiring no modifications to run on a different GPU platform. This is attributed to inheriting framework and device independence from Tensorflow and TF-Serving.

    The following are the results from Figure 21 of the original paper:

    Figure 21: Finish times on a different hardware platform. The bar chart plots finish time (seconds) per client id; the bars are of similar height, indicating that fairness is preserved on this platform.

6.1.4.3. Cost and Duration Stability

Olympian's profiler relies on the stability of DNN costs and GPU durations.

  • Validation: An Inception model (batch size 100) was run 100 times on TF-Serving to measure stability.
  • Results:
    • Total Cost: Mean = 4,058,477 ns, Standard Deviation = 100,536 ns.
    • GPU Duration: Mean = 262,773 ns, Standard Deviation = 4,462 ns.
  • Conclusion: The standard deviations are small relative to the means (a quick coefficient-of-variation check follows this list), confirming that both the total cost and GPU duration are highly stable across multiple executions and validating a core assumption of Olympian's offline profiling.
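
As a quick arithmetic check, the coefficients of variation implied by these numbers are:

```latex
\mathrm{CV}_{\text{cost}} = \frac{100{,}536}{4{,}058{,}477} \approx 2.5\%,
\qquad
\mathrm{CV}_{\text{duration}} = \frac{4{,}462}{262{,}773} \approx 1.7\%
```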

6.1.4.4. Linear Models for Node Costs

  • Strategy: Instead of profiling every batch size, Olympian profiles a few common batch sizes and uses linear regression to estimate node costs for other batch sizes.

  • Experiment: 10 Inception clients were run with 3 different batch sizes (25, 75, 150). For each, node costs were derived from a linear model trained on profiles of two batch sizes (50 and 100).

  • Results (Figure 20): In all cases, Olympian provided fairness in completion times comparable to that seen in Figure 11 (homogeneous workload with direct profiling).

  • Conclusion: This demonstrates that linear models are effective for generalizing node costs across batch sizes, further reducing the offline profiling effort without sacrificing performance; a small regression sketch follows Figure 20 below.

    The following are the results from Figure 20 of the original paper:

    Figure 20: A linear model for cost profile can provide fairness comparable to Figure 11. The bar chart plots finish time (seconds) per client id for three different batch sizes, with similar finish times within each batch-size group.
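
To illustrate the linear-model strategy, here is a small Python sketch with made-up profile numbers (the node names and per-node costs are assumptions for illustration, not the paper's actual profiles): fit cost = a · batch + b per node from profiles at batch sizes 50 and 100, then predict costs at 25, 75, and 150.

```python
# Sketch of per-node linear cost models across batch sizes.

import numpy as np

profiled = {  # node -> {batch size: profiled cost in ns} (illustrative values)
    "conv1": {50: 210_000, 100: 400_000},
    "fc":    {50:  90_000, 100: 170_000},
}

def fit_linear_models(profiles):
    """Fit cost = a * batch + b for each node from its profiled batch sizes."""
    models = {}
    for node, points in profiles.items():
        xs, ys = zip(*sorted(points.items()))
        a, b = np.polyfit(xs, ys, deg=1)   # least-squares line
        models[node] = (a, b)
    return models

models = fit_linear_models(profiled)
for batch in (25, 75, 150):
    estimates = {node: a * batch + b for node, (a, b) in models.items()}
    print(batch, {node: int(round(c)) for node, c in estimates.items()})
```

With only two profiled sizes the fitted line passes exactly through those points; with more profiled sizes the same call yields a least-squares fit, keeping the offline profiling effort modest.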

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Section | Result |
| --- | --- |
| §4.1 | Even under a diverse workload containing 7 DNNs with different batch sizes, Olympian can achieve the predicted Q while keeping overheads under 2%. |
| §4.2 | Olympian can support priority scheduling and weighted fair sharing. |
| §4.3 | Olympian sacrifices GPU utilization by less than 8% and scales similarly to TF-Serving for some DNNs. |
| §4.4 | Olympian is portable to other GPUs, its cost estimates are stable across runs, and its offline profiler is crucial. |

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Olympian, a novel system designed to improve GPU scheduling in Deep Neural Network (DNN) model serving systems like Tensorflow-Serving (TF-Serving). The core innovation lies in implementing middleware-level time-slicing, where DNN jobs are given exclusive access to the GPU for short, carefully determined quanta. Olympian achieves this by leveraging the predictable nature of DNN computations through offline profiling to estimate GPU resource usage (costs and durations) and determine quantum sizes. It employs cooperative co-scheduling of CPU thread gangs to switch between DNNs with minimal overhead.

The evaluation demonstrates Olympian's effectiveness:

  • It significantly improves the predictability and fairness of DNN job execution compared to default TF-Serving, even for heterogeneous and complex workloads.
  • It can interleave DNNs at millisecond timescales (around 1.8 ms average quantum) while incurring very low overhead (typically less than 2%).
  • It successfully supports advanced scheduling policies such as fair sharing, weighted fair sharing, and priority scheduling, enabling service differentiation crucial for cloud offerings.
  • Olympian maintains high GPU utilization (sacrificing less than 8%) and is portable across different GPU hardware platforms and DNN frameworks without modification. Its reliance on offline profiling and the stability of DNN costs and durations are key to its performance.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Number of Models and GPUs: Future work can expand Olympian to serve more DNN models and support multiple GPUs within a single server. The current work focuses on scheduling DNN jobs on a single GPU.
  • Realistic Workloads: The current evaluation primarily uses homogeneous or controlled heterogeneous workloads. Future work should explore Olympian's performance under more realistic and dynamic workloads (e.g., varying arrival rates, differing job characteristics over time).
  • Scheduling Policies: The set of supported scheduling policies could be expanded to include more sophisticated or domain-specific policies.
  • Exploiting DNN Dependencies: Investigating how to exploit dependencies between DNNs (if they exist) could potentially improve GPU utilization further.
  • Power Measurement: The paper notes that power usage is an important metric that was not evaluated, suggesting it as future work.

7.3. Personal Insights & Critique

Olympian presents an elegant and practical solution to a critical problem in DNN model serving: the lack of predictability and control over GPU resource allocation.

  • Innovation: The core innovation of performing DNN-aware time-slicing at the middleware layer by leveraging offline profiling is particularly insightful. It cleverly sidesteps the high overhead of hardware preemption and the lack of DNN context at lower levels. This "sweet spot" in the software stack allows for both fine-grained control and high portability. The definition of a quantum based on GPU duration (derived from cost accumulation rate) is also a smart design choice, making the scheduling truly GPU-centric rather than just time-centric.
  • Applicability: The methods and conclusions are highly transferable to any model serving system built on data flow graph frameworks where DNN execution is predictable. This could extend beyond Tensorflow to other frameworks if similar cost models or profiling APIs are available. The concept of offline profiling for resource modeling could also be applied to other types of predictable, accelerator-bound workloads.
  • Potential Issues/Areas for Improvement:
    • Predictability of DNNs: While the paper validates the stability of DNN costs for the explored models, this assumption might not hold universally for all future DNN architectures or highly dynamic models (e.g., those with adaptive computation or sparse activation patterns). Continuous monitoring or adaptive re-profiling might be needed in such cases.

    • "Overflow" Implications: Although deemed minimal, the cooperative co-scheduling's "overflow" effect (Figure 10) means Olympian doesn't provide perfect hard GPU isolation. For extremely latency-sensitive or security-critical applications requiring absolute resource guarantees, this slight interference might be a concern. A more detailed analysis of the worst-case overflow and its impact could be beneficial.

    • Thread Pool Limit: The observation that Olympian hits operating system thread pool limits faster than TF-Serving for some DNNs is an interesting scalability bottleneck. Future work could explore optimizations to thread management within Olympian (e.g., more aggressive thread pooling, event-driven models) to mitigate this.

    • Dynamic Workloads and Hot-swapping: The offline profiling approach is efficient for static or slowly changing workloads. For scenarios with frequent model updates, A/B testing, or cold starts of new DNNs, the profiling process needs to be integrated efficiently (e.g., in a CI/CD pipeline) to avoid introducing delays.

    • Multi-GPU Scheduling: The paper explicitly defers multi-GPU scheduling to future work. Extending Olympian to orchestrate DNNs across multiple GPUs while preserving its core principles of fairness and predictability would be a significant next step. This would involve complex decisions about job placement and inter-GPU communication.

    • Power Consumption: As noted by the authors, investigating the power overhead would provide a more complete picture of Olympian's efficiency, especially in energy-conscious environments.

      Overall, Olympian represents a robust and well-thought-out contribution to the field, offering a practical path to managing GPU resources effectively and predictably in the context of DNN serving.
