Olympian: Scheduling GPU Usage in a Deep Neural Network Model Serving System
TL;DR Summary
The paper presents Olympian, which extends TensorFlow-Serving to enable fair sharing of a single GPU among multiple large DNNs with low overhead, interleaving them at timescales of 1-2 ms and greatly reducing latency unpredictability.
Abstract
Deep neural networks (DNNs) are emerging as important drivers for GPU (Graphics Processing Unit) usage. Routinely now, cloud offerings include GPU-capable VMs, and GPUs are used for training and testing DNNs. A popular way to run inference (or testing) tasks with DNNs is to use middleware called a serving system. Tensorflow-Serving (TF-Serving) is an example of a DNN serving system. In this paper, we consider the problem of carefully scheduling multiple concurrent DNNs in a serving system on a single GPU to achieve fairness or service differentiation objectives, a capability crucial to cloud-based TF-Serving offerings. In scheduling DNNs, we face two challenges: how to schedule, and switch between, different DNN jobs at low overhead; and, how to account for their usage. Our system, Olympian, extends TF-Serving to enable fair sharing of a GPU across multiple concurrent large DNNs at low overhead, a capability TF-Serving by itself is not able to achieve. Specifically, Olympian can run concurrent instances of several large DNN models such as Inception, ResNet, GoogLeNet, AlexNet, and VGG, provide each with an equal share of the GPU, while interleaving them at timescales of 1-2 ms, and incurring an overhead of less than 2%. It achieves this by leveraging the predictability of GPU computations to profile GPU resource usage models offline, then using these to achieve low overhead switching between DNNs.
In-depth Reading
1. Bibliographic Information
1.1. Title
Olympian: Scheduling GPU Usage in a Deep Neural Network Model Serving System
1.2. Authors
- Yitao Hu (University of Southern California)
- Swati Rallapalli (IBM Research)
- Bongjun Ko (IBM Research)
- Ramesh Govindan (University of Southern California)
1.3. Journal/Conference
The provided text does not name the publication venue, but the paper's structure and content are consistent with a conference publication in computer systems, networking, or machine learning systems. Venues in these fields are typically highly reputable and influential in their respective domains.
1.4. Publication Year
The publication year is not explicitly stated in the provided text.
1.5. Abstract
Deep neural networks (DNNs) are significant drivers of Graphics Processing Unit (GPU) usage, with cloud platforms increasingly offering GPU-enabled virtual machines (VMs). A common method for performing DNN inference (testing) is through middleware known as a serving system, such as Tensorflow-Serving (TF-Serving). This paper addresses the challenge of scheduling multiple concurrent DNNs on a single GPU within a serving system to achieve objectives like fairness or service differentiation, which are critical for cloud-based TF-Serving offerings.
The authors identify two main challenges in DNN scheduling:
- Efficiently scheduling and switching between different DNN jobs with low overhead.
- Accurately accounting for GPU usage by these jobs.

The proposed system, Olympian, extends TF-Serving to enable fair sharing of a GPU among multiple concurrent large DNNs while maintaining low overhead, a capability that TF-Serving alone lacks. Olympian achieves this by leveraging the predictable nature of GPU computations: it profiles GPU resource usage models offline, then uses these models to facilitate low-overhead switching between DNNs.
Specifically, Olympian is demonstrated to run concurrent instances of large DNN models (e.g., Inception, ResNet, GoogLeNet, AlexNet, VGG), provide each with an equal share of the GPU, interleave them at timescales of 1-2 milliseconds, and incur an overhead of less than 2%.
1.6. Original Source Link
/files/papers/691d6e408106199151d7e8c0/paper.pdf
This appears to be a link to a PDF hosted on a local file system or an internal server, not a public, official publication link. Its publication status (e.g., formally published in a conference, workshop, or as a preprint) is not explicitly stated from the given information.
2. Executive Summary
2.1. Background & Motivation
The proliferation of deep neural networks (DNNs) has made Graphics Processing Units (GPUs) indispensable computational resources. Cloud providers now routinely offer GPU-capable Virtual Machines (VMs), and GPUs are central to both training and inference (also known as testing or prediction) of DNNs. For inference tasks, model serving systems act as middleware, allowing multiple inference requests to share GPU resources. Tensorflow-Serving (TF-Serving) is a prominent example of such a system.
The core problem the paper aims to solve is the unpredictable execution times for concurrent DNN inference tasks when run on a single GPU within a serving system like TF-Serving.
- Why this problem is important:
- Low GPU Utilization: Practical DNN applications often exhibit intermittent and bursty GPU usage, leading to underutilization. Serving systems multiplex these tasks to improve utilization.
- Unpredictable Latency: In TF-Serving, multiple concurrent jobs can submit GPU kernels to the underlying GPU driver. Since the driver cannot distinguish between kernels belonging to different DNNs or client requests, jobs with identical computational needs can finish at vastly different times (e.g., up to 1.7x difference observed in experiments). This variability makes it extremely difficult for developers to engineer latency-sensitive user-facing applications that rely on model servers.
- Lack of Service Differentiation: The inability to control DNN execution precisely prevents model server operators from implementing crucial features like priority scheduling or weighted scheduling. These capabilities are essential for offering differentiated services and potentially new revenue streams in cloud-based TF-Serving offerings.
- Specific challenges/gaps:
  - GPU Driver Limitations: The underlying GPU driver schedules kernels submitted by different jobs without knowledge of which DNN they belong to.
  - High Overhead of Traditional GPU Preemption: While GPU preemption could enforce scheduling, it is known to be expensive, especially for massively parallel GPU contexts. Instruction-level preemption (e.g., in the NVIDIA Pascal architecture) can take hundreds of microseconds per context switch.
  - Middleware-level Control: The TF-Serving middleware is the only point in the stack that understands the high-level structure and requirements of DNNs (e.g., the Tensorflow graph and node abstractions). This knowledge is critical for effective scheduling and isolation.

The paper's entry point is a novel strategy: time-slicing GPU usage at the middleware level rather than relying on lower-level hardware preemption. This allows TF-Serving to gain control over DNN execution and address the predictability and differentiation challenges.
2.2. Main Contributions / Findings
The paper introduces Olympian, a system that extends TF-Serving to provide fine-grained, predictable GPU scheduling for concurrent DNN inference tasks. Its primary contributions and findings are:
- Middleware-level Time-Slicing: Olympian demonstrates that carefully time-slicing GPU usage at the TF-Serving middleware layer (specifically at the Tensorflow node boundary) is a viable and effective strategy for achieving isolation and predictability, without requiring specialized hardware support or changes to GPU drivers.
- Low-Overhead Scheduling: It achieves efficient scheduling and switching between DNN jobs at very low overhead. Experiments show Olympian can interleave DNNs at timescales of 1-2 milliseconds with an overhead of less than 2%.
- Offline Profiling for Predictability: Olympian leverages the predictable nature of GPU computations in DNNs to profile GPU resource usage models offline. This offline profiling eliminates the high overhead of online monitoring (which can inflate execution times by 21-29%) and enables accurate estimation of quantum durations and cost accumulation rates.
- Cooperative Co-scheduling: Olympian employs cooperative co-scheduling, where the entire "gang" of CPU threads associated with a DNN job is suspended and resumed. This approach avoids expensive hardware preemption while maintaining control.
- Achieving Fairness and Service Differentiation: Olympian successfully enables:
  - Fair Sharing: Providing nearly identical finish times for homogeneous workloads and fair distribution of GPU duration for diverse, heterogeneous workloads, which TF-Serving alone cannot achieve (Figures 11, 14, 16).
  - Weighted Fair Sharing: Implementing policies where jobs receive GPU access proportional to assigned weights (Figure 17).
  - Priority Scheduling: Allowing higher-priority jobs to execute before lower-priority ones (Figure 18).
- Portability and Compatibility: The implementation preserves TF-Serving's portability across different GPUs and supports models originally written in other frameworks (e.g., Caffe models converted to Tensorflow), as demonstrated by experiments on different hardware (Figure 21).
- Minimal Utilization Sacrifice: Despite introducing temporal multiplexing, Olympian sacrifices little GPU utilization (a 6-8% reduction compared to default TF-Serving).
- Scalability: Olympian exhibits scalability comparable to TF-Serving in the number of concurrent DNNs it can support; both are usually limited by GPU memory, though for some DNNs Olympian is instead limited by the operating system's thread pool size.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Deep Neural Networks (DNNs)
Deep neural networks (DNNs) are a class of machine learning algorithms inspired by the structure and function of the human brain. They consist of multiple layers of interconnected "neurons" that process data in a hierarchical manner. Each layer learns to recognize different features of the input, passing its output to the next layer. DNNs have achieved state-of-the-art results in various complex tasks, including image recognition, natural language processing, and speech recognition. Their computational structure involves numerous matrix multiplications and other mathematical operations, making them highly suitable for parallel processing.
Graphics Processing Units (GPUs)
Graphics Processing Units (GPUs) are specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. However, their highly parallel architecture, consisting of thousands of cores (processing units), makes them exceptionally well-suited for a broad range of general-purpose computations, particularly those involving large-scale data parallelism. This characteristic makes GPUs ideal for accelerating DNN operations, significantly reducing the time required for both training (learning the model parameters) and inference (using the trained model to make predictions).
DNN Serving Systems
A DNN serving system is a piece of middleware software designed to deploy and manage DNN models for inference in production environments. It acts as an intermediary between client applications and the DNN models themselves. Key functions include:
- Model Loading: Loading and managing multiple DNN models efficiently.
- Request Handling: Receiving inference requests from clients.
- Batching: Grouping multiple incoming requests into larger batches to improve GPU utilization (processing a batch of inputs together is often more efficient than processing them one by one).
- Execution: Orchestrating the execution of DNN models on available hardware (e.g., GPUs or CPUs).
- Response Generation: Returning inference results to clients.

Tensorflow-Serving (TF-Serving) is a prominent open-source example, built specifically for Tensorflow models.
Tensorflow
Tensorflow is an open-source software library for dataflow and differentiable programming across a range of tasks. It is primarily used for machine learning applications such as neural networks. In Tensorflow, computations are represented as data flow graphs, where:
- Nodes: Represent mathematical operations (e.g., addition, matrix multiplication, convolution).
- Edges: Represent the tensors (multi-dimensional arrays) that flow between operations and also encode dependencies between operations.

Tensorflow's runtime system executes these graphs. Some nodes run on a CPU, while others, particularly data-parallel operations, are offloaded to a GPU. A Tensorflow node that runs on a GPU typically invokes one or more GPU kernels.
GPU Kernels
A GPU kernel is a program or function that runs on a GPU. It represents an elemental data-parallel computation. Programmers define a kernel computation in terms of threads, which are grouped into thread blocks. These thread blocks are then assigned to stream multiprocessors (SMs) on the GPU. Further, thread blocks are subdivided into warps (groups of threads) that are scheduled concurrently. The GPU driver is responsible for managing the execution of these kernels, including queuing them and scheduling them on the available GPU hardware.
Batching
Batching in DNN inference refers to the practice of grouping multiple individual input data samples (e.g., images) into a single, larger input batch before feeding them to the DNN model for processing.
- Purpose: Batching significantly improves GPU utilization because GPUs are highly efficient at processing large amounts of data in parallel. Processing a batch of 100 images together can be much faster than processing 100 images individually, due to reduced overhead from kernel launches and CPU-GPU data transfers, and better exploitation of GPU parallelism.
- Mechanism in Tensorflow: Tensorflow implements batching by adding specific nodes to the DNN graph that decode inputs and arrange them into appropriate multi-dimensional matrices for subsequent processing.
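As a minimal illustration (a hypothetical Python/NumPy sketch, not the paper's code), batching amounts to stacking individual inputs into one higher-dimensional tensor before a single inference call:

```python
import numpy as np

def make_batch(images):
    """Stack individual HxWxC images into one NxHxWxC batch tensor.

    Running one inference call on the stacked batch amortizes kernel-launch
    and CPU-GPU transfer overhead across all N inputs.
    """
    return np.stack(images, axis=0)

# Example: 100 RGB images of size 224x224 become one (100, 224, 224, 3) batch.
images = [np.random.rand(224, 224, 3).astype(np.float32) for _ in range(100)]
batch = make_batch(images)
assert batch.shape == (100, 224, 224, 3)
```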
Scheduling and Time-Slicing
Scheduling is the process of deciding which tasks get access to resources (e.g., CPU, GPU) and for how long. Time-slicing is a common scheduling technique where a shared resource is allocated to different tasks for short, fixed durations (called quanta or time slices) in a round-robin or similar fashion. After a quantum expires, the current task is preempted and another task gets access to the resource. This creates an illusion of parallel execution and ensures fairness or responsiveness. In the context of GPUs, time-slicing aims to allow multiple DNNs to share the GPU by rapidly switching between them.
GPU Preemption
GPU preemption is the ability of a GPU or its driver to interrupt an ongoing GPU computation (a kernel or a group of kernels), save its state, and switch to another computation. This is analogous to CPU process preemption in operating systems.
- Levels of Preemption:
  - Kernel-level preemption: Interrupting execution only after a GPU kernel completes.
  - Thread-block level preemption: Interrupting execution after a thread block completes.
  - Instruction-level preemption: Interrupting execution at almost any point, down to individual instructions.
- Cost: While desirable for responsiveness and fairness, GPU preemption can be very expensive: saving and restoring the state of a massively parallel GPU context (its many threads, registers, and memory state) incurs significant overhead, often hundreds of microseconds. This high overhead limits the practical frequency of preemption and thus the granularity of scheduling.
3.2. Previous Works
The paper contextualizes its work by referencing prior research in GPU multiplexing and scheduling, identifying gaps that Olympian aims to fill.
- GPU Multiplexing via Kernel Slicing:
  - Several works ([2, 4, 19, 23, 31, 33]) have explored novel kernel slicing approaches to overcome the high cost of hardware preemption. These methods break GPU kernels into smaller, more manageable units that can be interleaved.
  - Limitation: These often incur significant preemption overheads. Furthermore, they frequently require programmers to explicitly specify slicing points, or demand rewriting/recompiling code against a new framework, reducing portability and ease of adoption for existing DNN models. Olympian distinguishes itself by working with pre-trained models and requiring no changes to application code or frameworks.
- Hardware Extensions for Multiprogrammed GPU Workloads:
  - Prior work ([27]) proposed hardware extensions to facilitate multiprogrammed GPU workloads, including preemption mechanisms for GPU scheduling policies.
  - Limitation: Relying on hardware preemption typically results in high overhead. Olympian avoids this by using offline profiling for resource usage metering, which does not depend on specialized hardware preemption features.
- Timegraph ([15]):
  - Timegraph handles scheduling of multiple tasks using priority and reservation policies to ensure responsiveness for high-priority tasks. It achieves this through modifications to the GPU device driver.
  - Limitation: Its reliance on device driver modifications means that running Timegraph on different GPU devices would require additional implementation effort for each new device type. Olympian aims for GPU-independent operation.
- GPU Virtualization:
  - A body of work ([24, 30, 32]) has explored full and para-virtualization of GPUs. This involves creating virtual GPUs that can be assigned to Virtual Machines (VMs), allowing multiple VMs to share a physical GPU.
  - Limitation: This line of work often suffers from poor performance due to the high GPU context-switching overheads associated with virtualization, which is generally too heavyweight for model-serving systems. Olympian provides lightweight time-slicing to achieve isolation objectives, which is more appropriate in a model-serving context.
- GPU Call Interposition:
  - Other research ([7, 8, 10, 11, 17, 20, 21]) has explored GPU call interposition for purposes like remote execution or GPU multiplexing in virtualized environments. Some of these also investigate scheduling GPU calls from different VMs ([20]) or co-scheduling a VM with its associated GPU ([11]).
  - Limitation: While these approaches address GPU sharing, Olympian specifically targets model serving systems, uses the DNN as the fundamental unit of scheduling, and explicitly allocates GPU time based on offline profiling of DNN-specific characteristics.
3.3. Technological Evolution
The evolution of GPU usage for DNNs has been rapid:
- Early GPUs for Graphics: Initially designed for rendering graphics, GPUs possessed inherent parallelism suitable for certain scientific computations.
- GPGPU (General-Purpose GPU) Computing: Researchers and developers began leveraging GPUs for general-purpose parallel computing, leading to frameworks like CUDA (NVIDIA) and OpenCL.
- DNN Acceleration: The DNN revolution found GPUs to be ideal accelerators, thanks to their ability to perform massive matrix multiplications and other parallel operations efficiently. This led to significant performance gains in training and inference.
- DNN Serving Systems: As DNNs moved from research labs to production, the need arose for efficient deployment and management. DNN serving systems like TF-Serving emerged to manage model loading, batching, and concurrent inference requests.
- Challenges in Shared Environments: With GPUs becoming shared resources in cloud environments and serving systems, challenges like resource contention, unpredictable latency, and the need for service differentiation became prominent. Existing GPU scheduling techniques (e.g., low-level preemption, virtualization) often introduce high overheads or lack DNN-specific awareness.

Olympian fits into this evolution by addressing the critical need for predictable and controllable GPU sharing within DNN serving systems. It moves beyond generic GPU scheduling by exploiting the unique characteristics of DNNs (predictable structure, node-level granularity) at the middleware level, where DNN context is available.
3.4. Differentiation Analysis
Olympian's core innovations and differences from related work stem from its choice of middleware-level scheduling and its reliance on offline profiling of DNN characteristics.
| Feature / Aspect | Related Work (e.g., Kernel Slicing, Hardware Preemption, Virtualization) | Olympian |
|---|---|---|
| Scheduling Level | Often at the GPU driver, hardware, or hypervisor level. | Middleware level (TF-Serving): crucial because only TF-Serving understands the DNN graph and node abstractions. This high-level context allows intelligent scheduling decisions that are impossible at lower levels. |
| Unit of Scheduling | GPU instructions, GPU kernels, thread blocks, or entire VMs. | Tensorflow node: Olympian interleaves DNNs at Tensorflow node boundaries. This is coarser than kernel- or instruction-level preemption, but still fine-grained enough for responsiveness; crucially, it is GPU-independent and aligns with DNN computation units (Figure 4 shows most nodes have very small durations). |
| Preemption/Switching | Relies on hardware preemption (expensive; context switches take hundreds of microseconds) or requires programmer annotations for slicing. | Cooperative co-scheduling: Olympian suspends all CPU threads of one DNN job's gang and resumes another's, avoiding expensive hardware preemption and its GPU context-switching overheads. While not "perfect" isolation (Figure 10), the overflow is minimal due to short node durations. |
| Overhead | Hardware preemption (high overhead), online profiling (21-29% overhead), virtualization (heavyweight). | Low overhead (<2%): achieved by offline profiling and cooperative co-scheduling. The overhead is incurred mainly on the CPU for thread suspension/resumption. |
| Resource Usage Metering | Typically not available at a fine-grained, DNN-aware level, or requires online instrumentation. | Offline profiling: leverages the predictability of DNN execution to build offline profiles of GPU resource usage (costs and durations). This avoids the overhead of online profiling (Figure 6) and lets Olympian accurately estimate when a quantum has elapsed. The cost accumulation rate ($C_j/D_j$) is a key innovation here. |
| Portability | Often tied to specific GPU hardware, drivers, or frameworks. | High portability: inherits framework and device independence from Tensorflow and TF-Serving. It abstracts GPU details and can run models from other frameworks (e.g., Caffe converted to Tensorflow). Demonstrated on different GPU platforms without modification (Figure 21). |
| Scheduling Objectives | Focus on responsiveness or general multiplexing; lacks DNN-specific QoS. | DNN-aware QoS: enables fair sharing, weighted fair sharing, and priority scheduling for DNNs, providing the predictable latency and service differentiation crucial for cloud-based serving systems. |
| Application Changes | May require programmer intervention, code modifications, or recompilation. | No application changes: works with pre-trained models and requires no modifications to the DNN models or client applications. |
| GPU Utilization | A primary goal, but sometimes at the cost of predictability or with high overheads. | Maintains high utilization: while introducing time-slicing, Olympian sacrifices only 6-8% utilization while significantly improving predictability. |
| "Quantum" Definition | Often defined purely by elapsed wall-clock time (for CPUs) or kernel counts. | GPU duration (cost-based): Olympian defines a quantum by the total duration a DNN runs at least one Tensorflow node on the GPU (Figure 5), estimated via the cost accumulation rate, making the quantum meaningful in terms of GPU resource consumption. Simple CPU-timer-based time-slicing fails to provide isolation (Figure 19). |
4. Methodology
4.1. Principles
The core principle behind Olympian is to achieve predictable and differentiated GPU scheduling for concurrent Deep Neural Networks (DNNs) in a model serving system like Tensorflow-Serving (TF-Serving) by implementing middleware-level time-slicing. This strategy is built on two fundamental ideas:
- Leveraging DNN Predictability with Offline Profiling: DNNs exhibit predictable GPU resource usage for a given model and batch size. Olympian exploits this by performing offline profiling to characterize the GPU resource consumption of DNNs, allowing it to estimate GPU usage without incurring online monitoring overheads.
- Cooperative Co-scheduling of CPU and GPU Resources: Instead of relying on expensive hardware GPU preemption, Olympian implements cooperative co-scheduling. It allocates exclusive access to the GPU (and implicitly the associated CPU threads) to a single DNN job for a short duration (a quantum). At the end of this quantum, it cooperatively suspends the current job's CPU threads and resumes another job's CPU threads, thereby switching GPU access.

By combining these principles, Olympian gains fine-grained control over GPU allocation while maintaining low overhead and high portability.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. The TF-Serving Execution Model and its Drawbacks
TF-Serving is a model serving system that manages DNN inference tasks. Internally, when a client submits a job (an inference request for a specific DNN model with input data):
1. A separate CPU thread from a thread pool is assigned to each job.
2. This thread traverses the DNN's data flow graph in breadth-first search (BFS) order, starting from the root node.
3. Nodes are inserted into a FIFO queue.
4. The session's thread dequeues a node, executes it, and then examines its children.
5. If a child node needs to be executed asynchronously (typically GPU-intensive nodes), another CPU thread from the thread pool is fetched to manage its execution; if no threads are available, execution may be delayed. Other child nodes are re-inserted into the queue.

This process is summarized in Algorithm 1 of the paper (reproduced there as a figure).
In this model, a single DNN job is executed by a collection of CPU threads, which the authors refer to as a "gang." Each thread in this gang is generally responsible for executing a GPU node. This design allows TF-Serving to queue multiple GPU kernels concurrently. When multiple DNN instances run concurrently, each can queue its kernels to the GPU, leading to interleaving of execution by default.
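The following hypothetical Python sketch (invented names; not the TF-Serving source, which is written in C++) mirrors Algorithm 1's structure, a BFS in which synchronous (CPU) children go back on the queue while asynchronous (GPU) children are handed to fresh threads from a pool:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def process(node, pool):
    """BFS traversal of a DNN data flow graph, in the spirit of Algorithm 1.

    Each call runs on one CPU thread of the job's "gang". Asynchronous
    (GPU-bound) children are handed to additional threads from the pool,
    so a single job ends up executed by a collection of CPU threads.
    """
    bfs_queue = deque([node])
    while bfs_queue:
        cur = bfs_queue.popleft()
        cur.compute()                    # execute the node (may launch GPU kernels)
        for child in cur.children:
            if child.is_async:           # GPU-intensive: manage on another thread
                pool.submit(process, child, pool)
            else:                        # CPU node: keep traversing on this thread
                bfs_queue.append(child)

# Usage sketch: one gang per job, threads drawn from a shared pool.
# pool = ThreadPoolExecutor(max_workers=1000)
# pool.submit(process, model_graph.root, pool)
```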
The main drawback of TF-Serving's execution model:
The GPU driver, which ultimately schedules the GPU kernels, lacks knowledge about which kernels belong to which DNN job. It cannot differentiate between kernels from different DNNs invoked by different clients. This leads to:
- Unpredictable Finish Times: DNN jobs with identical resource requirements can finish at significantly different times (e.g., Figure 3 shows finish times varying by up to 1.7x for 10 concurrent Inception clients). This variability makes it hard to engineer latency-sensitive applications.
- Lack of Control: Without DNN-aware scheduling, TF-Serving cannot implement priority scheduling or weighted sharing, preventing service differentiation.

The following are the results from Figure 3 of the original paper:
Figure 3 (from the original paper): bar chart of finish times (seconds) per client id across two runs (Run-1 in blue, Run-2 in red), showing noticeable variation in finish times across clients and runs.
4.2.2. Olympian's Design Choices and Rationale
To address TF-Serving's limitations, Olympian introduces a time-slicing approach at the middleware level, based on several key design choices:
4.2.2.1. The Unit of Interleaving
Olympian interleaves DNNs at the granularity of a Tensorflow node.
- Rationale:
  - Low-level preemption (GPU instruction/kernel): requires hardware preemption, which is known to be expensive (hundreds of microseconds per context switch). While instruction-level preemption exists in newer GPUs, its overhead is too high for fine-grained control.
  - Tensorflow node vs. GPU kernel: Tensorflow nodes often invoke one or a small number of GPU kernels, so the difference in granularity is minimal.
  - Simplicity and Portability: switching at the Tensorflow node level can be implemented in a GPU-independent manner within TF-Serving (after a node finishes execution in Algorithm 1, Line 9).
  - Small Node Durations: most nodes in modern DNNs have very small execution times. As illustrated in Figure 4, for Inception, over 80% of nodes take less than 20 microseconds, and over 90% take less than 1 ms. This allows fine-grained interleaving without significantly inflating the quantum.
  - Data-unit level interleaving: not considered, because processing a single image (a data unit) can take tens of milliseconds, which is too coarse for responsive scheduling.

The following are the results from Figure 4 of the original paper:
Figure 4 (from the original paper): CDF of Tensorflow node durations for the Inception model at batch sizes 10 (blue) and 100 (red); x-axis is node duration in microseconds, y-axis is the CDF.
4.2.2.2. Defining the Quantum
Unlike traditional CPU scheduling where a quantum is simply elapsed wall-clock time, Olympian defines a quantum as the total duration of time that a DNN runs at least one Tensorflow node on the GPU.
- Rationale:
  - A fixed CPU time quantum is insufficient because DNNs execute some nodes on the CPU and others on the GPU; a DNN's GPU usage can vary widely within a fixed CPU time slice.
  - Defining the quantum based on GPU duration (Figure 5) provides a more meaningful and controllable unit for allocating GPU resources, allowing Olympian to allocate the GPU to each DNN for predictable durations of actual GPU usage.
  - Other definitions, such as actual GPU resources consumed (e.g., number of cores), are not exposed by current GPUs or would require expensive online profiling.

The following are the results from Figure 5 of the original paper:
Figure 5 (from the original paper): timeline schematic of Nodes 1 through 4 on the GPU, illustrating how the spans during which at least one node is running make up the job's GPU duration.
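Because several nodes can run on the GPU concurrently, computing a job's GPU duration amounts to taking the union of per-node execution intervals. A small hypothetical sketch (not the paper's instrumentation code):

```python
def gpu_duration(intervals):
    """Total time during which at least one node runs on the GPU.

    intervals: list of (start_us, end_us) execution spans of GPU nodes;
               overlapping spans are merged so that concurrent node
               executions are only counted once.
    """
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:     # disjoint span: flush previous
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                      # overlapping span: extend
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

# Example: two overlapping nodes (0-100, 50-150) plus one disjoint node (200-250)
assert gpu_duration([(0, 100), (50, 150), (200, 250)]) == 200
```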
4.2.2.3. How to Estimate Quantum Expiration
To determine when a quantum expires based on GPU duration, Olympian relies on offline profiling rather than online instrumentation.
- Rationale for Offline Profiling:
  - Tensorflow provides a cost profiler (using the CUDA Profiling Tools Interface) that can track DNN execution details. However, running this cost profiler online incurs high overhead, inflating execution times by 21% to 29% (Figure 6). This is unacceptable for a production serving system.
  - DNN execution is largely predictable for a given model and batch size. Olympian exploits this by profiling GPU resource usage offline (when the GPU is idle). This profile is computed once per model (for a few common batch sizes, using linear regression for others) and then reused, adding no overhead to inference tasks.

The following are the results from Figure 6 of the original paper:
Figure 6 (from the original paper): bar chart of completion times (seconds) for several DNN models (Inception, GoogLeNet, AlexNet, etc.) with the online cost profiler enabled (red) and disabled (blue), showing the inflation caused by online profiling.
4.2.2.4. How to Switch Between Jobs
Olympian employs cooperative co-scheduling of CPU thread gangs.
- Mechanism: At the end of a quantum, Olympian suspends all CPU threads belonging to the current DNN job's gang and resumes all CPU threads of the next job's gang.
- Rationale:
  - Each TF-Serving job can involve a thousand or more CPU threads; suspending and resuming only the GPU-related threads is complex.
  - This cooperative approach avoids preempting CPU threads (and their corresponding GPU operations), a design choice rejected due to its high overhead.
  - Implication (Partial Isolation): Not all threads of a job are suspended simultaneously. If a CPU thread is managing a GPU kernel that has already been launched, that kernel will continue to execute on the GPU even after the scheduling decision and thread suspension. This means Olympian cannot guarantee perfectly exclusive GPU access at all times. However, since most nodes execute for very small durations (Figure 4), this "overflow" is minimal, and its impact is managed by charging the cost to the original job (Figure 10).

The following are the results from Figure 10 of the original paper:
Figure 10 (from the original paper): timeline schematic of multi-threaded scheduling around a switch point, showing how kernels launched before a scheduling decision briefly overlap with the next job's execution.
4.2.3. Olympian Architecture
Olympian consists of two main components: a Profiler and a Scheduler.
The following are the results from Figure 7 of the original paper:
Figure 7 (from the original paper): schematic of the Olympian architecture. Client jobs are processed by TF-Serving, which incorporates the Profiler and Scheduler modules; an administrator configures the scheduling policy, the Profiler profiles new models, and TF-Serving schedules execution across the operating system and GPU.
4.2.3.1. The Profiler
The Profiler is responsible for offline profiling DNNs to determine their GPU resource usage characteristics.
- Tensorflow Cost Model API: Tensorflow provides an API to obtain a cost model for a data flow graph. This cost model includes the time taken to execute each node (the node cost), which is an approximate measure of the GPU resources needed. While the sum of node costs is not directly the total GPU execution time (due to concurrency), both the individual node costs and the total GPU duration of a DNN are stable across runs when it has exclusive GPU access.
Cost Accumulation Rate: This is a key concept for
Olympianto estimatequantum completion. Let be the sum of allnode costsinDNN j, obtained fromTensorflow's cost model API. Let be the totaldurationofDNN jon theGPU, defined as the total time that at least onenodebelonging to runs on theGPU(as illustrated in Figure 5). is measured by instrumentingTensorflow offline. The averagecost accumulation rateforDNN jis then .If a desired
quantumduration is , then ajobis considered to have completed onequantumwhen it has accumulated acostequal to acost accumulation threshold, . The formula for is: Where:- : The
cost accumulation thresholdforDNN j, indicating when onequantumforDNN jhas expired. - : The desired
quantum duration(in time units, e.g., microseconds). - : The sum of all
node costsinDNN j, as estimated byTensorflow's cost model API(in arbitrary cost units). - : The total
GPU durationforDNN j(in time units, e.g., microseconds), measured offline whenDNN jhas exclusive access to theGPU. The ratio effectively represents the "cost per unit of GPU time."
- : The
- Overhead-Q Curves: To determine an appropriate $Q$, Olympian profiles the overhead as a function of $Q$:
  - For a given DNN $j$, the Profiler first computes $C_j$ and $D_j$.
  - It then runs two instances of DNN $j$ on unmodified TF-Serving (case a) and two instances on Olympian (case b) for various values of $Q$.
  - Overhead is calculated as the difference in finish times between case (b) and case (a), divided by the finish time in case (a).
  - These experiments generate Overhead-Q curves (Figure 8), showing how overhead decreases as $Q$ increases. Overhead-Q curves are precomputed for common batch sizes and can be estimated for others using linear regression.
The following are the results from Figure 8 of the original paper:
Figure 8 (from the original paper): Overhead-Q curves for several DNN models (Inception, GoogLeNet, AlexNet, etc.); overhead declines for every model as Q increases.
- Determining Q: The final step is to select an appropriate $Q$, driven by an operator-specified overhead tolerance (e.g., 2.5%). Olympian finds the $Q$ value on the Overhead-Q curve that corresponds to this tolerance. If multiple DNNs are involved, it takes the largest such $Q$ among them, ensuring the overhead never exceeds the tolerance for any DNN. Once $Q$ is determined, the Profiler computes $T_j$ for each DNN.
Intuition for Predictability: The approach relies on the observed stability of total
node cost() and totalGPU duration() for a givenDNNandbatch sizeacross different executions when theDNNhas exclusiveGPU access. This stability arises because individualnodeshave deterministic executions, and theDNN'sduration is determined by itsdata flow graphstructure.Olympianaims to ensure eachjobgets exclusiveGPU accessduring itsquantum, effectively spreading out its execution timeline (Figure 9).The following are the results from Figure 9 of the original paper:
该图像是示意图,展示了在时间轴T上,节点Node1到Node4在不同时刻t1和t2的调度情况。左侧图展示了并行进程,右侧图则显示了可能的切换点。具体调度理论可以用公式描述,比如 和 之间的时间分配。
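Pulling the Profiler's pieces together, here is a minimal, hypothetical Python sketch (invented names and sample values; not Olympian's code) of the two offline computations described above: deriving the cost accumulation threshold $T_j$, and choosing $Q$ from Overhead-Q curves under an operator's overhead tolerance:

```python
def cost_threshold(total_cost_cj, gpu_duration_dj_us, quantum_q_us):
    """T_j = Q * (C_j / D_j): the cost a job must accumulate per quantum."""
    rate = total_cost_cj / gpu_duration_dj_us      # cost accumulation rate C_j / D_j
    return quantum_q_us * rate

def pick_q(overhead_q_curves, tolerance):
    """Smallest Q per DNN meeting the tolerance, then the max across DNNs,
    so that no DNN's overhead exceeds the tolerance."""
    per_model_q = []
    for curve in overhead_q_curves.values():       # curve: list of (q_us, overhead)
        per_model_q.append(min(q for q, ovh in curve if ovh <= tolerance))
    return max(per_model_q)

# Hypothetical Overhead-Q samples (microseconds, fractional overhead):
curves = {
    "inception": [(500, 0.06), (1000, 0.03), (1500, 0.02)],
    "resnet152": [(500, 0.05), (1000, 0.024), (1500, 0.015)],
}
q = pick_q(curves, tolerance=0.025)                # -> 1500 us
# Hypothetical profile: C_j = 4.0e6 cost units over D_j = 2.6e5 us of GPU time.
print(q, cost_threshold(4.0e6, 2.6e5, q))          # threshold ~ 23077 cost units
```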
4.2.3.2. The Scheduler
Olympian's scheduler is integrated into TF-Serving's processing loop with minimal modifications.
The modified scheduling logic is shown in Algorithm 2:
- Job Registration and Cost Tracking (Lines 2-6, 8, 15):
  - Each job's session (srInfo) is registered with the scheduler (Line 4).
  - A cumulatedCost variable (Line 5), shared by all threads in the job's gang, tracks the cost consumed by the job.
  - When a GPU node completes execution (compute(curNode) in Line 13), the CPU thread managing it updates cumulatedCost by adding curNode.cost() (Line 15).
- Quantum Expiration and Scheduling Decision (Lines 16-18):
  - If cumulatedCost exceeds the threshold $T_j$ (Line 16), the job has completed its quantum.
  - cumulatedCost is decremented by threshold (Line 17) to account for the just-completed quantum.
  - The scheduler.updateTokenInfo(srInfo) function is called (Line 18) to trigger a scheduling decision. This function applies the active scheduling policy (e.g., round-robin) to determine which job gets exclusive GPU access next and updates an internal "token" accordingly.
- Cooperative Yielding (Line 12):
  - Before compute(curNode) (Line 13), scheduler.yield(srInfo) is called (Line 12).
  - During this call, if the current job no longer holds the "token" (i.e., it does not have exclusive GPU access according to the scheduler's decision), the CPU thread suspends itself.
  - The scheduler uses condition variables: CPU threads of a job wait on their condition variable if their job loses GPU access; when a job is granted GPU access, its condition variable is notified, waking up its CPU threads.
- Thread Management (Lines 19-24):
  - As in TF-Serving, child nodes are processed: if not async (e.g., CPU nodes), they are pushed to bfsQueue; if async (e.g., GPU nodes), a new CPU thread is fetched to Process them.
- Job Deregistration (Line 7): When a job completes, it deregisters from the scheduler.
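To make the token-and-threshold logic concrete, here is a minimal, hypothetical Python sketch of the scheduler's core (names such as yield_if_not_holder and update_token are invented for illustration; Olympian itself is implemented as modifications inside TF-Serving):

```python
import threading
from collections import deque

class Scheduler:
    """Round-robin token scheduler in the spirit of Olympian's Algorithm 2."""

    def __init__(self):
        self.lock = threading.Lock()
        self.cond = threading.Condition(self.lock)
        self.jobs = deque()        # registered job ids, in round-robin order
        self.holder = None         # job currently holding the GPU "token"

    def register(self, job_id):
        with self.cond:
            self.jobs.append(job_id)
            if self.holder is None:
                self.holder = job_id

    def deregister(self, job_id):
        with self.cond:
            self.jobs.remove(job_id)
            if self.holder == job_id:
                self.holder = self.jobs[0] if self.jobs else None
                self.cond.notify_all()

    def yield_if_not_holder(self, job_id):
        # Called before each node executes (Algorithm 2, Line 12): threads of
        # jobs that do not hold the token block on the condition variable.
        with self.cond:
            while self.holder is not None and self.holder != job_id:
                self.cond.wait()

    def update_token(self):
        # Called when a quantum expires (Line 18): rotate the token and wake
        # the next job's gang of threads.
        with self.cond:
            self.jobs.rotate(-1)
            self.holder = self.jobs[0]
            self.cond.notify_all()

def on_gpu_node_done(sched, job, node_cost, threshold):
    """Per-node bookkeeping (Lines 15-18): accumulate cost; on crossing the
    threshold T_j, charge one quantum and trigger a scheduling decision."""
    job["cumulated_cost"] += node_cost
    if job["cumulated_cost"] > threshold:
        job["cumulated_cost"] -= threshold
        sched.update_token()
```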
Cooperative Nature and "Overflow" (Revisited):
As noted in Figure 10 and discussed in the paper, Olympian's cooperative co-scheduling means that not all CPU threads of a job are suspended simultaneously. If a CPU thread has already launched a GPU kernel (compute(curNode) in Algorithm 2, Line 13) before scheduler.yield(srInfo) (Line 12) leads to its suspension, that GPU kernel will continue running on the GPU even after the scheduling decision has shifted GPU access to another job. This can lead to a slight "overflow" where kernels from the previous job briefly run concurrently with kernels from the new job. However, given that most Tensorflow nodes have very short durations, this overlap is minimal. Olympian accounts for this by charging any cost accumulated by these "overflow" nodes to the quantum of the original job.
4.2.4. Scheduling Policies
Olympian provides a framework for implementing various scheduling policies on top of its time-slicing mechanism. The paper mentions three implemented policies:
- Fair Sharing: jobs are scheduled in round-robin fashion, with each job receiving one quantum per turn.
- Weighted Fair Sharing: a variant of fair sharing in which each job receives a number of quanta proportional to an assigned integer weight during its turn.
- Priority Scheduling: at any given time, the job with the highest priority receives the next quantum.
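These policies differ only in how the next token holder is chosen. A hypothetical sketch (companion to the Scheduler above) could swap in one of the following policy functions:

```python
def round_robin(jobs, current):
    """Fair sharing: the next job in cyclic order gets one quantum."""
    idx = jobs.index(current)
    return jobs[(idx + 1) % len(jobs)]

def weighted_round_robin(jobs, current, weights, served):
    """Weighted fair sharing: keep the token until `current` has used
    `weights[current]` quanta this turn, then move on."""
    served[current] += 1
    if served[current] < weights[current]:
        return current          # same job gets another quantum
    served[current] = 0
    idx = jobs.index(current)
    return jobs[(idx + 1) % len(jobs)]

def strict_priority(jobs, current, priority):
    """Priority scheduling: the highest-priority active job always runs next."""
    return max(jobs, key=lambda j: priority[j])
```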
4.3. GPU Multiplexing Strategy
The paper notes that TF-Serving can efficiently multiplex GPU kernels both spatially (assigning different kernels' thread blocks to different stream multiprocessors) and temporally (switching between kernels within the same stream multiprocessor), which often yields high GPU utilization. However, DNNs with batching perform pixel-level data-parallel operations on large batches of images, and the total number of pixels can exceed the GPU's parallelism. This leaves little room for spatial multiplexing between different client requests: for example, two concurrent Inception jobs take twice as long as one.
For this reason, Olympian consciously chooses to focus only on temporal multiplexing. It aims to give one DNN exclusive temporal access to the GPU for a quantum, rather than trying to spatially interleave kernels from different DNNs, which is less effective for large batch sizes and hinders predictability.
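To see why large batches leave no spatial headroom, consider a back-of-the-envelope check (using Inception-v4's 299x299 input size and the GeForce GTX 1080 Ti's 3584 CUDA cores as assumed figures): a batch of 100 images contains about $100 \times 299 \times 299 \approx 8.9 \times 10^6$ pixels, several orders of magnitude more data-parallel work items than the GPU has cores, so a single job's kernels can already saturate all stream multiprocessors.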
5. Experimental Setup
The evaluation of Olympian was conducted using an implementation built on TF-Serving (version 1.2). The Olympian codebase, encompassing the profiler, scheduler, and scheduling policies, consists of 5234 lines of code.
5.1. Datasets
The paper uses several well-known Deep Neural Network (DNN) models for image classification. These models serve as the "datasets" in the sense that their computational graphs and resource demands are the subject of the scheduling experiments.
The DNN models used are:
- Inception-v4 ([25], abbreviated Inception)
- ResNet-152 ([13], abbreviated ResNet)
- GoogLeNet ([26])
- AlexNet ([16])
- VGG ([22])
- ResNet-50 ([13])
- ResNet-101 ([13])

These models vary in complexity, number of nodes, and GPU usage characteristics, making them suitable for evaluating Olympian under diverse workloads. Table 2 provides details on these models at the batch sizes used:
The following are the results from Table 2 of the original paper:
| Model | Batch Size | Nodes | GPU Nodes | Runtime (s) |
|---|---|---|---|---|
| Inception | 150 | 15599 | 13309 | 0.81 |
| GoogLeNet | 200 | 18980 | 15948 | 1.09 |
| AlexNet | 256 | 23774 | 19902 | 1.13 |
| VGG | 120 | 11297 | 9965 | 0.83 |
| ResNet-50 | 144 | 14472 | 12280 | 0.79 |
| ResNet-101 | 128 | 14034 | 12082 | 0.85 |
| ResNet-152 | 100 | 12495 | 10963 | 0.80 |
- Why these datasets (models) were chosen: They represent a range of large, computationally intensive DNNs commonly used for image classification, allowing a robust evaluation of Olympian's ability to manage diverse GPU workloads. Many of these models (all except Inception) were originally developed in Caffe and converted to Tensorflow, demonstrating Olympian's portability across model origins.
- Data samples: The models are used for tasks like image classification, so the input data would be images; for instance, an Inception model takes an image as input and outputs a classification label. No specific example image is provided in the text.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate Olympian's performance:
5.2.1. Overhead
- Conceptual Definition: Overhead measures the additional time introduced by Olympian's scheduling mechanism compared to the baseline TF-Serving. A lower overhead indicates a more efficient scheduling system.
- Mathematical Formula: The overhead is the relative increase in finish time when jobs run on Olympian rather than on TF-Serving:
  $$O = \frac{t_{\text{Olympian}} - t_{\text{TF-Serving}}}{t_{\text{TF-Serving}}}$$
- Symbol Explanation:
  - $O$: the measured overhead, expressed as a fraction or percentage.
  - $t_{\text{Olympian}}$: the time taken for jobs to complete when scheduled by Olympian.
  - $t_{\text{TF-Serving}}$: the time taken for jobs to complete when scheduled by default TF-Serving.

  Finish times are typically measured from when a client sends a request until it receives a response.
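As a worked example (illustrative numbers, not from the paper): finish times of 51.2 s under Olympian versus 50.0 s under TF-Serving give $O = (51.2 - 50.0)/50.0 = 2.4\%$, matching the magnitude of overheads reported in Section 6.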
5.2.2. GPU Duration (per quantum)
- Conceptual Definition: This metric quantifies the actual amount of time a DNN job spends actively utilizing the GPU during a single scheduling quantum. It directly measures how effectively Olympian can enforce its quantum definition based on GPU usage.
- Mathematical Formula: The paper defines GPU duration as the total time that at least one Tensorflow node belonging to a DNN runs on the GPU (as illustrated in Figure 5), measured via offline instrumentation of Tensorflow. For a single quantum, it is the duration of GPU activity between two scheduling decisions for that job:
  $$d_{\text{quantum}} = \sum_{k=1}^{N} d_k$$
- Symbol Explanation:
  - $d_{\text{quantum}}$: the GPU duration for a single quantum.
  - $d_k$: the GPU duration of an individual Tensorflow node $k$ (intervals where several nodes run concurrently are counted once, consistent with the "at least one node" definition).
  - $N$: the number of GPU nodes executed within that quantum.

  The average GPU duration per quantum, taken over many quanta during a job's execution, is crucial for validating whether the quanta Olympian actually achieves match the $Q$ predicted by offline profiling.
5.2.3. Job Finish Times
- Conceptual Definition: This metric measures the total elapsed time from when a client sends an inference request until it receives the complete response. It directly reflects the latency experienced by the user; predictability and fairness in job finish times are primary goals of Olympian.
- Measurement: Measured as wall-clock time from request submission to response reception.
5.2.4. GPU Utilization
- Conceptual Definition: GPU utilization measures the percentage of time the GPU's computing units are actively performing computations. Higher utilization generally indicates more efficient use of GPU resources. Olympian aims to maintain high utilization while introducing time-slicing.
- Measurement: Measured using the NVIDIA System Management Interface (nvidia-smi), which periodically samples GPU utilization; the average utilization is computed over the duration of the experiment.
5.2.5. Scalability
- Conceptual Definition: Scalability refers to the maximum number of concurrent DNN clients or jobs that the serving system can support before encountering resource limitations (e.g., memory, CPU threads, GPU capacity).
- Measurement: Determined by observing the maximum number of concurrent clients that Olympian can support compared to TF-Serving.
5.3. Baselines
The primary baseline model used for comparison is the default, unmodified Tensorflow-Serving (TF-Serving) (version 1.2).
- Why it's representative: TF-Serving is a widely adopted and representative DNN model serving system. Comparing Olympian against TF-Serving directly highlights the improvements Olympian brings in predictability, fairness, and differentiated service, which default TF-Serving inherently lacks: TF-Serving relies on the underlying GPU driver for kernel scheduling, which is unaware of DNN-specific job boundaries.
5.4. Hardware and Workload Configuration
- Primary Hardware:
- CPU: Intel i7-8700 CPU
- RAM: 32 GB
- Operating System: Ubuntu 16.04 on Linux 4.4.0-112-generic
- GPU: GeForce GTX 1080 Ti
- Secondary Hardware (for portability validation):
- CPU: Intel Xeon E5-2603 v4 CPU
- RAM: 16 GB
- Operating System: Ubuntu 14.04 on Linux 4.4.0-72-generic
- GPU: NVIDIA Titan X
- Default Workload Configuration (unless specified):
- Clients: Multiple concurrent clients.
- Input Batches: Each client submits 10 batches sequentially.
- Batch Size: 100.
- DNN Models: Inception-v4 or ResNet-152. This configuration is chosen to stress concurrent execution and to simplify result interpretation under a uniform client load.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Demonstrating GPU Isolation
6.1.1.1. Homogeneous Workload
Olympian aims to provide strong GPU isolation, ensuring predictable execution times. When 10 concurrent Inception clients (with identical resource demands) are run:
- Olympian with Fair Scheduling: achieves nearly identical finish times for all jobs, which complete between 48 and 50 seconds. This demonstrates its ability to enforce fairness and predictability.
- Default TF-Serving: shows significant variability in finish times, ranging from 42 to 50 seconds, highlighting its lack of GPU isolation and predictability.

The following are the results from Figure 11 of the original paper:
Figure 11 (from the original paper): bar chart comparing finish times (seconds) per client id under TF-Serving and under Olympian fair sharing.
The granular details of Olympian's operation are revealed by examining the duration of successive scheduling intervals (quanta). As shown in Figure 12:
- The average scheduling interval duration is 1.8 ms, demonstrating Olympian's capability for fine-timescale interleaving.
- Individual scheduling intervals vary widely in duration. This is expected because quantum completion is based on cost accumulation (a job finishes a quantum when it accumulates a certain cost), and DNNs do not accumulate cost at a perfectly even rate throughout their execution: at different stages, a DNN might execute more or fewer GPU nodes concurrently.
- The overhead tolerance chosen in the profiling phase for this experiment was 2.5%, consistent with the observed low overhead. Olympian's fair scheduler effectively provides fair GPU access when averaged over longer timescales (tens or hundreds of intervals).

The following are the results from Figure 12 of the original paper:
Figure 12 (from the original paper): durations of successive scheduling intervals (x-axis: interval id; y-axis: interval duration in seconds), showing substantial variation across intervals.
6.1.1.2. Heterogeneous Workload
To evaluate Olympian's performance with diverse workloads, an experiment was conducted with 10 concurrent clients: 5 Inception jobs and 5 ResNet jobs.
- Finish Times (Figure 13): Inception jobs finish at similar times, and ResNet jobs finish at similar times, but the finish times of the two model types differ slightly. This difference arises because models like Inception and ResNet have different total execution times even for the same batch size.
- Equalizing Workloads: The authors attempted to equalize total execution times by choosing an Inception batch size of 150 (estimated to have a total run time similar to ResNet at batch size 100). Even then, finish times were not identical, because Olympian fair-shares the GPU, not the CPU, and DNNs can have varying proportions of CPU and GPU work.
- Verification of Fair GPU Sharing: To confirm Olympian's intended behavior, the average GPU duration (total time a job's nodes run on the GPU) per quantum was measured while all jobs were active (Figure 14):
  - Each job received a nearly identical share of GPU time. The average quantum duration ranged from 1084 to 1257 microseconds, with a standard deviation of 4.9% to 10.1%.
  - This closely matches the value (1190 microseconds) predicted by the offline profiler using the Overhead-Q curves (Figure 8) for these models, validating the accuracy of Olympian's profiling and its ability to enforce the predicted quanta.

The following are the results from Figure 13 of the original paper:
Figure 13 (from the original paper): finish times (seconds) per client for mixed Inception and ResNet workloads; blue bars show Inception-batch-100 with ResNet152-batch-100, red bars show Inception-batch-150 with ResNet152-batch-100.
The following are the results from Figure 14 of the original paper:
Figure 14 (from the original paper): average GPU duration per quantum (microseconds) per client under Olympian; durations are roughly equal across clients, with error bars showing variability.
- Impact of "Overflow" (Figure 15): The variance in actual quanta received can be attributed to Olympian's cooperative co-scheduling. When Olympian switches to a new job, some GPU nodes from the old job might still be executing. These nodes (typically 2-3 per context switch) interfere with the new job's nodes, and their cost is still charged to the old job's cumulated cost. This explains why some jobs have a slightly higher average quantum (if the overflowed cost is significant) or a lower quantum (if the next quantum is reduced because of overflow cost accumulated in a previous turn).

The following are the results from Figure 15 of the original paper:
Figure 15 (from the original paper): timeline schematic contrasting node execution with and without an additional node Node_x, illustrating how overflowed nodes from the previous job interleave with the new job's nodes at a switch.

- Observed Overhead: For this heterogeneous workload, the actual observed overhead was 2.4%, very close to the target overhead of 2.5% used to determine $Q$. This further confirms the profiler's accuracy.
6.1.1.3. Complex Workload
A more complex scenario involved 14 clients running 7 different DNNs (Inception, GoogLeNet, AlexNet, VGG, ResNet-50/101/152), each with a different batch size. As shown in Table 2, these DNNs vary significantly in terms of GPU/CPU node counts and runtime.
- Average GPU Duration (Figure 16): Even under this diverse workload, all DNNs achieved comparable average GPU durations. The $Q$ value chosen by the profiler was 1620 microseconds; the observed average GPU durations per client ranged from 1438 to 1662 microseconds, with standard deviations between 4.1% and 12.0%. This demonstrates Olympian's effectiveness at fair GPU sharing across highly varied tasks.
- Overhead Match: The predicted overhead for the chosen $Q$ was 2%, and the actual observed overhead was 1.8%, again confirming the profiler's predictive power.

The following are the results from Figure 16 of the original paper:
Figure 16 (from the original paper): average GPU duration (microseconds) for the 7-model workload under Olympian; most models cluster around 1500 microseconds, with error bars showing the standard deviation.
6.1.2. Other Scheduling Policies
Olympian's flexibility allows for implementing various scheduling policies:
6.1.2.1. Weighted Fair Sharing
- Experiment: 10 Inception clients were run. The first 5 clients were assigned a weight of 2, and the remaining 5 clients a weight of 1.
- Results (Figure 17): The first set of clients (weight 2) finished between 36 and 38 seconds, while the second set (weight 1) finished around 50 seconds.
- Validation: This outcome aligns with theoretical expectations. For a system with equal numbers of jobs of weight $w$ and weight 1, the higher-weight jobs are expected to finish at a fraction $\frac{w+1}{2w}$ of the total completion time.
  - For $w = 2$ (weights 2 and 1), the expected ratio is $\frac{3}{4} = 0.75$; the observed finish-time ratio (approximately 37 s / 50 s = 0.74) closely matches this.
  - For $w = 10$ (a weight ratio of 10:1), the observed ratio similarly matched the expected 55%.

This confirms Olympian's ability to provide weighted fair sharing, a capability not available in default TF-Serving.
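The ratio follows from a one-line derivation, under the idealized assumption that the GPU is work-conserving and every job needs equal work $W$: with $n$ jobs of weight $w$ and $n$ jobs of weight 1, each heavy job receives a fraction $\frac{w}{n(w+1)}$ of the GPU while all jobs are active, so

$$t_{\text{heavy}} = \frac{W\,n(w+1)}{w}, \qquad t_{\text{total}} = 2nW, \qquad \frac{t_{\text{heavy}}}{t_{\text{total}}} = \frac{w+1}{2w},$$

which gives 0.75 for $w=2$ and 0.55 for $w=10$, matching the reported observations.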
The following are the results from Figure 17 of the original paper:
Figure 17 (from the original paper): finish times (seconds) per client id under TF-Serving, Olympian 2:1 weighted fair sharing, and Olympian 10:1 weighted fair sharing.
6.1.2.2. Priority Scheduling
- Experiment 1 (Strict Priority): 10 Inception clients with strictly decreasing priorities (client 0 highest, client 9 lowest).
- Results (Figure 18, 10-level priority): Clients are effectively serialized, with higher priority jobs completing before lower priority ones can begin significant execution.
- Experiment 2 (Two-Level Priority): The first 5 clients share one higher priority, and the next 5 share a lower priority.
- Results (Figure 18, 2-level priority): The first five clients fair-share the GPU among themselves and finish at approximately 25 seconds. After they complete, the lower-priority clients are assigned the GPU and finish at roughly the same time as one another. This demonstrates Olympian's capability to implement priority scheduling, a crucial feature for service differentiation.
The following are the results from Figure 18 of the original paper:
Figure 18 (from the original paper): finish times (seconds) per client under TF-Serving (blank bars), Olympian 10-level priority (red bars), and Olympian 2-level priority (blue bars).
6.1.3. Utilization and Scaling
6.1.3.1. Utilization
- Comparison: Olympian generally has slightly lower GPU utilization than default TF-Serving.
- Measurements: For 10 Inception clients:
  - Default TF-Serving: 84.74%
  - Olympian (Fair Sharing): 78.62%
  - Olympian (Weighted Fair Sharing): 78.10%
  - Olympian (Priority Scheduling): 76.35%
- Loss: Olympian reduces utilization by 6-8%. This is because Olympian gives a single job exclusive GPU access during a quantum, forgoing opportunities for spatial multiplexing (which is already limited for large DNN batches) and idling the GPU whenever a job briefly does so.
- Priority Scheduling's Lower Utilization: Priority scheduling shows the lowest utilization because, with strict priorities, jobs are effectively serialized, minimizing the brief overlapped job execution (Figure 10) that occurs under fair or weighted fair sharing and slightly boosts utilization.
6.1.3.2. Scalability
- Memory Limit: Both TF-Serving and Olympian are limited by the GPU memory of the GeForce GTX 1080 Ti, supporting about 45 concurrent clients.
- Olympian's Thread Pool Limit: For some DNNs (e.g., Inception), Olympian's scalability can instead be limited by operating-system thread pool limits. It supports 40-60 concurrent Inception clients, whereas TF-Serving can support 100.
  - Reason: In Olympian, CPU thread gangs are alternately suspended and resumed, so threads stay allocated for longer durations. In TF-Serving, threads are always runnable and are returned to the thread pool as soon as they finish their work, allowing it to serve more concurrent clients with a smaller thread pool. Olympian therefore reaches the thread pool limit sooner.
6.1.4. Other Validations
6.1.4.1. The Importance of Profiling
- Experiment: Olympian's profiler-based time-slicing was replaced with a simple CPU timer (the quantum expires after a fixed amount of wall-clock time).
- Results (Figure 19):
  - Homogeneous Workload: CPU-timer-based scheduling resulted in unequal finish times for Inception jobs, similar to default TF-Serving's unpredictability.
  - Heterogeneous Workload: It led to widely varying GPU durations for Inception and ResNet jobs.
- Conclusion: This strongly validates Olympian's choice of careful GPU profiling and time-slicing based on the GPU cost accumulation rate, as opposed to a simplistic CPU-timer-based approach.

The following are the results from Figure 19 of the original paper:
The image consists of two bar charts. The left panel shows the finish time for clients 0-9, each close to 50 seconds; the right panel shows the average GPU duration, where client 0 is highest at 1872 microseconds and the remaining clients are markedly lower.
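To make the contrast concrete, here is a minimal sketch (an illustration under assumed names, not the paper's code) of the two quantum-expiry rules: the naive rule yields after a fixed wall-clock interval regardless of what the GPU accomplished, while the Olympian-style rule charges each executed node its offline-profiled cost and yields once the accumulated GPU cost reaches the quantum target. `run_next_node`, `node_cost_ns`, and `TARGET_GPU_NS` are hypothetical names for the example.

```python
import time

TARGET_GPU_NS = 1_800_000  # illustrative quantum: ~1.8 ms of GPU work

def cpu_timer_quantum(run_next_node, wall_ms=1.8):
    """Naive policy: yield after a fixed wall-clock interval. The GPU work
    actually completed in that window varies from run to run."""
    deadline = time.monotonic() + wall_ms / 1000.0
    while time.monotonic() < deadline:
        run_next_node()

def gpu_cost_quantum(run_next_node, node_cost_ns):
    """Olympian-style policy: charge each executed node its offline-profiled
    GPU cost and yield once the accumulated cost reaches the target."""
    accumulated = 0
    while accumulated < TARGET_GPU_NS:
        node = run_next_node()
        accumulated += node_cost_ns[node]  # offline-profiled cost lookup

# Example: each node costs 600 us, so one quantum admits exactly three nodes,
# no matter how fast or slow the CPU happens to be that run.
costs = {"conv1": 600_000, "conv2": 600_000, "fc": 600_000}
order = iter(["conv1", "conv2", "fc", "conv1"])
gpu_cost_quantum(lambda: next(order), costs)
```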
6.1.4.2. Portability
- Experiment: The fair sharing experiment (10 Inception clients) was re-run on a different hardware platform (Intel Xeon E5-2603 v4 CPU, 16 GB RAM, NVIDIA Titan X GPU).
- Results (Figure 21): Olympian achieved performance similar to the original fair sharing experiment. The total finish times differed because of the different hardware, but the fairness characteristic was preserved.
- Conclusion: Olympian demonstrated high portability, requiring no modifications to run on a different GPU platform. This is attributed to inheriting framework and device independence from Tensorflow and TF-Serving.

The following are the results from Figure 21 of the original paper:
The image is a bar chart showing Olympian's completion times on the alternative hardware platform. The x-axis is the client ID and the y-axis is the completion time (seconds); all bars are of similar height, indicating consistent latency across the concurrent client requests.
6.1.4.3. Cost and Duration Stability
Olympian's profiler relies on the stability of DNN costs and GPU durations.
- Validation: An Inception model (batch size 100) was run 100 times on TF-Serving to measure stability.
- Results:
  - Total Cost: mean = 4,058,477 ns, standard deviation = 100,536 ns.
  - GPU Duration: mean = 262,773 ns, standard deviation = 4,462 ns.
- Conclusion: The very low standard deviations relative to the means confirm that both the total cost and the GPU duration are highly stable across executions, validating a core assumption of Olympian's offline profiling (see the worked ratios below).
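Expressed as coefficients of variation (standard deviation divided by mean), the reported numbers work out to:

$$
\mathrm{CV}_{\text{total cost}} = \frac{100{,}536}{4{,}058{,}477} \approx 2.5\%,
\qquad
\mathrm{CV}_{\text{GPU duration}} = \frac{4{,}462}{262{,}773} \approx 1.7\%
$$

Both are comfortably small, which is what makes one-time offline profiling viable in place of continuous online measurement.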
6.1.4.4. Linear Models for Node Costs
- Strategy: Instead of profiling every batch size, Olympian profiles a few common batch sizes and uses linear regression to estimate node costs for other batch sizes.
- Experiment: 10 Inception clients were run with 3 different batch sizes (25, 75, 150). For each, node costs were derived from a linear model trained on profiles of two batch sizes (50 and 100).
- Results (Figure 20): In all cases, Olympian provided fairness in completion times comparable to that seen in Figure 11 (the homogeneous workload with direct profiling).
- Conclusion: This demonstrates that linear models are effective for generalizing node costs across batch sizes, further reducing the offline profiling effort without sacrificing performance. (A minimal fitting sketch follows Figure 20 below.)

The following are the results from Figure 20 of the original paper:
The image is a bar chart showing the finish time (seconds) per client ID for three batch sizes (labeled Batch-005, Batch-025, and Batch-150); within each batch-size group the completion times are closely clustered.
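A minimal sketch of this interpolation strategy, using made-up per-node costs (the real numbers come from Olympian's offline profiler): fit cost(b) = a·b + c for a node from two profiled batch sizes and predict the cost at unprofiled batch sizes. Requires numpy.

```python
import numpy as np

# Illustrative profiled costs (ns) for one graph node at batch sizes 50 and 100.
profiled = {50: 210_000, 100: 405_000}

# Fit a degree-1 polynomial cost(b) = a*b + c through the profiled points.
b = np.array(sorted(profiled))
y = np.array([profiled[k] for k in sorted(profiled)])
a, c = np.polyfit(b, y, deg=1)

# Estimate node costs for batch sizes that were never profiled directly.
for batch in (25, 75, 150):
    print(f"batch {batch:3d}: estimated cost = {a * batch + c:,.0f} ns")
```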
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Section | Finding |
|---|---|
| §4.1 | Even under a diverse workload containing 7 DNNs with different batch sizes, Olympian can achieve the predicted Q while keeping overheads under 2% |
| §4.2 | Olympian can support priority scheduling and weighted fair sharing |
| §4.3 | Olympian sacrifices GPU utilization by less than 8% and scales similarly to TF-Serving for some DNNs |
| §4.4 | Olympian is portable to other GPUs, its cost estimates are stable across runs, and its offline profiler is crucial |
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Olympian, a novel system designed to improve GPU scheduling in Deep Neural Network (DNN) model serving systems like Tensorflow-Serving (TF-Serving). The core innovation lies in implementing middleware-level time-slicing, where DNN jobs are given exclusive access to the GPU for short, carefully determined quanta. Olympian achieves this by leveraging the predictable nature of DNN computations through offline profiling to estimate GPU resource usage (costs and durations) and determine quantum sizes. It employs cooperative co-scheduling of CPU thread gangs to switch between DNNs with minimal overhead.
The evaluation demonstrates Olympian's effectiveness:
- It significantly improves the predictability and fairness of DNN job execution compared to default TF-Serving, even for heterogeneous and complex workloads.
- It can interleave DNNs at millisecond timescales (around a 1.8 ms average quantum) while incurring very low overhead (typically less than 2%).
- It successfully supports advanced scheduling policies such as fair sharing, weighted fair sharing, and priority scheduling, enabling the service differentiation crucial for cloud offerings.
- Olympian maintains high GPU utilization (sacrificing less than 8%) and is portable across different GPU hardware platforms and DNN frameworks without modification. Its reliance on offline profiling and the stability of DNN costs and durations are key to its performance.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Number of Models and GPUs: Future work could expand Olympian to serve more DNN models and to support multiple GPUs within a single server; the current work schedules DNN jobs on a single GPU.
- Realistic Workloads: The current evaluation primarily uses homogeneous or controlled heterogeneous workloads. Future work should explore Olympian's performance under more realistic and dynamic workloads (e.g., varying arrival rates, job characteristics that change over time).
- Scheduling Policies: The set of supported scheduling policies could be expanded to include more sophisticated or domain-specific policies.
- Exploiting DNN Dependencies: Investigating how to exploit dependencies between DNNs (where they exist) could further improve GPU utilization.
- Power Measurement: The paper notes that power usage is an important metric that was not evaluated, suggesting it as future work.
7.3. Personal Insights & Critique
Olympian presents an elegant and practical solution to a critical problem in DNN model serving: the lack of predictability and control over GPU resource allocation.
- Innovation: The core innovation of performing DNN-aware time-slicing at the middleware layer by leveraging offline profiling is particularly insightful. It cleverly sidesteps the high overhead of hardware preemption and the lack of DNN context at lower levels. This "sweet spot" in the software stack allows for both fine-grained control and high portability. Defining a quantum in terms of GPU duration (derived from the cost accumulation rate) is also a smart design choice, making the scheduling truly GPU-centric rather than merely time-centric.
- Applicability: The methods and conclusions transfer readily to any model serving system built on data flow graph frameworks where DNN execution is predictable. This could extend beyond Tensorflow to other frameworks if similar cost models or profiling APIs are available. The concept of offline profiling for resource modeling could also be applied to other types of predictable, accelerator-bound workloads.
model serving systembuilt ondata flow graphframeworks whereDNN executionis predictable. This could extend beyondTensorflowto other frameworks if similarcost modelsorprofiling APIsare available. The concept ofoffline profilingforresource modelingcould also be applied to other types of predictable,accelerator-boundworkloads. - Potential Issues/Areas for Improvement:
-
Predictability of DNNs: While the paper validates the stability of
DNN costsfor the explored models, this assumption might not hold universally for all futureDNN architecturesor highly dynamic models (e.g., those with adaptive computation or sparse activation patterns). Continuous monitoring or adaptive re-profiling might be needed in such cases. -
"Overflow" Implications: Although deemed minimal, the
cooperative co-scheduling's"overflow" effect (Figure 10) meansOlympiandoesn't provide perfect hardGPU isolation. For extremelylatency-sensitiveorsecurity-criticalapplications requiring absoluteresource guarantees, this slight interference might be a concern. A more detailed analysis of the worst-caseoverflowand its impact could be beneficial. -
Thread Pool Limit: The observation that
Olympianhitsoperating system thread poollimits faster thanTF-Servingfor someDNNsis an interesting scalability bottleneck. Future work could explore optimizations tothread managementwithinOlympian(e.g., more aggressivethread pooling,event-drivenmodels) to mitigate this. -
Dynamic Workloads and Hot-swapping: The
offline profilingapproach is efficient for static or slowly changing workloads. For scenarios with frequentmodel updates,A/B testing, orcold startsof newDNNs, theprofilingprocess needs to be integrated efficiently (e.g., in aCI/CD pipeline) to avoid introducing delays. -
Multi-GPU Scheduling: The paper explicitly defers
multi-GPU schedulingto future work. ExtendingOlympianto orchestrateDNNsacross multipleGPUswhile preserving its core principles offairnessandpredictabilitywould be a significant next step. This would involve complex decisions aboutjob placementandinter-GPU communication. -
Power Consumption: As noted by the authors, investigating the
power overheadwould provide a more complete picture ofOlympian'sefficiency, especially in energy-conscious environments.Overall,
Olympianrepresents a robust and well-thought-out contribution to the field, offering a practical path to managingGPU resourceseffectively and predictably in the context ofDNN serving.
-