Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU
TL;DR Summary
Miriam is a task coordination framework for multi-DNN inference on edge GPUs, integrating an elastic-kernel generator and a runtime kernel coordinator. It boosts system throughput by 92% while incurring less than 10% latency overhead for critical tasks.
Abstract
Many applications, such as autonomous driving and augmented reality, require the concurrent running of multiple deep neural networks (DNNs) that pose different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack hardware-level resource management mechanisms for avoiding resource contention. Therefore, we propose Miriam, a contention-aware task coordination framework for multi-DNN inference on edge GPUs. Miriam consolidates two main components, an elastic-kernel generator and a runtime dynamic kernel coordinator, to support mixed-critical DNN inference. To evaluate Miriam, we build a new DNN inference benchmark based on CUDA with diverse representative DNN workloads. Experiments on two edge GPU platforms show that Miriam can increase system throughput by 92% while incurring less than 10% latency overhead for critical tasks, compared to state-of-the-art baselines.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU".
1.2. Authors
The authors are:

- Zhihe Zhao
- Neiwen Ling
- Nan Guan
- Guoliang Xing

Their affiliations are:

- The Chinese University of Hong Kong (Zhihe Zhao, Neiwen Ling, Guoliang Xing)
- City University of Hong Kong (Nan Guan)
1.3. Journal/Conference
The paper was published on arXiv, a preprint server. While not a peer-reviewed journal or conference proceeding itself, arXiv is widely used in academic communities for the rapid dissemination of research findings before formal publication. It has a strong reputation for sharing cutting-edge research, but this work had not undergone formal peer review at the time of posting.
1.4. Publication Year
The paper was posted on July 10, 2023 (arXiv timestamp 2023-07-10T04:30:44 UTC).
1.5. Abstract
This paper addresses the challenge of coordinating multiple deep neural network (DNN) inference tasks with varying real-time requirements on resource-limited edge GPUs. Unlike server-level GPUs, edge GPUs lack hardware-level resource management mechanisms, leading to significant resource contention when running multiple DNNs concurrently. To solve this, the authors propose Miriam, a contention-aware task coordination framework. Miriam consists of two main components: an elastic-kernel generator and a runtime dynamic kernel coordinator, designed to support mixed-critical DNN inference. The framework enables fine-grained control over GPU resources by transforming traditional kernels into elastic kernels that can be dynamically scheduled. To evaluate Miriam, the authors created a new CUDA-based DNN inference benchmark with diverse workloads. Experiments on two edge GPU platforms demonstrate that Miriam can significantly increase system throughput by 92% while maintaining low latency overhead (less than 10%) for critical tasks compared to state-of-the-art baselines.
1.6. Original Source Link
- Official Source Link: https://arxiv.org/abs/2307.04339
- PDF Link: https://arxiv.org/pdf/2307.04339v1.pdf

This paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
Deep learning (DL) applications, such as autonomous driving and augmented reality, are increasingly deployed on edge devices. These applications often require the concurrent execution of multiple deep neural network (DNN) tasks, each potentially having different real-time performance requirements (i.e., some tasks are critical and must meet strict deadlines, while others are normal and can run in a best-effort manner).
The core problem the paper aims to solve is the efficient coordination of these multi-DNN inference tasks on edge GPUs. This problem is important because:
- Resource Limitation: Edge GPUs are inherently resource-constrained compared to server-level GPUs.
- Lack of Hardware Management: Unlike server GPUs (which may offer NVIDIA Multi-Process Service (MPS) or Multi-Instance GPU (MIG)), edge GPUs typically lack sophisticated hardware-level resource management mechanisms to prevent or mitigate resource contention among concurrently running tasks.
- Conflicting Objectives: There is a fundamental conflict between prioritizing latency-critical tasks (to guarantee their real-time requirements) and maximizing overall throughput (by concurrently executing all tasks in a best-effort manner). Dedicating the entire GPU to critical tasks leads to low GPU utilization, while running everything concurrently without coordination can cause unacceptable latency for critical tasks.

Prior research often struggles with these constraints, either requiring unavailable hardware support, modifications to closed-source GPU drivers, or incurring high overheads for offline profiling. The paper's innovative idea, or entry point, is to leverage an elastic kernel abstraction that allows for more fine-grained resource mapping on the GPU, enabling dynamic scheduling and remapping of resources based on task priority and criticality. This aims to maximize GPU utilization without causing significant resource contention for critical tasks.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Miriam Framework: Proposing Miriam, a contention-aware task coordination framework specifically designed for multi-DNN inference on edge GPUs. The framework addresses both intra-SM (Streaming Multiprocessor) and inter-SM resource contention.
- Elastic-Kernel Generator: Introducing an elastic-kernel generator that transforms original GPU kernels into elastic kernels with adjustable grid size and block size, allowing fine-grained control over GPU resource usage. Key components include an elastic grid/block generator and a source-to-source kernel transformer that preserves computational consistency during transformation.
- Runtime Dynamic Kernel Coordinator: Developing a runtime dynamic kernel coordinator that dynamically schedules elastic kernels and critical kernels. It employs a greedy scheduling policy and a dynamic-sized shaded binary tree structure for forming elastic kernel shards, optimizing resource utilization and minimizing interference.
- MDTB Benchmark: Building a new CUDA-based DNN inference benchmark (MDTB) featuring diverse, representative DNN workloads and different task priority settings for edge GPUs.
- Experimental Validation: Demonstrating through experiments on two edge GPU platforms (NVIDIA RTX 2060 and NVIDIA Jetson AGX Xavier) that Miriam can:
  - Increase system throughput by up to 92% compared to state-of-the-art baselines.
  - Incur less than 10% latency overhead for critical tasks, effectively balancing the conflicting objectives of high throughput and low critical-task latency.
  - Achieve higher achieved occupancy (a measure of GPU utilization) than the baselines.

These findings solve the problem of efficiently coordinating mixed-critical multi-DNN inference on resource-limited edge GPUs by enabling adaptive resource sharing and contention mitigation without requiring specialized hardware support or compromising critical task performance.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Miriam, a novice reader should be familiar with the following core concepts:
- Deep Neural Network (DNN) Inference: This is the process of using a pre-trained DNN model to make predictions or classify new, unseen data. In the context of edge devices, this typically involves feeding sensor data (e.g., images from a camera, lidar data) through the DNN to get an output (e.g., object detection, pose estimation).
- Edge GPU (Graphics Processing Unit): A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Edge GPUs are specifically designed for deployment on edge devices (e.g., autonomous vehicles, drones, IoT devices), which are characterized by limited computational resources and power budgets and often lack the advanced features (like hardware virtualization) found in data center GPUs.
- Real-time Systems / Mixed Criticality:
  - Real-time System: A system that must respond to events within a specified time constraint. Failure to do so is considered a system failure.
  - Mixed Criticality: Refers to systems where tasks with different levels of criticality (i.e., importance and strictness of deadlines) run on the same platform. In this paper, critical tasks have strict real-time requirements (e.g., obstacle detection in autonomous driving), while normal tasks can be executed in a best-effort manner (e.g., driver fatigue monitoring).
- CUDA Programming Model: NVIDIA's parallel computing platform and application programming interface (API) that allows software developers to use a GPU for general-purpose processing (a minimal example follows after this list).
  - GPU Architecture (NVIDIA Pascal/Ampere): NVIDIA GPUs are composed of multiple Streaming Multiprocessors (SMs). Each SM contains multiple CUDA cores (processing units), shared memory, and registers.
  - GPU Kernels: A kernel is a function executed on the GPU. When the CPU (host) invokes a kernel, the GPU (device) executes it in parallel using many threads.
  - Threads: The smallest unit of execution on a GPU; threads execute instructions in parallel.
  - Thread Blocks: A group of threads that execute concurrently on a single SM. Threads within the same block can communicate efficiently through shared memory and can synchronize.
  - Grids: A collection of thread blocks organized in a multi-dimensional array. A grid defines the overall structure of the kernel's computation.
  - GPU Streams: A sequence of GPU operations (e.g., kernel launches, memory copies) executed in order. Operations in different streams can run concurrently, enabling asynchronous execution and overlapping computation with data transfers.
  - SM Occupancy: A measure of how well an SM is utilized. It is the ratio of the number of active warps (groups of 32 threads that execute instructions in lockstep) on an SM to the maximum number of active warps supported by that SM. Higher occupancy generally means better GPU utilization.
- Resource Contention: When multiple DNN tasks or GPU kernels try to use the same limited hardware resources simultaneously, they compete for access. This competition is called resource contention.
  - Intra-SM Contention: Competition for resources within a single Streaming Multiprocessor (SM). This can occur when multiple thread blocks (even from different kernels) are dispatched to the same SM and compete for shared memory, registers, or execution units.
  - Inter-SM Contention: Competition for resources among different Streaming Multiprocessors (SMs). This typically involves shared global resources such as global memory bandwidth, the L2 cache, or memory controllers. It can also refer to contention for dispatching thread blocks to available SMs.
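As a concrete illustration of these CUDA concepts (a minimal sketch with illustrative names, not code from the paper), the snippet below launches a trivial kernel on two streams; the `<<<grid, block, 0, stream>>>` launch syntax is where grid size, block size, and the stream come together:

```cuda
#include <cuda_runtime.h>

// A trivial kernel: each thread writes its own global index into the output array.
__global__ void fill_index(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) out[i] = i;
}

int main() {
    const int n = 1 << 20;
    int* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    // Two streams: kernels issued on different streams may execute concurrently.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    dim3 block(256);                          // threads per thread block
    dim3 grid((n + block.x - 1) / block.x);   // thread blocks in the grid

    fill_index<<<grid, block, 0, s0>>>(d_out, n);  // launch on stream 0
    fill_index<<<grid, block, 0, s1>>>(d_out, n);  // launch on stream 1

    cudaDeviceSynchronize();                  // wait for both streams to finish
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_out);
    return 0;
}
```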
3.2. Previous Works
The paper discusses several categories of prior approaches for multi-DNN inference on edge devices:
- Joint DNN Model Compression: Methods like NestDNN [8] sacrifice a small amount of accuracy in individual DNN models to reduce their computational cost, allowing more models to fit on resource-limited devices.
  - Differentiation: Miriam does not compromise on accuracy and is an orthogonal approach, meaning it can potentially be combined with model compression.
- New Compiling Techniques: Veltair [24] generates multiple versions of compiled DNN models with different resource contention intensities that can be selected at runtime for scheduling.
  - Differentiation: These methods often lead to high storage overhead and require time-consuming offline profiling, making them difficult to scale. Miriam aims for dynamic, runtime adaptation without such heavy offline costs.
- Operator Interleaving: Systems like DeepEye [25], Abacus [6], and Dart [35] interleave DNN operators based on their contention channels (e.g., memory-bound vs. compute-bound operations) to improve efficiency.
  - Differentiation: These methods require extensive offline profiling and are hard to generalize to new DNN tasks. Miriam focuses on a more general kernel-level coordination approach.
- Kernel-level Preemption: REEF [11] achieves kernel-level preemption for critical tasks to ensure their responsiveness.
  - Differentiation: REEF requires modifications to the GPU driver library, which is often impractical for the closed-source devices common in the edge ecosystem. Miriam avoids driver modifications.
- Resource Contention Solutions with Different Settings: Heimdall [38] and Band [19] also address resource contention for multi-DNN inference, but in different application contexts (Augmented Reality for Heimdall, heterogeneous mobile processors for Band).
  - Differentiation: While related, their specific focuses and underlying assumptions differ from Miriam's general real-time mixed-criticality problem on edge GPUs.
- Performance Modeling & Prediction: Warped-Slicer [36] uses performance vs. computing-unit-occupancy curves to select optimized simultaneous kernel patterns. HSM [40] and [31] model the latency degradation of concurrent GPU kernel executions based on hardware information.
  - Differentiation: Warped-Slicer does not fully address resource contention between kernels. The performance predictors built in these works are often difficult to adapt to real-world multi-DNN inference scenarios due to non-deterministic kernel overlapping and the lack of runtime event support on edge GPUs.
- Resource Management (Space-multiplexing): Smcentric [33] and Effisha [4] tackle GPU multitasking from a resource management perspective using space-multiplexing (e.g., partitioning SMs among tasks) [18, 34].
  - Differentiation: These approaches are orthogonal to Miriam's. Miriam focuses on time-multiplexing and fine-grained kernel-level coordination rather than static space partitioning.
3.3. Technological Evolution
The field of DNN inference on edge devices has evolved from initial attempts at running single DNNs to now demanding multi-DNN workloads with diverse real-time requirements. Early solutions focused on model compression or quantization to fit models within edge device memory and computational limits. As DNN applications grew more complex (e.g., multiple perception modules in autonomous driving), the need for concurrent execution became paramount.
Initially, simple sequential execution was used, guaranteeing critical task latency but sacrificing throughput. Then, multi-stream approaches allowed for basic parallelism, but without sophisticated contention management, critical task latency suffered. Server-grade GPUs introduced hardware features like MIG and MPS to address these multi-tenant challenges, but edge GPUs often lack these due to architectural differences and cost considerations.
This paper's work (Miriam) fits within the current technological timeline by addressing the specific gap of contention-aware real-time multi-DNN inference on resource-limited edge GPUs that lack server-grade hardware management. It builds upon the understanding of GPU resource contention (intra-SM and inter-SM) and leverages software-based kernel transformation and dynamic coordination to bridge this gap, pushing the boundaries of what's achievable without driver modifications or specialized hardware support.
3.4. Differentiation Analysis
Compared to the main methods in related work, Miriam offers several core differences and innovations:
- No Accuracy Compromise: Unlike joint DNN model compression techniques, Miriam focuses purely on scheduling and resource management without altering the DNN models themselves, thus preserving the original inference accuracy.
- Avoidance of Hardware/Driver Modifications: Unlike REEF, which requires GPU driver modifications, or approaches relying on NVIDIA MPS/MIG (unavailable on edge GPUs), Miriam operates at the CUDA API level, making it widely applicable to various edge GPU platforms.
- Dynamic, Runtime Adaptation: Miriam provides runtime dynamic kernel coordination, which is crucial for real-time mixed-criticality scenarios. This contrasts with methods that rely heavily on time-consuming offline profiling (Veltair, DeepEye, Abacus, Dart, Warped-Slicer), which struggle to adapt to dynamic workloads and new DNN tasks.
- Fine-grained Kernel-level Control: By introducing the concept of elastic kernels with adjustable grid and block sizes, Miriam achieves much finer-grained control over GPU resources (both intra-SM and inter-SM) than coarser-grained operator interleaving or space-multiplexing approaches.
- Contention-aware Scheduling: Miriam explicitly identifies and addresses intra-SM and inter-SM contention as primary causes of latency degradation. Its elastic kernel design and dynamic coordinator are specifically tailored to mitigate these types of contention, a focus that is often less explicit or less comprehensively addressed in other general GPU multitasking solutions.
- Computational Consistency: The source-to-source kernel transformer ensures that modifying grid and block sizes (which can easily lead to computation errors) does not compromise the correctness of the DNN inference, a challenge explicitly acknowledged and solved by Miriam.
4. Methodology
4.1. Principles
The core idea behind Miriam is to introduce an elastic kernel abstraction that allows for fine-grained control over GPU resources, specifically grid size and block size. This elasticity enables Miriam to dynamically adjust resource allocation based on the priority and criticality of DNN tasks, thereby mitigating resource contention (both intra-SM and inter-SM) while maximizing GPU utilization. The theoretical basis is that by making kernels "elastic," they can be dynamically scheduled and remapped to different GPU resources in a way that prioritizes critical tasks without causing significant interference, allowing normal tasks to efficiently use leftover resources.
The intuition is to break down rigid GPU kernel execution into smaller, more flexible units (elastic blocks, elastic grids, and elastic kernel shards). This allows the system to intelligently "bin-pack" these flexible units around critical kernels, filling idle SM resources and improving overall throughput without delaying critical tasks.
4.2. Core Methodology In-depth (Layer by Layer)
Miriam is a compiler-runtime synergistic framework with two main components: Offline Elastic Kernel Generation and Online Kernel Coordination.
4.2.1. Offline Elastic Kernel Generation
This component is responsible for transforming standard GPU kernels into elastic kernels that allow for adjustable resource usage patterns.
4.2.1.1. Controllable Intra-SM Resource by Elastic Block
Intra-SM contention occurs when multiple thread blocks on the same SM compete for resources like registers, shared memory, and execution units. To manage this, Miriam introduces elastic blocks.
The concept of elastic block is achieved by adjusting the number of threads within a thread block. Instead of static block sizes, elastic blocks can vary their thread count. This is implemented using the persistent thread technique [10], where threads remain active throughout the kernel function's execution, allowing for dynamic adjustment of the resident block size on an SM.
The key transformation here is to change the default 1:1 logical-to-physical thread mapping scheme to an N:1 mapping, meaning that N logical computation units are mapped to one physical GPU thread, which helps control intra-SM parallelism and resource usage. Miriam limits the range of each elastic block size to be between 1 and the maximum resident block size. The design ensures that the memory occupancy of an elastic kernel is either equal to or less than that of the original kernel, staying within shared memory limits.
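To make the N:1 mapping concrete, here is a rough CUDA sketch (illustrative only, not the paper's implementation) of a persistent-thread kernel whose resident block size is chosen at launch time; each physical thread loops over several logical threads, so a smaller block consumes fewer intra-SM resources while still covering all the work:

```cuda
// Hypothetical elastic-block kernel: valid for any block size between 1 and the
// maximum resident block size; each physical thread serves N logical threads.
__global__ void relu_elastic(const float* in, float* out, int numLogicalThreads) {
    int physical = blockIdx.x * blockDim.x + threadIdx.x;
    int stride   = gridDim.x * blockDim.x;   // total physical threads launched
    // Persistent loop: N:1 logical-to-physical mapping instead of the default 1:1.
    for (int logical = physical; logical < numLogicalThreads; logical += stride) {
        out[logical] = fmaxf(in[logical], 0.0f);
    }
}
```

A launch such as `relu_elastic<<<numBlocks, elasticBlockSize>>>(d_in, d_out, n)` can then pick `elasticBlockSize` at dispatch time from the candidates generated offline.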
4.2.1.2. Elastic Grid for Inter-SM Contention
Inter-SM contention primarily involves competition for global memory and memory controllers when many thread blocks across different SMs are launched rapidly. To address this, Miriam uses an elastic grid generator that slices the initial grid into multiple smaller grids.
Given an original kernel $K$ with $N$ thread blocks in total, a slicing plan $P(K)$ divides $N$ into a sequence of slices at thread-block granularity. This allows the smaller, sliced grids to be scheduled independently by the GPU, enabling interleaving with other critical kernels and improving time-multiplexing.

The paper provides a dichotomy-algorithm-based slicing plan for a DNN kernel with $N$ thread blocks:

$$S(K) = \left\{ \frac{N}{2^{m}},\ \frac{N}{2^{m-1}},\ \ldots,\ \frac{N}{2},\ N \right\}$$

Where:

- $S(K)$ represents the sequence of possible slicing schemes for kernel $K$.
- $N$ is the total number of thread blocks in the original kernel.
- $m$ is the largest power index of 2 such that $N$ is divisible by $2^{m}$. This ensures that the kernel can be recursively divided into halves until the smallest possible slice of $N/2^{m}$ thread blocks is reached; the sequence includes all such divisions up to the original $N$ blocks.

By using elastic grids, normal kernels can be issued with a flexible number of thread blocks on SMs, co-locating with critical kernels and reducing inter-SM memory contention.
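A minimal host-side sketch of grid slicing (assumed names, not the paper's code): each slice is launched as an independent sub-grid that carries a block offset, so slices of a normal kernel can be interleaved with critical kernels on the same GPU:

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// Hypothetical sliced kernel: a block offset maps each launched block back to its
// position in the original grid, so a slice computes exactly its share of the work.
__global__ void op_slice(float* data, int n, int blockOffset) {
    int logicalBlock = blockIdx.x + blockOffset;       // block index in the original grid
    int i = logicalBlock * blockDim.x + threadIdx.x;   // original global thread index
    if (i < n) data[i] *= 2.0f;                        // stand-in for the real operator body
}

// Launch one slicing scheme from S(K): sub-grids of `sliceBlocks` thread blocks each.
void launch_sliced(float* data, int n, int totalBlocks, int sliceBlocks,
                   dim3 block, cudaStream_t stream) {
    for (int offset = 0; offset < totalBlocks; offset += sliceBlocks) {
        int blocksThisSlice = std::min(sliceBlocks, totalBlocks - offset);
        op_slice<<<blocksThisSlice, block, 0, stream>>>(data, n, offset);
    }
}
```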
4.2.1.3. Workload-balanced-guided Design Space Shrinking
The design space for elastic kernels (combinations of grid size and block size) is enormous. To make runtime selection feasible, Miriam prunes this space by eliminating configurations that are unlikely to perform well or that would lead to dispatch failures. This is done using two main constraints and corresponding metrics:

- SM-level Workload Imbalance Constraint: The first constraint addresses unbalanced workload across SMs when the number of thread blocks is not a multiple of the number of SMs. It prunes cases where the elastic kernel's thread blocks would exceed the SMs left available after critical kernels have been dispatched. The second constraint addresses intra-SM workload balance, ensuring that each SM is utilized effectively without being either too light (wasted resources) or too heavy (resource contention). It prunes cases where the working threads of an elastic kernel would overly exceed the spare intra-SM resources left over by resident critical kernel blocks. In terms of the quantities below, the two constraints can be written as:

  $$b_{e} + b_{c} \le N_{SM}, \qquad t_{e} + t_{c} \le T_{\max}$$

  Where:
  - $b_{e}$: number of thread blocks in a dispatched elastic normal kernel.
  - $N_{SM}$: number of streaming multiprocessors on the GPU.
  - $b_{c}$: number of thread blocks in a dispatched critical kernel.
  - $t_{e}$: number of working threads of each thread block in a dispatched elastic normal kernel.
  - $T_{\max}$: limit on the number of working threads (maximum thread capacity per SM).
  - $t_{c}$: number of threads of each thread block in a dispatched critical kernel.

  On top of these quantities, a workload imbalance metric called WIScore is defined. WIScore ranges over [0, 1], and a higher score indicates better workload balance and utilization. One term in the paper's WIScore formula appears to be a typo and is likely intended to represent the combined thread usage of the elastic and critical blocks.

- Dispatch Overhead Constraint: This constraint prunes candidates that incur excessive dispatch overhead. The OScore is used to filter such cases: it is 1 if the summed launch overhead of all elastic kernel fragments, both the thread-block-related part and the persistent-thread-related part, stays below predefined maximum thresholds, and 0 otherwise:

  $$OScore = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} LO_{blk}(f_{i}) \le LO_{blk}^{\max} \ \text{and} \ \sum_{i=1}^{n} LO_{pt}(f_{i}) \le LO_{pt}^{\max} \\ 0, & \text{otherwise} \end{cases}$$

  Where:
  - $LO(\cdot)$ represents the launch overhead for an elastic kernel fragment (a shard), with $LO_{blk}$ the thread-block-related part and $LO_{pt}$ the persistent-thread-related part.
  - $f_{i}$: an elastic kernel fragment.
  - $LO_{blk}^{\max}$: maximum acceptable launch overhead for thread blocks.
  - $LO_{pt}^{\max}$: maximum acceptable launch overhead for persistent threads.
  - $n$: total number of elastic kernel shards.

The combined metric (WIScore × OScore) helps identify promising elastic kernel candidates. Miriam then selects the top 20% of these candidates (ranked by this product) for runtime kernel coordination, reducing the design space without sacrificing performance.
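A rough host-side sketch of how such pruning could be applied (the constraint forms follow the descriptions above; the names and the omission of the WIScore/OScore details are simplifications, not the paper's exact formulas):

```cuda
// Hypothetical filter over elastic-kernel configurations.
struct ElasticCandidate {
    int gridBlocks;       // thread blocks of the elastic (normal) kernel
    int threadsPerBlock;  // working threads per elastic block
};

bool satisfies_constraints(const ElasticCandidate& c,
                           int numSMs, int criticalBlocks, int criticalThreads,
                           int maxThreadsPerSM) {
    // Inter-SM balance: elastic blocks must fit into the SMs left by the critical kernel.
    if (c.gridBlocks + criticalBlocks > numSMs) return false;
    // Intra-SM balance: elastic threads must fit into the spare per-SM thread capacity.
    if (c.threadsPerBlock + criticalThreads > maxThreadsPerSM) return false;
    return true;
}
```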
4.2.1.4. Source-to-Source Elastic Kernel Transformer
Directly modifying grid and block sizes in CUDA kernels often leads to computation errors because these parameters are intrinsically linked to the kernel's computation schedule and how logical threads are mapped to physical GPU threads. This constraint is illustrated by Figure 6.
This figure is a schematic showing the relationship between physical and logical threads, and how matrix multiplication is optimized through shared memory and registers. It highlights the concept of Tile size ≜ thread-block (TB) size and shows the layout of the different memory hierarchy levels.
Figure 6: Grid/Block size cannot be directly modified in case of recomputation/computation error.
To overcome this, Miriam employs a source-to-source kernel transformer. The transformer inserts a piece of code at the beginning of each kernel. This injected code:

- Computes a global thread identifier: This identifier acts as a basis for SM-level workload distribution. It takes the thread ID as input and produces the corresponding index of the data element accessed by the thread.
- Replaces physical thread references with logical equivalents: Variables like GridDim (total grid dimensions) and threadIdx.x (thread index within a block) in the original kernel code are replaced with their logical equivalents. This ensures that even when the underlying grid or block structure changes, the kernel still accesses the intended data elements and performs the correct computations.

The transformer uses two approaches for implementing the index function:

- Computation-based: The index is computed within the kernel when the thread accesses a data element.
- Memory-based: Indices are pre-calculated on the CPU (host) before kernel launch and stored in shared memory for faster access during kernel execution.

This transformation guarantees the consistency of computation results despite dynamic adjustments to grid and block sizes, making the elastic kernel design viable.
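The sketch below illustrates the computation-based variant of this idea (simplified, with illustrative names rather than the paper's generated code): an injected prologue derives logical blockIdx/threadIdx values from a global logical ID, so the original kernel body still touches the right data even when the physical grid and block are shrunk or sliced:

```cuda
// Before: original kernel, where physical and logical thread indices coincide (1:1).
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// After: transformed elastic kernel. The prologue iterates over logical thread IDs
// and recomputes logical block/thread indices; references to blockIdx.x/threadIdx.x
// in the body are replaced with these logical equivalents.
__global__ void scale_elastic(float* x, float a, int n,
                              int logicalGridDim, int logicalBlockDim) {
    int physicalId    = blockIdx.x * blockDim.x + threadIdx.x;
    int physicalTotal = gridDim.x * blockDim.x;
    int logicalTotal  = logicalGridDim * logicalBlockDim;
    for (int id = physicalId; id < logicalTotal; id += physicalTotal) {
        int logicalBlockIdx  = id / logicalBlockDim;   // replaces blockIdx.x
        int logicalThreadIdx = id % logicalBlockDim;   // replaces threadIdx.x
        int i = logicalBlockIdx * logicalBlockDim + logicalThreadIdx;
        if (i < n) x[i] *= a;
    }
}
```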
4.2.2. Runtime Dynamic Kernel Coordination
This component is the online scheduler that manages the execution of elastic and critical kernels at runtime.
4.2.2.1. Execution Timeline and Scheduling Policy
When Miriam receives multiple normal task requests, their kernels are initially placed into a normal tasks queue and dispatched via multiple streams to the GPU.
However, upon the arrival of a critical task:

- Miriam immediately monitors the available GPU resources (accounting for both the critical kernel and already running elastic kernels).
- It then selects appropriate elastic kernel fragments from the subsequent normal kernel in a "bin-packing" manner, fitting the fragments into the GPU resource slots left by the critical kernel without causing interference.
- Once the critical kernels complete execution, the kernels from normal tasks (including any remaining elastic kernel fragments) can fully re-occupy the GPU.

This greedy scheduling policy dynamically balances resource allocation. When an elastic kernel partially overlaps with a critical kernel, the coordinator ensures that the padded elastic kernel does not interfere with the critical kernel while still maximizing resource usage. When an elastic kernel runs alone, it can utilize all available resources.
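A highly simplified host-side sketch of this bin-packing step (assumed data structures and resource accounting; the actual coordinator operates on elastic kernel shards tracked by the shaded binary tree described next):

```cuda
#include <deque>
#include <cuda_runtime.h>

// Hypothetical description of a dispatchable elastic kernel shard.
struct Shard {
    int blocks;           // thread blocks this shard would occupy
    int threadsPerBlock;  // working threads per block
};

// Greedily pad the GPU with shards that fit into the resources the critical kernel leaves idle.
void pad_with_shards(std::deque<Shard>& pendingShards,
                     int spareBlocks, int spareThreadsPerSM,
                     cudaStream_t normalStream) {
    (void)normalStream;  // stream on which the shard launches would be issued
    for (auto it = pendingShards.begin(); it != pendingShards.end();) {
        if (it->blocks <= spareBlocks && it->threadsPerBlock <= spareThreadsPerSM) {
            // launch the shard's kernel fragment on normalStream here (omitted)
            spareBlocks -= it->blocks;
            it = pendingShards.erase(it);
        } else {
            ++it;
        }
    }
}
```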
4.2.2.2. Dynamic-sized Shaded Binary Tree for Elastic Kernel Shards Formation
To efficiently manage and select elastic kernels at runtime, Miriam introduces a dynamic-sized shaded binary tree structure.
This figure is a schematic of the shaded binary tree construction used for kernel shard formation. It annotates critical kernels, normal kernels, and their scheduling across different phases, along with the grid size and sharding degree. The elastic block size (EBS), possible shards, and actual shards are also illustrated.
Figure 7: Shaded Binary Tree Construction for Kernel Shards Formations. ES refers to the elastic kernel, and EBS refers to the elastic block size. The sharding degree represents the degree of elastic kernel splitting depth.
This structure abstracts the management of elastic kernel shards:

- Tree Structure: The root represents the kernel from normal tasks with its initial grid size. Each node in the tree corresponds to a part of the computation, i.e., potential thread blocks to be dispatched.
- Shading Property: The shading property of each node indicates the elastic block size (EBS) of its thread block.
- Directed Edges: These show potential sliced peers for unfinished computations if a predecessor shard is partially executed or evicted.
- Actual vs. Virtual Shards: Actual shards are the elastic kernel shards that are eventually dispatched; virtual shards are potential fragments that might not be dispatched.
- Sharding Degree: This represents the splitting depth of the elastic kernel.

The policy for selecting elastic fragments from normal kernels is to pick a set of elastic blocks from the head of this shaded kernel binary tree. The chosen elastic blocks are designed to share SM-level resources with co-locating thread blocks from resident critical kernels with minimal contention. The core principle is that elastic blocks from normal kernels should only use the resources left over by the critical kernels, ensuring critical task performance.
This dynamic-sized shaded binary tree allows Miriam to adapt to runtime changes in resource availability and efficiently determine the optimal configuration for elastic kernel shards, leading to high runtime efficiency and low resource contention.
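One possible in-memory representation of this structure (a sketch under the assumption that each child halves its parent's grid slice and records its own elastic block size; field names are illustrative):

```cuda
// Hypothetical node of the dynamic-sized shaded binary tree.
struct ShardNode {
    int firstBlock;         // offset of this shard within the original grid
    int numBlocks;          // grid slice covered by this shard
    int elasticBlockSize;   // EBS: the "shading" property of the node
    bool dispatched;        // actual shard (true) vs. virtual shard (false)
    ShardNode* left;        // first half of the slice at the next sharding degree
    ShardNode* right;       // second half of the slice at the next sharding degree
};
```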
5. Experimental Setup
5.1. Datasets
The authors built a new DNN inference benchmark called MDTB (Mixed-critical DNN Task Benchmarks) based on CUDA implementations of popular DNN models.
The MDTB benchmark simulates three common patterns for inference tasks from user requests:
- Arrival in uniform distribution: The client sends inference requests at a fixed frequency (e.g., 10 requests/second). This simulates critical applications like pose estimation.
- Arrival in Poisson distribution: This simulates event-driven applications such as obstacle detection, where events (and thus requests) arrive randomly over time but at a known average rate.
- Closed-loop workloads: The client continuously sends inference requests as soon as the previous one is completed, simulating maximum load.

The MDTB benchmark uses six representative DNN models from both the computer vision and language processing fields, all implemented in CUDA:

- AlexNet [22]
- SqueezeNet [15]
- GRU (Gated Recurrent Unit) [7]
- LSTM (Long Short-Term Memory) [14]
- ResNet [13]
- CifarNet [30]

For evaluation, neural network inference is conducted with a single batch of images as input, mimicking real-world application scenarios.
The paper defines four specific MDTB workload configurations for evaluation, as summarized in Table 2:
The following are the results from Table 2 of the original paper:
| MDTB | A | B | C | D |
|---|---|---|---|---|
| Critical Task | AlexNet | SqueezeNet | GRU | LSTM |
| Frequency (req/s) | Closed-loop | Uniform (10 reqs/s) | Poisson (10 reqs/s) | Uniform (10 reqs/s) |
| Normal Tasks | CifarNet | AlexNet | ResNet | SqueezeNet |
| Frequency (req/s) | Closed-loop | Closed-loop | Closed-loop | Closed-loop |
- MDTB A: AlexNet as a closed-loop critical task co-running with CifarNet as a closed-loop normal task. This represents a high-contention scenario.
- MDTB B: SqueezeNet as a uniform (10 reqs/s) critical task co-running with AlexNet as a closed-loop normal task.
- MDTB C: GRU as a Poisson (10 reqs/s) critical task co-running with ResNet as a closed-loop normal task.
- MDTB D: LSTM as a uniform (10 reqs/s) critical task co-running with SqueezeNet as a closed-loop normal task.

These diverse workloads were chosen to evaluate Miriam's performance and generalization across different DNN models, task criticalities, and arrival patterns.
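The three arrival patterns used by these workloads can be reproduced with simple request generators; the sketch below (illustrative only, not MDTB's actual client code) shows how the inter-arrival gaps would be drawn for each pattern:

```cuda
#include <random>

// Uniform arrivals: fixed inter-arrival gap, e.g. 10 req/s -> 100 ms between requests.
double uniform_gap_ms(double reqPerSec) {
    return 1000.0 / reqPerSec;
}

// Poisson arrivals: exponentially distributed inter-arrival times with the same mean rate.
double poisson_gap_ms(double reqPerSec, std::mt19937& rng) {
    std::exponential_distribution<double> gap(reqPerSec);  // rate in requests per second
    return 1000.0 * gap(rng);
}

// Closed-loop: no gap at all; the next request is issued as soon as the previous
// inference completes, which drives the GPU at its maximum sustainable load.
```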
5.2. Evaluation Metrics
The paper uses three key metrics to evaluate the performance of Miriam:

- End-to-end Latency of Critical Tasks:
  - Conceptual Definition: This metric quantifies the total time taken for a critical DNN inference task from its submission to the system until its completion. It is a direct measure of real-time performance for tasks with strict deadlines, focusing on how quickly critical tasks are serviced.
  - Mathematical Formula: Not explicitly provided in the paper; conceptually, it is the duration from request initiation to result availability for a specific critical task.
  - Symbol Explanation: N/A (no explicit formula given).
- Overall Throughput:
  - Conceptual Definition: This metric represents the total number of inference requests (from both critical and normal tasks) that the system can successfully serve within a given time period. It quantifies the system's efficiency and capacity to handle a high volume of DNN inferences.
  - Mathematical Formula: Not explicitly provided in the paper; conceptually, it is the total number of requests served divided by the total time.
  - Symbol Explanation: N/A (no explicit formula given).
- Achieved Occupancy:
  - Conceptual Definition: Achieved occupancy is a measure of how effectively the Streaming Multiprocessors (SMs) on the GPU are utilized. It quantifies the average ratio of active warps (groups of 32 threads executing in parallel) on an SM to the maximum number of active warps that the SM can support. A higher achieved occupancy generally indicates better GPU utilization and efficiency, as more warps are actively executing.
  - Mathematical Formula: The paper provides the following definition:
    $$\text{Achieved Occupancy} = \frac{\text{active\_warps} / \text{active\_cycles}}{\text{MAX\_WARPS\_PER\_SM}}$$
  - Symbol Explanation:
    - $\text{active\_warps}$: the number of warps actively executing instructions on an SM at a given time.
    - $\text{active\_cycles}$: the number of clock cycles during which warps are actively executing instructions; the ratio $\text{active\_warps}/\text{active\_cycles}$ gives the average number of active warps per cycle over the measured period.
    - $\text{MAX\_WARPS\_PER\_SM}$: the maximum number of warps that a single Streaming Multiprocessor (SM) can support concurrently, a hardware-defined limit.
5.3. Baselines
Miriam is compared against several existing DNN scheduling approaches on edge GPUs:

- Sequential:
  - Description: This baseline selects one model from the critical task queue and one from the normal task queue in a round-robin fashion, executing them one after another.
  - Rationale: This mode allows critical tasks to run independently, effectively monopolizing GPU resources when active. It is expected to yield the lowest end-to-end latency for critical tasks, but at the cost of significantly reduced overall throughput, because normal tasks cannot run concurrently.
- GPU Multi-stream with Priority:
  - Description: This approach enqueues kernels from both critical and normal tasks simultaneously into different GPU streams. The GPU's default scheduler then executes them in parallel, potentially giving preference to streams marked with higher priority (though CUDA stream priority mechanisms can be complex and platform-dependent). This is a common approach adopted by systems like NVIDIA Triton [3].
  - Rationale: This aims to increase concurrency and throughput by allowing tasks to run in parallel. However, without explicit contention management, it can lead to increased latency for critical tasks due to resource contention with normal tasks.
- Inter-stream Barrier (IB):
  - Description: This is a state-of-the-art multi-DNN operator scheduling method based on multi-stream execution. It uses inter-stream barriers (synchronization primitives) to manually synchronize kernel dispatch among different kernels or kernel groups.
  - Rationale: IB attempts to control concurrency by explicitly coordinating kernel execution across streams, allowing for more deliberate overlapping of operators or kernels. It is expected to manage contention better than simple Multi-stream to some extent.
6. Results & Analysis
6.1. Core Results Analysis
The experiments evaluated Miriam against the baselines (Sequential, Multi-stream, and Inter-stream Barrier (IB)) under MDTB A-D workloads on two edge GPU platforms: NVIDIA GeForce RTX 2060 and NVIDIA Jetson AGX Xavier. The analysis focuses on end-to-end latency for critical tasks, overall throughput, and achieved occupancy.
6.1.1. Closed-loop Critical Tasks (MDTB A)
- Workload: AlexNet (critical, closed-loop) co-running with CifarNet (normal, closed-loop), representing a high resource contention scenario.
- Results (Figures 8a-8f):

This figure compares critical-task latency, overall throughput, and average achieved occupancy on the RTX 2060 and Xavier platforms; tasks run under Miriam perform better, particularly in latency and throughput, and the per-task occupancy data shows that Miriam effectively improves resource utilization. The above figures illustrate the performance comparison across different GPU scheduling approaches under various MDTB workloads.

- Critical Task Latency (Figures 8a, 8b):
  - Compared to Sequential (which has the lowest critical task latency by design), Multi-stream and IB increased latency considerably on both the RTX 2060 and Xavier, whereas Miriam incurred only a 21% (2060) and 28% (Xavier) latency overhead for critical tasks. This demonstrates Miriam's effectiveness in preserving real-time performance for critical tasks even under heavy contention.
- Overall Throughput (Figures 8c, 8d):
  - Miriam significantly improved overall throughput, by 64% on the 2060 and 83% on Xavier, compared to Sequential.
  - Notably, IB's throughput was sometimes worse than Sequential's. This is attributed to frequent critical task launches requiring more synchronization barriers between GPU streams, leading to significant overhead. Multi-stream also improved throughput, but less effectively than Miriam, while incurring higher critical task latency.
- Achieved Occupancy (Figures 8e, 8f):
  - Miriam consistently led to higher SM-level GPU resource utilization (achieved occupancy) than the other baselines, indicating that Miriam effectively "fills in" idle GPU cycles with normal task computations without hurting critical tasks.
  - The paper notes that achieving nearly 100% theoretical occupancy is difficult for DNN inference tasks, since large thread blocks can cause resource idleness or leave an SM unable to cover memory access latency.
6.1.2. Uniform/Poisson Critical Tasks (MDTB B, C, and D)
- Workload: Scenarios with lower critical-task launching frequencies (uniform 10 reqs/s or Poisson 10 reqs/s) for SqueezeNet, GRU, and LSTM, co-running with closed-loop normal tasks such as AlexNet, ResNet, and SqueezeNet.
- Results:
  - As the critical workload frequency decreases, all approaches generally show improved overall throughput, since normal tasks get more opportunities to share GPU resources. Miriam maintained its superior performance: on Xavier, Miriam increased overall throughput over Sequential on MDTB B, C, and D. Multi-stream and IB also yielded improved throughput but caused severe latency degradation for critical tasks (32% to 88%), whereas Miriam only incurred a latency overhead of less than 21% on these benchmarks, confirming its ability to maximize throughput while preserving critical task latency.
  - From a GPU utilization perspective, Miriam increased the average active warps per cycle, leading to better SM utilization. This validates the effectiveness of elastic kernel sharding and dynamic padding.
- Observation on Jetson Xavier:
  - The performance improvements from Miriam sometimes did not directly translate to higher SM occupancy on Jetson Xavier. This is explained by Xavier's lower onboard resources, fewer SMs, relatively low memory bandwidth, and lower thermal design power compared to the RTX 2060. These factors can limit clock speed and parallelism, and thus weaken the direct relationship between SM occupancy and performance.
6.1.3. In-depth Analysis of Miriam (AlexNet Co-running)
To understand Miriam's performance under severe contention, an in-depth analysis was conducted with two AlexNet models co-running on an RTX 2060: AlexNet-C (critical) and AlexNet-N (normal), both in closed-loop.
This chart shows the active kernel-level timelines of the two co-running AlexNet models under Multi-stream and Miriam, together with the average achieved occupancy of each layer. The upper part compares the timeline behavior of Multi-stream and Miriam, while the lower part shows the occupancy achieved in each layer, indicating that Miriam effectively reduces latency.
Figure 9: (Upper) The active kernel-level timeliness of two co-running AlexNet models with mixed-criticality, which is profiled from the Nsight System. (Lower) The average achieved occupancy for each layer of the critical AlexNet.
- Timelines (Figure 9, Upper): The figure (sketched from NVIDIA Nsight Systems profiles) shows Miriam (pink blocks for elastic kernels) padding critical kernels (blue blocks) much more densely than Multi-stream (green blocks). This visualizes how Miriam's elastic kernel shards are tightly packed around critical kernels, leading to a significantly lower end-to-end latency for AlexNet-C under Miriam than under Multi-stream.
- Achieved Occupancy (Figure 9, Lower): The average layer-wise achieved occupancy for Miriam was 65.25%, significantly higher than the 32.9% for Multi-stream. This confirms that Miriam achieves better SM utilization and parallelism by reducing contention overhead while preserving the speed of critical tasks.
6.1.4. Evaluations on Design Space Shrinking
Miriam's design space shrinking mechanism (Section 6.3 of the paper) prunes 80% of definitely-slow cases using hardware limiters.
This figure illustrates the shrinking of the elastic-candidate design space for different DNN models. It marks the Pareto Frontier, the elastic kernel candidates, and the unqualified candidates, and shows the effect of the pruned space for each model, providing a visual reference for understanding the performance optimization in the overall Miriam framework.
Figure 10: Shrinking the design space of elastic candidates for different DNN Models. Miriam picks up elastic kernels lying in the Pareto Frontier (for visualization) of tradeoff space between the elasticized scale and the scheduling granularity.
Figure 10 illustrates the pruned space of candidate elastic kernels for different DNN models in MDTB. The pruning percentage ranges from 84% to 95.2%. This indicates that the constraints (WIScore and OScore) effectively reduce the search space for elastic kernel configurations. The authors ensure that this pruning does not remove optimal configurations by testing representative tensor operations and finding that the top 20% of candidates always included the best-performing parameters. The Pareto Frontier is used to visualize the tradeoff between elasticized scale (depth of the dynamic shaded binary tree) and scheduling granularity, guiding the selection of optimal elastic kernels.
6.1.5. Case Study: Autonomous Driving with LGSVL
A real-world trace from the LG SVL [16] autonomous driving platform was used as a workload: ResNet for obstacle detection (critical, uniform arrival) and SqueezeNet for pose estimation (normal, uniform arrival) on an RTX 2060.
This chart presents experimental results on the RTX 2060 platform for Miriam's end-to-end critical-task latency, overall throughput, and average achieved occupancy: (a) shows the normalized latency of the critical task, (b) the overall throughput, and (c) the achieved occupancy. Miriam outperforms the other baselines in both latency and throughput.
The above figures illustrate the performance comparison of different scheduling schemes with LGSVL simulated workloads.
Figure 12: Real-world trace collected from LGSVL simulator, where (a) is the object detection result based on image data, (b) is the result with lidar point cloud data, and (c) depicts the setting of our collected trace.
- Results (Figure 11):
  - Multi-stream and IB increased overall throughput but amplified critical task latency considerably. Miriam achieved an 89% improvement in overall throughput compared to Sequential, while only incurring an 11% latency overhead for the critical task.
  - This strong performance is attributed to the relatively low launching frequency of both the critical and normal tasks, which allows elastic kernels to execute concurrently with minimal eviction overhead for elastic kernel shards.
  - Figure 11(c) and Figure 12(c) show that Miriam achieves the highest SM occupancy among all baselines, confirming its efficient resource utilization in this realistic scenario.
6.2. System Overhead
The scheduling overhead of Miriam consists of two main parts:
- Runtime Elastic Kernel Shards Selection: This involves scanning the shard candidates, so its complexity is linear in the number of candidates. The average overhead for serving each DNN model was found to be minimal.
- Launch Time Overhead for Critical Kernels: This overhead is due to the padding of elastic kernels; in over 80% of cases it was very small. The paper notes that this latency overhead is primarily due to contention on the texture cache and L2 memory, which is left for future work.

These overheads are considered minimal, demonstrating Miriam's practicality for real-time edge deployments.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper proposes Miriam, a novel system designed to address the challenges of latency and throughput in co-running multiple DNN inference tasks on resource-limited edge GPUs. Miriam introduces the concept of elastic kernels which enable fine-grained GPU resource remapping, and a runtime dynamic kernel coordinator that intelligently schedules these elastic kernels alongside critical tasks. Key contributions include the elastic-kernel generator (with elastic blocks and elastic grids and a source-to-source transformer for computational consistency) and the dynamic coordinator (using a greedy scheduling policy and shaded binary trees). Experimental results on a custom MDTB benchmark across two edge GPU platforms demonstrate that Miriam significantly improves overall system throughput (up to 92%) while maintaining minimal latency overhead (less than 10%) for critical tasks compared to state-of-the-art baselines.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Scalability for General Multi-DNN Tasks: While Miriam shows promise for pair-wise DNN tasks, scaling it to a larger and more general number of concurrently running DNN tasks requires further consideration. This includes developing robust scheduling policies for normal tasks that share the same priority, and finding efficient ways to perform offline kernel profiling as the design space grows exponentially with more co-running kernels.
- Integration with DNN Compilers: The paper suggests integrating Miriam with existing DNN compilers like TVM [5]. While TVM can generate high-performance kernels, its auto-tuning and compilation are offline processes, making it difficult to adapt to dynamic runtime scenarios on edge devices. Miriam could serve as a post-compiling runtime to bridge this gap, ensuring full resource utilization adaptively.
- Orthogonality to Other Approaches: The authors state that Miriam can work symbiotically with other DNN optimization approaches such as model compression [23] and edge-cloud offloading [41]. Miriam itself does not perform these optimizations, but it could enhance their effectiveness when combined, leading to improved runtime performance and resource utilization in resource-constrained edge computing environments.
- L2 Memory/Texture Cache Contention: The observed launch time overhead for critical kernels due to padding is mainly attributed to contention on the texture cache and L2 memory. This specific area of contention is left for future investigation to further minimize overhead.
7.3. Personal Insights & Critique
Miriam presents a highly innovative and practical solution for a critical problem in edge AI – managing mixed-criticality multi-DNN inference on resource-limited GPUs. The core concept of elastic kernels is particularly insightful, moving beyond static kernel configurations to a dynamic, contention-aware resource allocation strategy. This software-centric approach, avoiding driver modifications or reliance on server-grade hardware features, makes it widely applicable to current edge platforms.
The rigorous experimental validation on diverse DNN workloads and realistic edge GPU platforms (RTX 2060 and Jetson AGX Xavier) lends strong credibility to the proposed framework. The detailed breakdown of intra-SM and inter-SM contention, followed by specific mechanisms (elastic blocks, elastic grids, design space shrinking) to address them, showcases a deep understanding of GPU architecture and performance bottlenecks. The source-to-source kernel transformer is a clever solution to a fundamental challenge of kernel modification, ensuring computational correctness.
One area for potential improvement or further exploration is a more detailed analysis of the WIScore formula, specifically the term that appears to be a minor typo in the paper; clarifying its intended meaning would enhance the rigor. Additionally, while the paper mentions L2 memory and texture cache contention as sources of overhead, a deeper dive into how elastic kernel shards interact with these specific cache hierarchies, and potential mitigation strategies, would be valuable.
The idea of combining Miriam with other DNN optimization techniques like model compression or edge-cloud offloading is powerful. This orthogonality suggests Miriam could serve as a foundational runtime coordination layer in a holistic edge AI system, optimizing resource utilization across various dimensions. The framework's ability to adapt to dynamic runtime conditions is crucial for IoT applications where on-device resources can fluctuate. This research provides a significant step towards enabling robust and efficient real-time AI at the edge.