
Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU

Published: July 10, 2023

TL;DR Summary

Miriam is a task coordination framework for multi-DNN inference on edge GPUs, integrating an elastic-kernel generator and a runtime coordinator. It boosts system throughput by 92% while incurring less than 10% latency overhead for critical tasks.

Abstract

Many applications, such as autonomous driving and augmented reality, require the concurrent running of multiple deep neural networks (DNNs) that pose different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack hardware-level resource management mechanisms for avoiding resource contention. Therefore, we propose Miriam, a contention-aware task coordination framework for multi-DNN inference on edge GPUs. Miriam consolidates two main components, an elastic-kernel generator and a runtime dynamic kernel coordinator, to support mixed-critical DNN inference. To evaluate Miriam, we build a new DNN inference benchmark based on CUDA with diverse representative DNN workloads. Experiments on two edge GPU platforms show that Miriam can increase system throughput by 92% while only incurring less than 10% latency overhead for critical tasks, compared to state-of-the-art baselines.

In-depth Reading

1. Bibliographic Information

1.1. Title

The paper is titled "Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU".

1.2. Authors

The authors are:

  • Zhihe Zhao

  • Neiwen Ling

  • Nan Guan

  • Guoliang Xing

    Their affiliations are:

  • The Chinese University of Hong Kong (Zhihe Zhao, Neiwen Ling, Guoliang Xing)

  • City University of Hong Kong (Nan Guan)

1.3. Journal/Conference

The paper was published on arXiv, a preprint server. While not a peer-reviewed journal or conference proceeding itself, arXiv is widely used in academic communities for rapid dissemination of research findings before formal publication. Its reputation is high for sharing cutting-edge research, but the work hasn't undergone formal peer review at the time of this publication.

1.4. Publication Year

The paper was submitted to arXiv on July 10, 2023 (2023-07-10T04:30:44 UTC).

1.5. Abstract

This paper addresses the challenge of coordinating multiple deep neural network (DNN) inference tasks with varying real-time requirements on resource-limited edge GPUs. Unlike server-level GPUs, edge GPUs lack hardware-level resource management mechanisms, leading to significant resource contention when running multiple DNNs concurrently. To solve this, the authors propose Miriam, a contention-aware task coordination framework. Miriam consists of two main components: an elastic-kernel generator and a runtime dynamic kernel coordinator, designed to support mixed-critical DNN inference. The framework enables fine-grained control over GPU resources by transforming traditional kernels into elastic kernels that can be dynamically scheduled. To evaluate Miriam, the authors created a new CUDA-based DNN inference benchmark with diverse workloads. Experiments on two edge GPU platforms demonstrate that Miriam can significantly increase system throughput by 92% while maintaining low latency overhead (less than 10%) for critical tasks compared to state-of-the-art baselines.

  • Official Source Link: https://arxiv.org/abs/2307.04339

  • PDF Link: https://arxiv.org/pdf/2307.04339v1.pdf

    This paper is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

Deep learning (DL) applications, such as autonomous driving and augmented reality, are increasingly deployed on edge devices. These applications often require the concurrent execution of multiple deep neural network (DNN) tasks, each potentially having different real-time performance requirements (i.e., some tasks are critical and must meet strict deadlines, while others are normal and can run in a best-effort manner).

The core problem the paper aims to solve is the efficient coordination of these multi-DNN inference tasks on edge GPUs. This problem is important because:

  1. Resource Limitation: Edge GPUs are inherently resource-constrained compared to server-level GPUs.

  2. Lack of Hardware Management: Unlike server GPUs (which might have NVIDIA Multi-Process Service (MPS) or Multi-Instance GPU (MIG)), edge GPUs typically lack sophisticated hardware-level resource management mechanisms to prevent or mitigate resource contention among concurrently running tasks.

  3. Conflicting Objectives: There's a fundamental conflict between prioritizing latency-critical tasks (to guarantee their real-time requirements) and maximizing overall throughput (by concurrently executing all tasks in a best-effort manner). Dedicating the entire GPU to critical tasks leads to low GPU utilization, while running everything concurrently without coordination can cause unacceptable latency for critical tasks.

    Prior research often struggles with these constraints, either requiring unavailable hardware support, modifications to closed-source GPU drivers, or incurring high overheads for offline profiling. The paper's innovative idea, or entry point, is to leverage an elastic kernel abstraction that allows for more fine-grained resource mapping on the GPU, enabling dynamic scheduling and remapping of resources based on task priority and criticality. This aims to maximize GPU utilization without causing significant resource contention for critical tasks.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Miriam Framework: Proposing Miriam, a contention-aware task coordination framework specifically designed for multi-DNN inference on edge GPUs. This framework addresses both intra-SM (Streaming Multiprocessor) and inter-SM resource contention.
  2. Elastic-Kernel Generator: Introducing an elastic-kernel generator that transforms original GPU kernels into elastic kernels with adjustable grid size and block size. This allows for fine-grained control over GPU resource usage. Key components include an elastic grid/block generator and a source-to-source kernel transformer that ensures computational consistency during transformation.
  3. Runtime Dynamic Kernel Coordinator: Developing a runtime dynamic kernel coordinator that dynamically schedules elastic kernels and critical kernels. It employs a greedy scheduling policy and a dynamic-sized shaded binary tree structure for elastic kernel shards formation to optimize resource utilization and minimize interference.
  4. MDTB Benchmark: Building a new DNN inference benchmark (MDTB) based on CUDA, featuring diverse representative DNN workloads and different task priority settings for edge GPUs.
  5. Experimental Validation: Demonstrating through experiments on two edge GPU platforms (NVIDIA RTX 2060 and NVIDIA Jetson AGX Xavier) that Miriam can:
    • Increase system throughput by up to 92% compared to state-of-the-art baselines.

    • Incur less than 10% latency overhead for critical tasks, effectively balancing the conflicting objectives of high throughput and low critical-task latency.

    • Achieve higher achieved occupancy (a measure of GPU utilization) compared to baselines.

      These findings solve the problem of efficiently coordinating mixed-critical multi-DNN inference on resource-limited edge GPUs by enabling adaptive resource sharing and contention mitigation without requiring specialized hardware support or compromising critical task performance.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand Miriam, a novice reader should be familiar with the following core concepts:

  • Deep Neural Network (DNN) Inference: This is the process of using a pre-trained DNN model to make predictions or classify new, unseen data. In the context of edge devices, this typically involves feeding sensor data (e.g., images from a camera, lidar data) through the DNN to get an output (e.g., object detection, pose estimation).
  • Edge GPU (Graphics Processing Unit): A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Edge GPUs are specifically designed for deployment on edge devices (e.g., autonomous vehicles, drones, IoT devices) which are characterized by limited computational resources, power budgets, and often lack the advanced features (like hardware virtualization) found in data center GPUs.
  • Real-time Systems / Mixed Criticality:
    • Real-time System: A system that must respond to events within a specified time constraint. Failure to do so is considered a system failure.
    • Mixed Criticality: Refers to systems where tasks with different levels of criticality (i.e., importance and strictness of deadlines) run on the same platform. In this paper, critical tasks have strict real-time requirements (e.g., obstacle detection in autonomous driving), while normal tasks can be executed in a best-effort manner (e.g., driver fatigue monitoring).
  • CUDA Programming Model: NVIDIA's parallel computing platform and application programming interface (API) that allows software developers to use a GPU for general-purpose processing (a minimal launch sketch follows this list).
    • GPU Architecture (NVIDIA Pascal/Ampere): NVIDIA GPUs are composed of multiple Streaming Multiprocessors (SMs). Each SM contains multiple CUDA cores (processing units), shared memory, and registers.
    • GPU Kernels: A function that is executed on the GPU. When a CPU (host) invokes a kernel, the GPU (device) executes it in parallel using many threads.
    • Threads: The smallest unit of execution on a GPU. They execute instructions in parallel.
    • Thread Blocks: A group of threads that can execute concurrently on a single SM. Threads within the same block can communicate efficiently through shared memory and synchronize.
    • Grids: A collection of thread blocks organized in a multi-dimensional array. A grid defines the overall structure of the kernel's computation.
    • GPU Streams: A sequence of GPU operations (e.g., kernel launches, memory copies) that are executed in order. Operations in different streams can run concurrently, enabling asynchronous execution and overlapping computation with data transfers.
    • SM Occupancy: A measure of how well an SM is utilized. It is the ratio of the number of active warps (groups of 32 threads that execute instructions in lockstep) on an SM to the maximum number of active warps supported by that SM. Higher occupancy generally means better GPU utilization.
  • Resource Contention: When multiple DNN tasks or GPU kernels try to use the same limited hardware resources simultaneously, they compete for access. This competition is called resource contention.
    • Intra-SM Contention: Competition for resources within a single Streaming Multiprocessor (SM). This can occur when multiple thread blocks (even from different kernels) are dispatched to the same SM and compete for shared memory, registers, or execution units.
    • Inter-SM Contention: Competition for resources among different Streaming Multiprocessors (SMs). This typically involves shared global resources like global memory bandwidth, L2 cache, or memory controllers. It can also refer to contention for dispatching thread blocks to available SMs.
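
To make the grid/block/thread hierarchy and stream concepts concrete, here is a minimal, self-contained CUDA example. It is illustrative only and not taken from the paper; all names are placeholders.

```cuda
// Minimal, self-contained CUDA sketch (illustrative, not from the paper):
// a kernel launched as a grid of thread blocks on a stream.
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // Each thread handles one element: the default 1:1 logical-to-physical mapping.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);                  // operations within one stream run in order

    dim3 block(256);                            // block size: threads per thread block
    dim3 grid((n + block.x - 1) / block.x);     // grid size: number of thread blocks
    vecAdd<<<grid, block, 0, stream>>>(a, b, c, n);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```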

3.2. Previous Works

The paper discusses several categories of prior approaches for multi-DNN inference on edge devices:

  1. Joint DNN Model Compression: Methods like NestDNN [8] sacrifice a small amount of accuracy in individual DNN models to reduce their computational cost, allowing more models to fit on resource-limited devices.
    • Differentiation: Miriam does not compromise on accuracy and is considered an orthogonal approach, meaning it can potentially be combined with model compression.
  2. New Compiling Techniques:
    • Veltair [24] generates multiple versions of compiled DNN models with different resource contention intensities that can be selected at runtime for scheduling.
    • Differentiation: These methods often lead to high storage overhead and require time-consuming offline profiling, making them difficult to scale. Miriam aims for dynamic, runtime adaptation without such heavy offline costs.
  3. Operator Interleaving: Systems like DeepEye [25], Abacus [6], and Dart [35] interleave DNN operators based on their contention channels (e.g., memory-bound vs. compute-bound operations) to improve efficiency.
    • Differentiation: These methods require extensive offline profiling and are hard to generalize for new DNN tasks. Miriam focuses on a more general kernel-level coordination approach.
  4. Kernel-level Preemption:
    • REEF [11] achieves kernel-level preemption for critical tasks to ensure their responsiveness.
    • Differentiation: REEF requires modifications to the GPU driver library, which is often impractical for closed-source devices common in the edge ecosystem. Miriam avoids driver modifications.
  5. Resource Contention Solutions with Different Settings:
    • Heimdall [38] and Band [19] also address resource contention for multi-DNN inference but in different application contexts (e.g., Augmented Reality for Heimdall, heterogeneous mobile processors for Band).
    • Differentiation: While related, their specific focuses and underlying assumptions may differ from Miriam's general real-time mixed-criticality problem on edge GPUs.
  6. Performance Modeling & Prediction:
    • Warped-Slicer [36] uses performance vs. computing unit occupancy curves to select optimized simultaneous kernel patterns.
    • HSM [40] and [31] model latency degradation of concurrent GPU kernel executions based on hardware information.
    • Differentiation: Warped-Slicer does not fully address resource contention between kernels. Performance predictors built in these works are often difficult to adapt to real-world multi-DNN inference scenarios due to non-deterministic kernel overlapping and the lack of runtime event support on edge GPUs.
  7. Resource Management (Space-multiplexing):
    • Smcentric [33] and Effisha [4] tackle GPU multitasking from a resource management perspective using space-multiplexing (e.g., partitioning SMs among tasks) [18, 34].
    • Differentiation: These approaches are orthogonal to Miriam's. Miriam focuses on time-multiplexing and fine-grained kernel-level coordination rather than static space partitioning.

3.3. Technological Evolution

The field of DNN inference on edge devices has evolved from initial attempts at running single DNNs to now demanding multi-DNN workloads with diverse real-time requirements. Early solutions focused on model compression or quantization to fit models within edge device memory and computational limits. As DNN applications grew more complex (e.g., multiple perception modules in autonomous driving), the need for concurrent execution became paramount.

Initially, simple sequential execution was used, guaranteeing critical task latency but sacrificing throughput. Then, multi-stream approaches allowed for basic parallelism, but without sophisticated contention management, critical task latency suffered. Server-grade GPUs introduced hardware features like MIG and MPS to address these multi-tenant challenges, but edge GPUs often lack these due to architectural differences and cost considerations.

This paper's work (Miriam) fits within the current technological timeline by addressing the specific gap of contention-aware real-time multi-DNN inference on resource-limited edge GPUs that lack server-grade hardware management. It builds upon the understanding of GPU resource contention (intra-SM and inter-SM) and leverages software-based kernel transformation and dynamic coordination to bridge this gap, pushing the boundaries of what's achievable without driver modifications or specialized hardware support.

3.4. Differentiation Analysis

Compared to the main methods in related work, Miriam offers several core differences and innovations:

  • No Accuracy Compromise: Unlike joint DNN model compression techniques, Miriam focuses purely on scheduling and resource management without altering the DNN models themselves, thus preserving original inference accuracy.
  • Avoidance of Hardware/Driver Modifications: Unlike REEF which requires GPU driver modifications, or approaches relying on NVIDIA MPS/MIG (unavailable on edge GPUs), Miriam operates at the CUDA API level, making it widely applicable to various edge GPU platforms.
  • Dynamic, Runtime Adaptation: Miriam provides runtime dynamic kernel coordination which is crucial for real-time mixed-criticality scenarios. This contrasts with methods that rely heavily on time-consuming offline profiling (Veltair, DeepEye, Abacus, Dart, Warped-Slicer), which struggle to adapt to dynamic workloads and new DNN tasks.
  • Fine-grained Kernel-level Control: By introducing the concept of elastic kernels with adjustable grid and block sizes, Miriam achieves much finer-grained control over GPU resources (both intra-SM and inter-SM) than coarser-grained operator interleaving or space-multiplexing approaches.
  • Contention-aware Scheduling: Miriam explicitly identifies and addresses intra-SM and inter-SM contention as primary causes of latency degradation. Its elastic kernel design and dynamic coordinator are specifically tailored to mitigate these types of contention, a focus often less explicit or comprehensively addressed in other general GPU multitasking solutions.
  • Computational Consistency: The source-to-source kernel transformer is an innovation that ensures that modifying grid and block sizes (which can easily lead to computation errors) does not compromise the correctness of the DNN inference, a challenge explicitly acknowledged and solved by Miriam.

4. Methodology

4.1. Principles

The core idea behind Miriam is to introduce an elastic kernel abstraction that allows for fine-grained control over GPU resources, specifically grid size and block size. This elasticity enables Miriam to dynamically adjust resource allocation based on the priority and criticality of DNN tasks, thereby mitigating resource contention (both intra-SM and inter-SM) while maximizing GPU utilization. The theoretical basis is that by making kernels "elastic," they can be dynamically scheduled and remapped to different GPU resources in a way that prioritizes critical tasks without causing significant interference, allowing normal tasks to efficiently use leftover resources.

The intuition is to break down rigid GPU kernel execution into smaller, more flexible units (elastic blocks, elastic grids, and elastic kernel shards). This allows the system to intelligently "bin-pack" these flexible units around critical kernels, filling idle SM resources and improving overall throughput without delaying critical tasks.

4.2. Core Methodology In-depth (Layer by Layer)

Miriam is a compiler-runtime synergistic framework with two main components: Offline Elastic Kernel Generation and Online Kernel Coordination.

4.2.1. Offline Elastic Kernel Generation

This component is responsible for transforming standard GPU kernels into elastic kernels that allow for adjustable resource usage patterns.

4.2.1.1. Controllable Intra-SM Resource by Elastic Block

Intra-SM contention occurs when multiple thread blocks on the same SM compete for resources like registers, shared memory, and execution units. To manage this, Miriam introduces elastic blocks.

The concept of elastic block is achieved by adjusting the number of threads within a thread block. Instead of static block sizes, elastic blocks can vary their thread count. This is implemented using the persistent thread technique [10], where threads remain active throughout the kernel function's execution, allowing for dynamic adjustment of the resident block size on an SM.

The key transformation here is to change the default 1:1 logical-to-physical thread mapping scheme to an N:1 mapping. This means that N logical computation units are mapped to one physical GPU thread, which helps control intra-SM parallelism and resource usage. Miriam limits the range of each elastic block size to be between 1 and the maximum resident block size. The design ensures that the memory occupancy of an elastic kernel is either equal to or less than that of the original kernel, staying within shared memory limits.
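
The following sketch shows what an elastic-block kernel might look like under the persistent-thread, N:1 mapping described above. It is an assumption about the general shape of such code, not Miriam's actual generated kernel; names like `elasticKernel` and `logical_threads` are illustrative.

```cuda
// Sketch of an "elastic block" kernel under the persistent-thread style
// N:1 mapping (an assumed shape, not Miriam's generated code). The launched
// block size can be shrunk below the logical block size; each physical thread
// then loops over several logical computation units.
__global__ void elasticKernel(const float* in, float* out, int logical_threads) {
    int phys_threads = gridDim.x * blockDim.x;                // physical threads actually resident
    int phys_id      = blockIdx.x * blockDim.x + threadIdx.x;

    // N:1 mapping: each physical thread covers ceil(logical_threads / phys_threads)
    // logical units, so intra-SM resource usage tracks the launch configuration.
    for (int logical_id = phys_id; logical_id < logical_threads;
         logical_id += phys_threads) {
        out[logical_id] = in[logical_id] * 2.0f;              // placeholder computation
    }
}

// Host side: shrink the resident block size without changing the result, e.g.
//   dim3 block(64);                 // elastic block size, <= original block size
//   dim3 grid(num_thread_blocks);
//   elasticKernel<<<grid, block>>>(d_in, d_out, n);
```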

4.2.1.2. Elastic Grid for Inter-SM Contention

Inter-SM contention primarily involves competition for global memory and memory controllers when many thread blocks across different SMs are launched rapidly. To address this, Miriam uses an elastic grid generator that slices the initial grid into multiple smaller grids.

Given an original kernel $K$ with $M$ total thread blocks, a slicing plan $P(K)$ divides $K$ into a sequence of $n$ slices $[s_0, s_1, s_2, ..., s_{n-1}]$ at thread-block granularity. This allows these smaller, sliced grids to be scheduled independently by the GPU, enabling interleaving with other critical kernels and improving time-multiplexing.

The paper provides a dichotomy algorithm-based slicing plan for a DNN kernel $K$ with $M$ thread blocks:

$$S(K) = \left( \frac{M}{2^{n}}, \frac{M}{2^{n-1}}, \ldots, M \right), \quad n = \max_{i} \{\, i \mid M \bmod 2^{i} = 0 \,\}$$

Where:

  • $S(K)$ represents the sequence of possible slicing schemes for kernel $K$.

  • $M$ is the total number of thread blocks in the original kernel $K$.

  • $n$ is the largest power index of 2 such that $M$ is divisible by $2^n$. This ensures that the kernel can be recursively divided into halves until the smallest possible slice of $\frac{M}{2^n}$ thread blocks is reached. The sequence includes all possible divisions up to the original $M$ blocks.

    By using elastic grids, normal kernels can be issued with a flexible number of thread blocks on SMs, co-locating with critical kernels and reducing inter-SM memory contention.
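
The slicing sequence itself is easy to enumerate on the host. The following sketch is an assumed implementation of the dichotomy formula above, not the paper's code; the function name `slicingPlan` is illustrative.

```cuda
// Host-side sketch: enumerate candidate slice sizes for a kernel with M thread
// blocks, i.e. (M/2^n, M/2^(n-1), ..., M) with n the largest power of two dividing M.
#include <vector>

std::vector<int> slicingPlan(int M) {
    if (M <= 0) return {};                 // guard: the kernel must have at least one block
    int n = 0;
    while (M % (1 << (n + 1)) == 0) ++n;   // largest n such that M mod 2^n == 0

    std::vector<int> slices;
    for (int i = n; i >= 0; --i)
        slices.push_back(M >> i);          // a slice of M/2^i thread blocks
    return slices;                         // e.g. M = 16 -> {1, 2, 4, 8, 16}
}
```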

4.2.1.3. Workload-balanced-guided Design Space Shrinking

The design space for elastic kernels (combinations of grid size $N_{blk\_be}$ and block size $S_{blk\_be}$) is enormous. To make runtime selection feasible, Miriam prunes this space by eliminating configurations unlikely to perform well or leading to dispatch failures. This is done using two main constraints and corresponding metrics:

  1. SM-level Workload Imbalance Constraint: The first constraint aims to address unbalanced workload across SMs when the number of thread blocks is not a multiple of the number of SMs. It prunes cases where the elastic kernel's thread blocks would exceed the remaining available SMs after critical kernels have been dispatched.

    The second constraint addresses intra-SM workload balance, ensuring that each SM is utilized effectively without being either too light (wasted resources) or too heavy (resource contention). It prunes cases where the working threads of an elastic kernel would overly exceed the spare intra-SM resources already occupied by critical kernel blocks.

    These two constraints are formally represented as:

    $$\left\{ \begin{array}{l} N_{blk\_be} \leqslant N_{SM} - N_{blk\_rt} \bmod N_{SM} \\ S_{blk\_be} \leqslant L_{threads} - blk\_size_{rt} \end{array} \right.$$

    Where:

    • $N_{blk\_be}$: Number of thread blocks in a dispatched elastic normal kernel.

    • $N_{SM}$: Number of streaming multiprocessors on the GPU.

    • $N_{blk\_rt}$: Number of thread blocks in a dispatched critical kernel.

    • $S_{blk\_be}$: Number of working threads of each thread block in a dispatched elastic normal kernel.

    • $L_{threads}$: Limit on the number of working threads (maximum thread capacity per SM).

    • $blk\_size_{rt}$: Number of threads of each thread block in a dispatched critical kernel.

      A workload imbalance metric called WIScore is defined:

      $$WIScore = \frac{N_{blk\_rt} \bmod N_{SM} + N_{blk\_be}}{N_{SM}} \times \frac{S_{blk\_be} + S_{blk\_be}}{L_{threads}}$$

      Where:

    • WIScore ranges from [0, 1]. A higher score indicates better workload balance and utilization. The term $S_{blk\_be} + S_{blk\_be}$ seems to be a typo in the original paper and is likely intended to be $S_{blk\_rt} + S_{blk\_be}$ (or similar) to represent the combined thread usage, or simply $S_{blk\_be}$ relative to $L_{threads}$ if $S_{blk\_rt}$ is already accounted for in $L_{threads}$. For faithfulness, the formula is presented exactly as in the paper.

  2. Dispatch Overhead Constraint: This constraint prunes candidates that incur excessive dispatch overhead. The OScore is used to filter such cases:

     $$OScore = \left\{ \begin{array}{ll} 1 & \text{if } \sum LO_{blk}(k_{be\_i}) < MAX_{blk} \text{ and } \sum LO_{pt}(k_{be\_i}) < MAX_{pt}, \ \forall i \in [1, N_{shard}] \\ 0 & \text{otherwise} \end{array} \right.$$

     Where:

    • $LO(\cdot)$ represents the launch overhead for an elastic kernel fragment (a shard).

    • $k_{be\_i}$: Elastic kernel fragment $i$.

    • $MAX_{blk}$: Maximum acceptable launch overhead for thread blocks.

    • $MAX_{pt}$: Maximum acceptable launch overhead for persistent threads.

    • $N_{shard}$: Total number of elastic kernel shards.

    • OScore is 1 if the sum of launch overhead for all elastic kernel fragments (both block-related $LO_{blk}$ and persistent-thread-related $LO_{pt}$) is below predefined maximum thresholds; otherwise, it's 0.

      The combined metric (WIScore * OScore) helps identify promising elastic kernel candidates. Miriam then selects the top 20% of these candidates (based on this product) for runtime kernel coordination, aiming to reduce the design space without sacrificing performance.
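
The pruning step can be pictured as a simple host-side filter over candidate configurations. The sketch below is an assumption about how the two constraints and the WIScore could be evaluated; the struct and variable names mirror the symbols above, and the second WIScore factor uses the likely intended reading discussed earlier rather than the formula as printed.

```cuda
// Host-side sketch of the pruning filter (illustrative; not the paper's code).
struct ElasticConfig {
    int n_blk_be;   // thread blocks of the elastic (normal) kernel
    int s_blk_be;   // threads per block of the elastic kernel
};

// Constraint check: does the candidate fit into the resources left by the critical kernel?
bool satisfiesConstraints(const ElasticConfig& c,
                          int n_sm, int n_blk_rt, int blk_size_rt, int l_threads) {
    bool inter_sm_ok = c.n_blk_be <= n_sm - (n_blk_rt % n_sm);   // SM-level balance
    bool intra_sm_ok = c.s_blk_be <= l_threads - blk_size_rt;    // per-SM thread budget
    return inter_sm_ok && intra_sm_ok;
}

// WIScore in [0, 1]; higher means better balance. The second factor uses the
// likely intended reading (critical + elastic threads per SM), since the
// printed formula repeats S_blk_be.
float wiScore(const ElasticConfig& c,
              int n_sm, int n_blk_rt, int blk_size_rt, int l_threads) {
    float sm_fill     = float((n_blk_rt % n_sm) + c.n_blk_be) / float(n_sm);
    float thread_fill = float(blk_size_rt + c.s_blk_be) / float(l_threads);
    return sm_fill * thread_fill;
}
```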

4.2.1.4. Source-to-Source Elastic Kernel Transformer

Directly modifying grid and block sizes in CUDA kernels often leads to computation errors because these parameters are intrinsically linked to the kernel's computation schedule and how logical threads are mapped to physical GPU threads. This issue is illustrated in Figure 6.

Figure 6: Grid/Block size cannot be directly modified in case of recomputation/computation error. (The figure sketches the relationship between physical and logical threads in a shared-memory tiled matrix multiplication, emphasizing that the tile size is tied to the thread-block size and showing the layout across the memory hierarchy.)

To overcome this, Miriam employs a source-to-source kernel transformer. This transformer inserts a piece of code at the beginning of each kernel. This injected code:

  1. Computes a global thread identifier: This identifier acts as a basis for SM-level workload distribution. It takes the thread ID as input and produces a corresponding index for the data element accessed by the thread.

  2. Replaces physical thread references with logical equivalents: Variables like GridDim (total grid dimensions) and threadIdx.x (thread index within a block) in the original kernel codes are replaced with their logical equivalents. This ensures that even when the underlying grid or block structure changes, the kernel correctly accesses the intended data elements and performs the correct computations.

    The transformer uses two approaches for implementing the index function:

  • Computation-based: The index is computed within the kernel when the thread accesses a data element.

  • Memory-based: Indices are pre-calculated on the CPU (host) before kernel launch and stored in shared memory for faster access during kernel execution.

    This transformation guarantees the consistency of computation results despite the dynamic adjustments to grid and block sizes, making the elastic kernel design viable.
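
The sketch below illustrates, under stated assumptions, what such a transformation might look like for a trivial kernel: the injected preamble recovers logical block and thread indices from a global logical identifier, so the original body can be reused after substituting physical references with logical ones. This is a hand-written example of the pattern, not Miriam's actual transformer output; `scaleOriginal`, `scaleElastic`, and their parameters are illustrative.

```cuda
// Original kernel: index computed from the physical launch configuration.
__global__ void scaleOriginal(float* data, float s) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = data[idx] * s;
}

// Transformed elastic kernel: the logical launch shape is passed explicitly,
// so the physical grid/block sizes can be changed freely (sketch, assumed form).
__global__ void scaleElastic(float* data, float s,
                             int logical_grid, int logical_block) {
    int phys_threads  = gridDim.x * blockDim.x;
    int phys_id       = blockIdx.x * blockDim.x + threadIdx.x;
    int logical_total = logical_grid * logical_block;

    for (int lid = phys_id; lid < logical_total; lid += phys_threads) {
        // Injected preamble: logical equivalents of blockIdx.x / threadIdx.x.
        int logicalBlockIdx  = lid / logical_block;
        int logicalThreadIdx = lid % logical_block;
        // Original body, with physical references replaced by logical ones.
        int idx = logicalBlockIdx * logical_block + logicalThreadIdx;
        data[idx] = data[idx] * s;
    }
}
```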

4.2.2. Runtime Dynamic Kernel Coordination

This component is the online scheduler that manages the execution of elastic and critical kernels at runtime.

4.2.2.1. Execution Timeline and Scheduling Policy

When Miriam receives multiple normal task requests, their kernels are initially placed into a normal tasks queue and dispatched via multiple streams to the GPU. However, upon the arrival of a critical task:

  1. Miriam immediately monitors the available GPU resources (from both critical and already running elastic kernels).

  2. It then selects appropriate elastic kernel fragments from the subsequent normal kernel in a "bin-packing" manner. This means it tries to fit elastic kernel fragments into the available GPU resource slots left by the critical kernel without causing interference.

  3. Once the critical kernels complete execution, the kernels from normal tasks (including any remaining elastic kernel fragments) can then fully re-occupy the GPU.

    This greedy scheduling policy dynamically balances resource allocation. When an elastic kernel partially overlaps with a critical kernel, the coordinator ensures the padded elastic kernel does not interfere with the critical kernel while still maximizing resource usage. When an elastic kernel runs alone, it can utilize all available resources.
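
The following host-side sketch captures the structure of this greedy policy. The queue, shard, and resource-probe types are simplifying assumptions made for illustration; Miriam's real coordinator operates on profiled elastic kernel shards rather than these simplified counts.

```cuda
// Host-side sketch of the greedy coordination loop (structure only; not Miriam's API).
#include <cuda_runtime.h>
#include <queue>

struct KernelShard { int n_blocks; int block_size; };   // one elastic kernel fragment
struct Resources   { int free_sms; int free_threads_per_sm; };

void coordinate(std::queue<KernelShard>& normal_shards,
                std::queue<KernelShard>& critical_kernels,
                Resources (*probe)(),                     // query leftover GPU resources
                void (*launch)(const KernelShard&, cudaStream_t),
                cudaStream_t critical_stream, cudaStream_t normal_stream) {
    while (!critical_kernels.empty()) {
        // 1. Dispatch the next critical kernel immediately on its own stream.
        launch(critical_kernels.front(), critical_stream);
        critical_kernels.pop();

        // 2. Greedily bin-pack elastic shards into the resources it leaves idle.
        Resources r = probe();
        while (!normal_shards.empty() &&
               normal_shards.front().n_blocks   <= r.free_sms &&
               normal_shards.front().block_size <= r.free_threads_per_sm) {
            r.free_sms -= normal_shards.front().n_blocks;
            launch(normal_shards.front(), normal_stream);
            normal_shards.pop();
        }
    }
    // 3. Critical queue drained: remaining normal shards re-occupy the full GPU.
    while (!normal_shards.empty()) {
        launch(normal_shards.front(), normal_stream);
        normal_shards.pop();
    }
}
```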

4.2.2.2. Dynamic-sized Shaded Binary Tree for Elastic Kernel Shards Formation

To efficiently manage and select elastic kernels at runtime, Miriam introduces a dynamic-sized shaded binary tree structure.

Figure 7: Shaded Binary Tree Construction for Kernel Shards Formations. ES refers to the elastic kernel, and EBS refers to the elastic block size. The sharding degree represents the degree of elastic kernel splitting depth.

This structure abstracts the management of elastic kernel shards:

  • Tree Structure: The root represents the kernel from normal tasks with its initial grid size $M$. Each node in the tree corresponds to a part of the computation or potential thread blocks to be dispatched.

  • Shading Property: The shading property for each node indicates the elastic block size (EBS) of its thread block.

  • Directed Edges: These show potential sliced peers for unfinished computations if a predecessor shard is partially executed or evicted.

  • Actual vs. Virtual Shards: Actual shards are the elastic kernel shards that are eventually dispatched. Virtual shards are potential fragments that might not be dispatched.

  • Sharding Degree: This represents the splitting depth of the elastic kernel.

    The policy for selecting elastic fragments from normal kernels is to pick a set of elastic blocks from the head of this shaded kernel binary tree. These chosen elastic blocks are designed to share SM-level resources with co-locating thread blocks from resident critical kernels with minimal contention. The core principle is that elastic blocks from normal kernels should only use the leftover resources from the critical kernels, ensuring critical task performance.

This dynamic-sized shaded binary tree allows Miriam to adapt to runtime changes in resource availability and efficiently determine the optimal configuration for elastic kernel shards, leading to high runtime efficiency and low resource contention.
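
A minimal data-structure sketch of such a tree node might look as follows. The field layout is an assumption; the paper describes the structure conceptually rather than as code.

```cuda
// Sketch of one node of the shaded binary tree (assumed layout, illustrative only).
struct ShardNode {
    int grid_size;            // thread blocks covered by this node (root: original grid size M)
    int elastic_block_size;   // "shading" property: the EBS chosen for these thread blocks
    int sharding_degree;      // splitting depth from the root
    bool is_actual;           // actual shard (dispatched) vs. virtual shard (potential only)
    ShardNode* left;          // children: the two halves produced by one further split
    ShardNode* right;
    ShardNode* sliced_peer;   // directed edge to the shard that continues unfinished work
};
```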

5. Experimental Setup

5.1. Datasets

The authors built a new DNN inference benchmark called MDTB (Mixed-critical DNN Task Benchmarks) based on CUDA implementations of popular DNN models.

The MDTB benchmark simulates three common patterns for inference tasks from user requests:

  1. Arrival in uniform distribution: The client sends inference requests at a fixed frequency (e.g., 10 requests/second). This simulates critical applications like pose estimation.

  2. Arrival in Poisson distribution: This simulates event-driven applications such as obstacle detection, where events (and thus requests) arrive randomly over time but at a known average rate.

  3. Closed-loop workloads: The client continuously sends inference requests as soon as the previous one is completed, simulating maximum load.

    The MDTB benchmark uses six representative DNN models from both computer vision and language processing fields, all implemented in CUDA:

  • AlexNet [22]

  • SqueezeNet [15]

  • GRU (Gated Recurrent Unit) [7]

  • LSTM (Long Short-Term Memory) [14]

  • ResNet [13]

  • CifarNet [30]

    For evaluation, neural network inference is conducted with a single batch of $224 \times 224 \times 3$ images as input, mimicking real-world application scenarios.

The paper defines four specific MDTB workload configurations for evaluation, as summarized in Table 2:

The following are the results from Table 2 of the original paper:

| MDTB | A | B | C | D |
| --- | --- | --- | --- | --- |
| Critical Task | AlexNet | SqueezeNet | GRU | LSTM |
| Frequency | Closed-loop | Uniform (10 reqs/s) | Poisson (10 reqs/s) | Uniform (10 reqs/s) |
| Normal Task | CifarNet | AlexNet | ResNet | SqueezeNet |
| Frequency | Closed-loop | Closed-loop | Closed-loop | Closed-loop |

  • MDTB A: AlexNet as a closed-loop critical task co-running with CifarNet as a closed-loop normal task. This represents a high-contention scenario.

  • MDTB B: SqueezeNet as a uniform (10 reqs/s) critical task co-running with AlexNet as a closed-loop normal task.

  • MDTB C: GRU as a Poisson (10 reqs/s) critical task co-running with ResNet as a closed-loop normal task.

  • MDTB D: LSTM as a uniform (10 reqs/s) critical task co-running with SqueezeNet as a closed-loop normal task.

    These diverse workloads were chosen to effectively evaluate Miriam's performance and generalization across different DNN models, task criticalities, and arrival patterns.

5.2. Evaluation Metrics

The paper uses three key metrics to evaluate the performance of Miriam:

  1. End-to-end Latency of Critical Tasks:

    • Conceptual Definition: This metric quantifies the total time taken for a critical DNN inference task from its submission to the system until its completion. It is a direct measure of the real-time performance for tasks that have strict deadlines, focusing on how quickly critical tasks are serviced.
    • Mathematical Formula: Not explicitly provided in the paper, but conceptually, it's the duration from request initiation to result availability for a specific critical task.
    • Symbol Explanation: N/A (as no explicit formula given).
  2. Overall Throughput:

    • Conceptual Definition: This metric represents the total number of inference requests (from both critical and normal tasks) that the system can successfully serve within a given time period. It quantifies the system's efficiency and capacity to handle a high volume of DNN inferences.
    • Mathematical Formula: Not explicitly provided in the paper, but conceptually, it's the Total Number of Requests Served / Total Time.
    • Symbol Explanation: N/A (as no explicit formula given).
  3. Achieved Occupancy:

    • Conceptual Definition: Achieved occupancy is a measure of how effectively the Streaming Multiprocessors (SMs) on the GPU are utilized. It specifically quantifies the average ratio of active warps (groups of 32 threads executing in parallel) on an SM to the maximum number of active warps that the SM can support. A higher achieved occupancy generally indicates better GPU utilization and efficiency, as more warps are actively executing.
    • Mathematical Formula: The paper provides the following definition:

      $$Achieved\ Occupancy = \frac{Active\_warps / Active\_cycles}{MAX\_warps\_per\_SM}$$
    • Symbol Explanation:
      • $Active\_warps$: The number of warps that are actively executing instructions on an SM at a given time.
      • $Active\_cycles$: The number of clock cycles during which warps are actively executing instructions. The ratio $Active\_warps / Active\_cycles$ likely refers to the average number of active warps per cycle.
      • $MAX\_warps\_per\_SM$: The maximum number of warps that a single Streaming Multiprocessor (SM) can support concurrently. This is a hardware-defined limit.
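
As a worked restatement of the definition, a small helper that turns raw profiler counters into an achieved-occupancy value could look like the sketch below; the counter names are illustrative and not an NVIDIA profiler API.

```cuda
// Achieved occupancy from profiler counters (illustrative helper, assumed names).
double achievedOccupancy(double active_warps, double active_cycles,
                         double max_warps_per_sm) {
    double avg_active_warps = active_warps / active_cycles;   // Active_warps / Active_cycles
    return avg_active_warps / max_warps_per_sm;               // value in [0, 1]
}

// Example: an average of 32 active warps on an SM that supports 48 warps
// gives an achieved occupancy of about 0.67.
```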

5.3. Baselines

Miriam is compared against several existing DNN scheduling approaches on edge GPUs:

  1. Sequential:
    • Description: This baseline selects one model from the critical task queue and one from the normal task queue in a round-robin fashion, executing them one after another.
    • Rationale: This mode allows critical tasks to run independently, effectively monopolizing GPU resources when active. It is expected to yield the lowest end-to-end latency for critical tasks but at the cost of significantly reduced overall throughput because normal tasks cannot run concurrently.
  2. GPU Multi-stream with Priority:
    • Description: This approach enqueues kernels from both critical and normal tasks simultaneously into different GPU streams. The GPU's default scheduler then executes them in parallel, potentially giving preference to streams marked with higher priority (though CUDA stream priority mechanisms can be complex and platform-dependent). This is a common approach adopted by systems like NVIDIA Triton [3].
    • Rationale: This aims to increase concurrency and throughput by allowing tasks to run in parallel. However, without explicit contention management, it can lead to increased latency for critical tasks due to resource contention with normal tasks. A stream-priority sketch follows this list.
  3. Inter-stream Barrier (IB):
    • Description: This is described as a state-of-the-art multi-DNN operator scheduling method based on multi-stream. It uses inter-stream barriers (synchronization primitives) to manually synchronize kernel dispatch among different kernels or kernel groups.
    • Rationale: IB attempts to control concurrency by explicitly coordinating kernel execution across streams. This allows for more deliberate overlapping of operators or kernels. It is expected to perform better than simple Multi-stream in managing contention to some extent.
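
For reference, the multi-stream-with-priority baseline can be approximated with standard CUDA stream-priority calls, as sketched below. The usage is illustrative, and, as noted above, the actual preference given to higher-priority streams is platform-dependent; the kernel names in the comments are placeholders.

```cuda
// Sketch of the multi-stream-with-priority baseline (illustrative usage only).
#include <cuda_runtime.h>

int main() {
    int least, greatest;   // note: the "greatest" priority is the numerically smallest value
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t critical_stream, normal_stream;
    cudaStreamCreateWithPriority(&critical_stream, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&normal_stream,   cudaStreamNonBlocking, least);

    // Kernels from critical and normal tasks are enqueued on their respective
    // streams; the hardware scheduler may favor the higher-priority stream,
    // but there is no explicit contention management.
    // criticalKernel<<<grid_c, block_c, 0, critical_stream>>>(/* ... */);
    // normalKernel  <<<grid_n, block_n, 0, normal_stream  >>>(/* ... */);

    cudaStreamDestroy(critical_stream);
    cudaStreamDestroy(normal_stream);
    return 0;
}
```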

6. Results & Analysis

6.1. Core Results Analysis

The experiments evaluated Miriam against the baselines (Sequential, Multi-stream, and Inter-stream Barrier (IB)) under MDTB A-D workloads on two edge GPU platforms: NVIDIA GeForce RTX 2060 and NVIDIA Jetson AGX Xavier. The analysis focuses on end-to-end latency for critical tasks, overall throughput, and achieved occupancy.

6.1.1. Closed-loop Critical Tasks (MDTB A)

  • Workload: AlexNet (critical, closed-loop) co-running with CifarNet (normal, closed-loop), representing a high resource contention scenario.

  • Results (Figures 8a-8f):

    (Figure 8: critical-task latency, overall throughput, and average achieved occupancy on the RTX 2060 and Jetson Xavier platforms under each MDTB workload; Miriam shows the best latency and throughput, and its occupancy figures indicate more efficient resource utilization.)

    The above figures illustrate the performance comparison across different GPU scheduling approaches under various MDTB workloads.

    • Critical Task Latency (Figures 8a, 8b):

      • Compared to Sequential (which has the lowest critical task latency by design), Multi-stream increased latency by 1.95× on RTX 2060 and 2.02× on Xavier.
      • IB increased latency by 1.52× on 2060 and 1.77× on Xavier.
      • Miriam incurred only a 21% (2060) and 28% (Xavier) latency overhead for critical tasks. This demonstrates Miriam's effectiveness in preserving real-time performance for critical tasks even under heavy contention.
    • Overall Throughput (Figures 8c, 8d):

      • Miriam significantly improved overall throughput by 64% on 2060 and 83% on Xavier compared to Sequential.
      • Notably, IB's throughput performance was sometimes worse than Sequential's. This is attributed to frequent critical task launches requiring more synchronization barriers between GPU streams, leading to significant overhead.
      • Multi-stream also improved throughput but less effectively than Miriam while incurring higher critical task latency.
    • Achieved Occupancy (Figures 8e, 8f):

      • Miriam consistently led to higher SM-level GPU resource utilization (achieved occupancy) compared to other baselines. This indicates that Miriam effectively "fills in" idle GPU cycles with normal task computations without hurting critical tasks.
      • The paper notes that achieving nearly 100% theoretical occupancy is difficult for DNN inference tasks due to large thread blocks potentially causing resource idleness or SM incapacity to cover memory access latency.

6.1.2. Uniform/Poisson Critical Tasks (MDTB B, C, and D)

  • Workload: Scenarios with lower critical task launching frequencies (Uniform 10 reqs/s or Poisson 10 reqs/s) for SqueezeNet, GRU, and LSTM, co-running with closed-loop normal tasks like AlexNet, ResNet, and SqueezeNet.

  • Results:

    • As critical workload frequency decreases, all approaches generally show improved overall throughput due to more opportunities for normal tasks to share GPU resources.
    • Miriam maintained its superior performance:
      • On Xavier, Miriam increased overall throughput by 1.85× (MDTB B), 1.79× (MDTB C), and 1.91× (MDTB D) over Sequential.
      • Multi-stream and IB also yielded improved throughput (1.34× to 1.73×) but caused severe latency degradation for critical tasks (32% to 88%).
      • Miriam only incurred a latency overhead of less than 21% for these benchmarks, confirming its ability to maximize throughput while preserving critical task latency.
    • From a GPU utilization perspective, Miriam increased the average active warps per cycle, leading to better SM utilization. This validates the effectiveness of elastic kernel sharding and dynamic padding.
  • Observation on Jetson Xavier:

    • The performance improvements from Miriam sometimes did not directly translate to higher SM occupancy on Jetson Xavier. This is explained by Xavier's lower onboard resources, fewer SMs, relatively low memory bandwidth, and lower thermal design power compared to RTX 2060. These factors can limit clock speed, parallelism, and thus the direct relationship between SM occupancy and performance.

6.1.3. In-depth Analysis of Miriam (AlexNet Co-running)

To understand Miriam's performance under severe contention, an in-depth analysis was conducted with two AlexNet models co-running on an RTX 2060: AlexNet-C (critical) and AlexNet-N (normal), both in closed-loop.

Figure 9: (Upper) The active kernel-level timeliness of two co-running AlexNet models with mixed-criticality, profiled with the Nsight Systems tool. (Lower) The average achieved occupancy for each layer of the critical AlexNet.

  • Timelines (Figure 9, Upper): The figure (sketched from NVIDIA Nsight Systems profiles) shows Miriam (pink blocks for elastic kernels) successfully padding critical kernels (blue blocks) much more densely than Multi-stream (green blocks). This visualizes how Miriam's elastic kernel shards are tightly packed with critical kernels, leading to a significantly lower end-to-end latency for AlexNet-C in Miriam compared to Multi-stream.
  • Achieved Occupancy (Figure 9, Lower): The average layer-wise achieved occupancy for Miriam was 65.25%, significantly higher than 32.9% for Multi-stream. This confirms that Miriam achieves better SM utilization and parallelism by reducing contention overhead while preserving the speed of critical tasks.

6.1.4. Evaluations on Design Space Shrinking

Miriam's design space shrinking mechanism (Chapter 6.3) prunes 80% of definitely-slow cases using hardware limiters.

Figure 10: Shrinking the design space of elastic candidates for different DNN Models. Miriam picks up elastic kernels lying in the Pareto Frontier (for visualization) of the tradeoff space between the elasticized scale and the scheduling granularity.

Figure 10 illustrates the pruned space of candidate elastic kernels for different DNN models in MDTB. The pruning percentage ranges from 84% to 95.2%. This indicates that the constraints (WIScore and OScore) effectively reduce the search space for elastic kernel configurations. The authors ensure that this pruning does not remove optimal configurations by testing representative tensor operations and finding that the top 20% of candidates always included the best-performing parameters. The Pareto Frontier is used to visualize the tradeoff between elasticized scale (depth of the dynamic shaded binary tree) and scheduling granularity, guiding the selection of optimal elastic kernels.

6.1.5. Case Study: Autonomous Driving with LGSVL

A real-world trace from the LGSVL [16] autonomous driving platform was used as a workload: ResNet for obstacle detection (critical, 10 Hz uniform) and SqueezeNet for pose estimation (normal, 12.5 Hz uniform) on an RTX 2060.

(Figure 11: results on the RTX 2060 under the LGSVL workload, where (a) shows the normalized end-to-end latency of the critical task, (b) the overall throughput, and (c) the achieved occupancy; Miriam outperforms the other baselines on latency and throughput.)

The above figures illustrate the performance comparison of different scheduling schemes with LGSVL simulated workloads.

Figure 12: Real-world trace collected from the LGSVL simulator, where (a) is the object detection result based on image data, (b) is the result with lidar point cloud data, and (c) depicts the setting of the collected trace.

  • Results (Figure 11):
    • Multi-stream and IB increased overall throughput by 1.41× and 1.25× respectively, but amplified critical task latency by 82% and 56%.
    • Miriam achieved an 89% improvement in overall throughput compared to Sequential, while only incurring an 11% latency overhead for the critical task.
    • This strong performance is attributed to the relatively low launching frequency of both critical and normal tasks, allowing elastic kernels to execute concurrently with minimal eviction overhead for elastic kernel shards.
    • Figure 11(c) and Figure 12(c) show Miriam's high SM occupancy among all baselines, confirming its efficient resource utilization in this realistic scenario.

6.2. System Overhead

The scheduling overhead of Miriam consists of two main parts:

  1. Runtime Elastic Kernel Shards Selection: This involves scanning shard candidates. The complexity is $O(N)$, where $N$ is the number of candidates. The average overhead for serving each DNN model was found to be less than 0.35 ms.

  2. Launch Time Overhead for Critical Kernels: This overhead is due to the padding of elastic kernels. In over 80% of cases, this overhead was less than 15 μs. The paper notes that this latency overhead is primarily due to contention on the texture cache and L2 memory, which is left for future work.

    These overheads are considered minimal, demonstrating Miriam's practicality for real-time edge deployments.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper proposes Miriam, a novel system designed to address the challenges of latency and throughput in co-running multiple DNN inference tasks on resource-limited edge GPUs. Miriam introduces the concept of elastic kernels, which enable fine-grained GPU resource remapping, and a runtime dynamic kernel coordinator that intelligently schedules these elastic kernels alongside critical tasks. Key contributions include the elastic-kernel generator (with elastic blocks, elastic grids, and a source-to-source transformer for computational consistency) and the dynamic coordinator (using a greedy scheduling policy and shaded binary trees). Experimental results on a custom MDTB benchmark across two edge GPU platforms demonstrate that Miriam significantly improves overall system throughput (up to 92%) while maintaining minimal latency overhead (less than 10%) for critical tasks compared to state-of-the-art baselines.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Scalability for General Multi-DNN Tasks: While Miriam shows promise for pair-wise DNN tasks, scaling it to support a larger and more general number of concurrently running DNN tasks requires further consideration. This includes developing robust scheduling policies for normal tasks that share the same priority and finding efficient ways to perform offline kernel profiling as the design space grows exponentially with more co-running kernels.
  • Integration with DNN Compilers: The paper suggests integrating Miriam with existing DNN compilers like TVM [5]. While TVM can generate high-performance kernels, its auto-tuning and compilation are offline processes, making it difficult to adapt to dynamic runtime scenarios on edge devices. Miriam could serve as a post-compiling runtime to bridge this gap, ensuring full resource utilization adaptively.
  • Orthogonality to Other Approaches: The authors state that Miriam can work symbiotically with other DNN optimization approaches such as model compression [23] and edge-cloud offloading [41]. This implies that Miriam itself does not perform these optimizations but could enhance their effectiveness when combined, leading to improved runtime performance and resource utilization in resource-constrained edge computing environments.
  • L2 Memory/Texture Cache Contention: The observed launch time overhead for critical kernels due to padding (less than 15 μs) is mainly attributed to contention on the texture cache and L2 memory. This specific area of contention is left for future investigation to further minimize overhead.

7.3. Personal Insights & Critique

Miriam presents a highly innovative and practical solution for a critical problem in edge AI – managing mixed-criticality multi-DNN inference on resource-limited GPUs. The core concept of elastic kernels is particularly insightful, moving beyond static kernel configurations to a dynamic, contention-aware resource allocation strategy. This software-centric approach, avoiding driver modifications or reliance on server-grade hardware features, makes it widely applicable to current edge platforms.

The rigorous experimental validation on diverse DNN workloads and realistic edge GPU platforms (RTX 2060 and Jetson AGX Xavier) lends strong credibility to the proposed framework. The detailed breakdown of intra-SM and inter-SM contention, followed by specific mechanisms (elastic blocks, elastic grids, design space shrinking) to address them, showcases a deep understanding of GPU architecture and performance bottlenecks. The source-to-source kernel transformer is a clever solution to a fundamental challenge of kernel modification, ensuring computational correctness.

One area for potential improvement or further exploration could be a more detailed analysis of the WIScore formula, specifically the term $S_{blk\_be} + S_{blk\_be}$, which appears to be a minor typo in the paper. Clarifying its intended meaning would enhance the rigor. Additionally, while the paper mentions L2 memory and texture cache contention as sources of overhead, a deeper dive into how elastic kernel shards might interact with these specific cache hierarchies and potential mitigation strategies would be valuable.

The idea of combining Miriam with other DNN optimization techniques like model compression or edge-cloud offloading is powerful. This orthogonality suggests Miriam could serve as a foundational runtime coordination layer in a holistic edge AI system, optimizing resource utilization across various dimensions. The framework's ability to adapt to dynamic runtime conditions is crucial for IoT applications where on-device resources can fluctuate. This research provides a significant step towards enabling robust and efficient real-time AI at the edge.
