Mitigating transient load effects on ML serving latency: Concepts

This document describes best practices for deploying machine learning (ML) models in Google Kubernetes Engine (GKE). The document is intended for organizations that want to create a centralized shared-service platform to serve their ML models using graphics processing units (GPUs), with a specific focus on meeting service level objectives (SLOs) and cost effectiveness. The typical reader is an ML engineer and is expected to have a basic understanding of Kubernetes and ML model serving.


Terminology

  • Graphics processing unit (GPU): a specialized processor designed for math-intensive applications such as graphics rendering, machine learning, and scientific computing.
  • Queries per second (QPS): a measurement of load. For ML models, QPS measures how many requests the model receives.
  • Latency: a common measurement of performance in an online serving platform. Latency is the time that a model request takes to return a response, measured at the model server endpoint.
  • Service level indicator (SLI): an instance of a measurement. In ML, for example, an SLI can be whether the latency measurement of an inference request is under 50 ms. In this case, a pass response is at or under 50 ms.
  • Service level objective (SLO): the minimum percentage of SLI measurements that must pass the SLI requirement. Using the preceding SLI, an SLO could be that 99.9% or more of all requests are answered within 50 ms. This SLO means that out of 1,000 inference requests, no more than 1 can exceed 50 ms of latency.
  • Transient load spike: a sudden spike of incoming load beyond the capacity that's currently deployed. For example, for Cloud Storage the workload effectively hits a limit as GKE adds additional nodes.
  • NVIDIA Triton Inference Server® (Triton): a general ML model server developed by the NVIDIA Corporation that supports multiple frameworks on both GPU and CPU. This guide focuses on Triton as the model server.
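The relationship between the SLI and the SLO defined above can be sketched in a few lines of Python. This is a minimal illustration, not part of any monitoring product: the function names and the sample latencies are hypothetical, and the 50 ms threshold and 99.9% target are the example values from the definitions.

```python
def slo_compliance(latencies_ms, threshold_ms=50.0):
    """Return the fraction of requests whose latency passes the SLI."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    # A measurement passes the SLI when it is at or under the threshold.
    passed = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return passed / len(latencies_ms)


def meets_slo(latencies_ms, threshold_ms=50.0, target=0.999):
    """Return True when the share of passing requests reaches the SLO target."""
    return slo_compliance(latencies_ms, threshold_ms) >= target


# Hypothetical sample: 999 fast requests and one slow request out of 1,000.
sample = [20.0] * 999 + [75.0]
print(slo_compliance(sample))  # 0.999
print(meets_slo(sample))       # True
```

With a second slow request in the same 1,000 measurements, the compliance drops to 99.8% and the SLO is missed.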

Using GPUs to serve ML models

As organizations deploy more complex and deep ML models, GPUs are becoming increasingly prevalent in both model training and model inference. Compared to CPUs, GPUs can often perform large-scale training faster and run inference with shorter latency—particularly for deep and complex models. When processing these models at scale, GPUs can run more cost-effectively than CPUs. A shared-service inference platform built on top of GKE can economically serve most ML models.

Using GPUs instead of CPUs to serve ML models in GKE requires a few additional considerations:

  1. Individual GPUs have a high processing capacity. If resources aren't properly allocated when working with a group of GPUs, some GPUs can idle. Idle GPUs waste resources and add cost.
  2. GKE maximizes resource use by scheduling Pods from the same or different deployments on a given node. This is possible because CPU and system memory are shareable resources. However, as of Q2-2021, GPUs aren't shareable across multiple Pods. Once a Pod is deployed on a node with a GPU and owns the GPU, that Pod's model serving process is responsible for optimizing the GPU's resource use.
  3. ML model load, measured in QPS, can fluctuate. Slow fluctuations can be hourly load fluctuations when QPS increases as users come into work and decreases as they leave. Quick fluctuations can be ML model transient load spikes, where QPS rises unexpectedly and quickly. These spikes particularly happen when the ML model serves an upstream application, a partner, or a customer that you have no visibility into.
  4. Stringent SLOs are another constraint to using GPUs to serve ML models in GKE. A possible SLO could be that 99.9% of predictions must return in 50 ms or less. This rigor, combined with models that have unexpected spikes (#3) and have a slow scaling process for adding nodes (#2), means your system needs to have excess compute capacity available.
  5. Excess compute capacity (#4) adds cost. Be judicious about how you deploy your excess capacity so that you meet your SLO with as little of it as possible.

Identifying model importance and temperature

In most organizations, some ML models are more important than others, and some models are more sensitive to latency than others. The following diagram shows a priority matrix that ranks ML models relative to business importance and latency.

Priority matrix showing model types in quadrants. It groups ML models into the following buckets: more business critical, more latency sensitive, less business critical, and less latency sensitive.

Figure 1: Model importance

This article defines Tier-1 models, as shown in the previous diagram, as models that are both business critical and latency sensitive. Tier-1 models must absolutely, without reasonable exceptions, meet a latency SLO. Models that aren't deemed Tier 1 are considered Tier 2.

It's common for organizations to have a so-called long-tailed model distribution. A few models are highly active (hot), progressively more models are less active (warm), and most models are rarely active (cold). Just because a model is hot doesn't necessarily mean that it's Tier 1. Telemetry use cases usually involve high volume, but not all of them are latency sensitive. In fact, whether a model is hot, warm, or cold doesn't determine its tier.

Grouping single or multiple GPU node types

As mentioned earlier, GPUs aren't shareable across multiple Pods. It's possible, however, for multiple Pods, each requiring one or more GPUs, to be deployed on the same node. On the left side of the following diagram, nodes are configured with only one GPU—each replica of Triton is deployed on its own node and GPU (A). The center of the diagram shows a node with multiple GPUs (B). While multiple replicas of Triton can be deployed into a node, the GPUs in this node can only be accessed by one Triton replica at any given time. The dotted red lines show GPU ownership for specific Triton replicas. The right side of the diagram shows two node types, meaning two separate node pools (C). By using the A2 VM with A100 GPUs instead of assigning multiple T4 GPUs to a Triton replica, this organization can accommodate Triton models that might require higher processing or memory requirements.

Single and multiple GPU node types grouped by node type.

Figure 2: Single and multiple GPU node types

Unless you have a specific need to deploy nodes with multiple GPUs, such as to serve models that are so large that they exceed the memory of a single GPU, you should consider using nodes with one GPU each. This deployment generally provides the following benefits:

  • Higher availability: If one node fails, a smaller portion of compute resources is taken out of service. This is important if you plan to use preemptible nodes to reduce cost, because preemptible nodes are guaranteed to be preempted within 24 hours.
  • Better resource use: Smaller increments let you better tailor your resources to the needs of the actual load.

This document focuses on the one-GPU-per-node topology.

Defining deployment strategies

This section discusses three deployment strategies from simple to complex, along with their advantages and disadvantages. This discussion is based on a scenario where eight models are served and two models are Tier-1. The model stacking with priority deployment uses NVIDIA's Triton Inference Server to take advantage of compute unified device architecture (CUDA) prioritization.

One model per deployment

Using one model per deployment is a straightforward way to deploy models to GKE. In this scenario, there are eight models and eight deployments. Because Tier-1 models must perform against an SLO, they need to be highly available and configured with enough compute headroom to accommodate load spikes. To delay autoscaling, set a higher autoscaling trigger for Tier-2 models that don't need to meet strict SLOs.

The following diagram depicts the implementation of one model per deployment with the following rules: Tier-1 models are set to autoscale earlier at 50% GPU duty cycle and have at least two replicas. Tier-2 models can operate with only one replica and only scale at 80% duty cycle.

One model per deployment diagram showing eight deployments and 12 nodes.

Figure 3: One model per deployment
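The autoscaling rules described above can be sketched as follows. The sketch assumes that each deployment scales with the Kubernetes Horizontal Pod Autoscaler on a GPU duty cycle metric, and it uses the autoscaler's documented replica formula while ignoring its tolerance and stabilization settings. The function name and the sample duty cycles are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_duty_cycle, target_duty_cycle,
                     min_replicas=1):
    """Approximate the Horizontal Pod Autoscaler replica calculation.

    desired = ceil(current_replicas * current_metric / target_metric),
    never dropping below the configured minimum number of replicas.
    """
    desired = math.ceil(current_replicas * current_duty_cycle / target_duty_cycle)
    return max(desired, min_replicas)


# Tier-1 model: 50% trigger, at least two replicas, currently at 70% duty cycle.
print(desired_replicas(2, 70, 50, min_replicas=2))  # 3

# Tier-2 model: 80% trigger, one replica, same 70% duty cycle. No scale-out yet.
print(desired_replicas(1, 70, 80, min_replicas=1))  # 1

# Tier-1 model at a quiet 20% duty cycle stays at its two-replica minimum.
print(desired_replicas(2, 20, 50, min_replicas=2))  # 2
```

At the same 70% duty cycle, the Tier-1 deployment adds a replica while the Tier-2 deployment waits, which is the intended effect of the higher Tier-2 trigger.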


Advantages

  • The one-model-per-deployment strategy provides the highest level of isolation. A noisy neighbor model prone to load spiking can't negatively affect the performance of other models running on the same cluster.
  • The one-model-per-deployment strategy is the simplest to manage. You can use Kubernetes to natively manage each model without the risk of affecting another model—each model is its own deployment.


Disadvantages

  • The one-model-per-deployment strategy can be cost effective for models that can consume multiple GPUs, like Model 5 in the preceding diagram. For warm and cold models, however, one model per deployment becomes costly with many underused GPUs.
  • The one-model-per-deployment strategy can expose Tier-1 models that have large transient load spikes to SLO violations. With a 50% autoscaling trigger, Models 1 and 2 can spike to about 2x the load at which autoscaling starts before they exceed the deployed capacity. If these models frequently experience transient spikes greater than 2x, you might have to lower the autoscaling trigger. Although lowering the autoscaling trigger gives you more headroom for load spikes, it also increases your operational cost through unused capacity.
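The trade-off between spike headroom and unused capacity can be made concrete with a small calculation. This is a simplified sketch that assumes load scales linearly with GPU duty cycle and that no new capacity arrives during the spike. The function names are hypothetical.

```python
def spike_headroom(trigger_percent):
    """Return how many times the load at the trigger fits in deployed capacity."""
    if not 0 < trigger_percent <= 100:
        raise ValueError("trigger must be in the range (0, 100]")
    return 100 / trigger_percent


def idle_capacity_at_trigger(trigger_percent):
    """Return the percentage of deployed capacity that is unused at the trigger."""
    return 100 - trigger_percent


for trigger in (80, 50, 40, 25):
    print(f"trigger {trigger}%: "
          f"headroom {spike_headroom(trigger):.2f}x, "
          f"idle capacity {idle_capacity_at_trigger(trigger)}%")
# trigger 80%: headroom 1.25x, idle capacity 20%
# trigger 50%: headroom 2.00x, idle capacity 50%
# trigger 40%: headroom 2.50x, idle capacity 60%
# trigger 25%: headroom 4.00x, idle capacity 75%
```

Under these assumptions, tolerating a 4x spike with this strategy means leaving about three quarters of the deployed GPU capacity idle at the trigger point.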

Use one model per deployment when your organization has only a few hot models that have infrequent transient spikes.

Model stacking by tier

Model stacking by tier separates models into deployments by tiers. The following diagram shows two deployments: Deployment 1 is for Tier 1. Deployment 2 is for Tier 2. Models 1 and 2 are stacked in Deployment 1 with an autoscaling trigger of 50%. The remaining Tier-2 models are stacked in Deployment 2 with an autoscaling trigger of 80%.

Model stacking by tiers showing two deployments and seven nodes.

Figure 4: Model stacking by tier


Advantages

  • Model stacking by tier greatly reduces the compute footprint when compared to one model per deployment. It does so by combining warm and cold models to reduce unused compute capacity.
  • Stacking multiple models together into a deployment also distributes spikes from individual models across more nodes. As long as the loads from the stacked models aren't correlated (for example, they don't come from the same applications that activate together), it's unlikely that some or all models will spike at the same time.


Disadvantages

  • As with the one-model-per-deployment strategy, the risk associated with model stacking by tier is that noisy models with large transient load spikes could exhaust the available resources in the deployment. With stacked models the risk is greater, because a noisy neighbor that exceeds the deployment's limits degrades every other model in the same deployment.
  • Unlike the one-model-per-deployment strategy, model stacking by tier can't use Kubernetes to natively manage individual models.

Use model stacking by tier when you have many warm and cold Tier-2 models that can benefit from consolidation to reduce the compute footprint.

Model stacking with priority

Model stacking with priority takes advantage of Triton Inference Server's ability to leverage CUDA stream priority for models marked PRIORITY_MAX. Unlike model stacking by tier, model stacking with priority combines a Tier-1 model with Tier-2 models in a deployment. Ideally the Tier-2 models have a much larger combined nominal load than the Tier-1 model. The following diagram shows two deployments. Each deployment has one Tier-1 model stacked with one or more Tier-2 models. The nominal load for the Tier-1 models (Models 1 and 2) uses 15–20% of their deployment's capacity. To increase resource use, the autoscaling triggers for both deployments are set at 80%.

Model stacking with priority showing two deployments and six nodes.

Figure 5: Model stacking with priority

The following diagram illustrates three scenarios if there is a transient load spike.

  1. Tier-1 models experience a moderate transient load spike. Combined capacity requirements for the Tier-1 and Tier-2 models are within the capacity of the nodes. Autoscaling is triggered, so GKE adds one or more additional nodes. The models in both tiers operate normally.
  2. Tier-1 models experience a severe transient load spike. Combined capacity requirements for the Tier-1 and Tier-2 models exceed the capacity of the nodes. Autoscaling is triggered, so GKE adds one or more additional nodes. The Tier-1 models are given priority to resources and operate at full capacity. The Tier-2 models are restricted until GKE finishes deploying nodes.
  3. Tier-2 models experience a severe transient load spike. Combined capacity requirements for the Tier-1 and Tier-2 models exceed the capacity of the nodes. Autoscaling is triggered, so GKE adds one or more additional nodes. The Tier-1 models are given priority to resources and operate at full capacity. The Tier-2 models are restricted until GKE finishes deploying nodes.

Transient load spikes distributed based on three different load priorities.

Figure 6: Transient load spiking with load priority
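The three scenarios can be approximated with a toy capacity model. This sketch assumes strict priority, where Tier-1 demand is always served first and Tier-2 demand receives whatever capacity remains. Real CUDA stream prioritization in Triton is less absolute than this, and the function name and load numbers are hypothetical.

```python
def allocate(capacity, tier1_demand, tier2_demand):
    """Split deployed capacity between tiers, serving Tier-1 demand first."""
    tier1_served = min(tier1_demand, capacity)
    # Tier-2 models can use only the capacity that Tier-1 leaves unused.
    tier2_served = min(tier2_demand, capacity - tier1_served)
    return tier1_served, tier2_served


CAPACITY = 100  # Deployed capacity in arbitrary load units.

scenarios = {
    "1. moderate Tier-1 spike": (35, 60),
    "2. severe Tier-1 spike": (70, 60),
    "3. severe Tier-2 spike": (20, 120),
}

for name, (tier1_demand, tier2_demand) in scenarios.items():
    tier1_served, tier2_served = allocate(CAPACITY, tier1_demand, tier2_demand)
    print(f"{name}: Tier-1 served {tier1_served}/{tier1_demand}, "
          f"Tier-2 served {tier2_served}/{tier2_demand}")
# 1. moderate Tier-1 spike: Tier-1 served 35/35, Tier-2 served 60/60
# 2. severe Tier-1 spike: Tier-1 served 70/70, Tier-2 served 30/60
# 3. severe Tier-2 spike: Tier-1 served 20/20, Tier-2 served 80/120
```

In all three cases the Tier-1 demand is served in full. Only the Tier-2 models are throttled while GKE adds nodes.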


Advantages

  • Model stacking with priority provides a much higher transient load spiking tolerance than the previous two strategies. This is particularly true when the average load from the Tier-1 models is significantly smaller than the average load from the Tier-2 models. Because the Tier-2 models are sacrificed during a severe transient load spike, the Tier-1 models can spike to many times their average load before exceeding their limits.
  • Because the Tier-1 models operate with a large headroom to spike, all deployments can operate at a higher autoscale trigger, increasing resource use.


Disadvantages

  • Transient load spike headroom for Tier-1 models depends on the load from Tier-2 models. When the load from most or all Tier-2 models disappears, GKE scales down appropriately. If the Tier-1 load now represents most of the load in this deployment, any transient spike is much more likely to cause the Tier-1 model to exceed its limits. To help reduce the risk, deploy multiple non-correlated Tier-2 models together so their combined load doesn't drop at the same time. Or, deploy a Tier-2 model that you know always has load.
  • As models change in load over time, they can both grow or shrink. You might end up with a particular deployment where a Tier-1 model has grown significantly more than any other model in that deployment. A transient load spike in this model might not be sufficiently absorbed by the now-diminished Tier-2 capacity. Therefore, over time you might need to redistribute some of your models across the different Triton deployments to ensure a good balance of Tier-1 and Tier-2 models.
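One way to notice when a deployment has drifted out of balance is to track the share of its load that comes from Tier-1 models. The following sketch is only an illustration: the 30% threshold, the function names, and the load figures are hypothetical and would need tuning against your own traffic.

```python
def tier1_share(tier1_load, tier2_load):
    """Return the fraction of a deployment's load that comes from Tier-1 models."""
    total = tier1_load + tier2_load
    return tier1_load / total if total else 0.0


def needs_rebalancing(deployments, max_tier1_share=0.3):
    """Return names of deployments whose Tier-1 share exceeds the threshold.

    `deployments` maps a deployment name to (tier1_load, tier2_load) in QPS.
    The default threshold is an illustrative assumption, not a recommendation.
    """
    return [
        name
        for name, (tier1_load, tier2_load) in deployments.items()
        if tier1_share(tier1_load, tier2_load) > max_tier1_share
    ]


# Hypothetical average QPS per deployment: (Tier-1 load, Tier-2 load).
observed = {
    "deployment-1": (150, 850),  # Tier-1 is 15% of the load.
    "deployment-2": (400, 300),  # Tier-1 has grown to about 57% of the load.
}
print(needs_rebalancing(observed))  # ['deployment-2']
```

A flagged deployment is a candidate for moving Tier-2 models in or splitting the Tier-1 model out, as described in the preceding list.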

Use model stacking with priority when you have a few Tier-1 models that can have large transient load spikes and you have an accompanying Tier-2 model load that is large enough to accommodate those spikes.

Image caching and pause Pods

A time-consuming step of spinning up a new Pod on a new node is downloading the container image, especially if the image is large. Depending on the backends that you include, an NVIDIA Triton container image can be as large as 10 GB. One way to accelerate the download is to cache the image in Container Registry and set imagePullPolicy to IfNotPresent.

Unlike provisioning CPU nodes, provisioning GPU nodes in GKE requires the extra step of installing the related GPU drivers. Combined with the size of the Triton image, this step can make provisioning a new GPU node take up to nine minutes, delaying the availability of resources.

Pause Pods are Pods from a low-priority deployment that occupy nodes in a node pool while awaiting preemption. Effectively, pause Pods act like hot spares for high-priority Pods. Deploying the same container image that the high-priority Pods use into the pause Pods is like prestaging images. Prestaging images reduces image pull times.

11 pause Pods in three separate deployments.

Figure 7: Pause Pods
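A pause Pod deployment can be sketched as two Kubernetes manifests: a low-value PriorityClass and a Deployment that references it. The following Python snippet builds both as dictionaries and prints them as JSON, which kubectl accepts. Every name, the image placeholder, the priority value of -1, the `sleep infinity` command, and the one-GPU limit are assumptions made for illustration. Adjust them to your cluster and to the image that your high-priority Pods use.

```python
import json

# Hypothetical names and placeholder values, chosen for illustration only.
PAUSE_PRIORITY_CLASS = "pause-pods"
TRITON_IMAGE = "REGISTRY/PROJECT/tritonserver:TAG"  # placeholder, not a real image

# Low-priority class so that the scheduler preempts pause Pods first.
priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": PAUSE_PRIORITY_CLASS},
    "value": -1,  # illustrative value below the default Pod priority of 0
    "globalDefault": False,
    "description": "Low priority for pause Pods that act as hot spares.",
}


def pause_deployment(name, replicas, image):
    """Build a Deployment whose Pods idle on GPU nodes with the image prestaged."""
    labels = {"app": name}
    container = {
        "name": "pause",
        "image": image,
        # Reuse a locally cached image instead of pulling it again.
        "imagePullPolicy": "IfNotPresent",
        # Assumes the image provides a `sleep` binary so that the Pod idles.
        "command": ["sleep", "infinity"],
        # One GPU per Pod, matching the one-GPU-per-node topology.
        "resources": {"limits": {"nvidia.com/gpu": 1}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": PAUSE_PRIORITY_CLASS,
                    "containers": [container],
                },
            },
        },
    }


deployment = pause_deployment("triton-pause", replicas=2, image=TRITON_IMAGE)
for manifest in (priority_class, deployment):
    print(json.dumps(manifest, indent=2))
```

Because the pause Pods run the same image as the serving Pods, a serving Pod that preempts one of them finds the image already present on the node.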

What's next?