
Turbocharge workloads with new multi-instance NVIDIA GPUs on GKE

April 27, 2021
Maulin Patel

Group Product Manager, Google Kubernetes Engine

Pradeep Venkatachalam

Software Engineer, GCP

Developers and data scientists are increasingly turning to Google Kubernetes Engine (GKE) to run demanding workloads such as machine learning, visualization/rendering and high-performance computing, leveraging GKE’s support for NVIDIA GPUs. GKE brings flexibility, autoscaling and management simplicity, while GPUs bring superior processing power. Today, we are launching support for multi-instance GPUs in GKE (currently in Preview), which will help you get more value from your GPU investments.

Open-source Kubernetes allocates one full GPU per container, even if the container needs only a fraction of the GPU for its workload. This can lead to wasted resources and cost overruns, especially if you are using the latest generation of powerful GPUs. It is a particular concern for inference workloads, which process only a handful of samples in real time (in contrast, training workloads process millions of samples in large batches). Thus, for inference and other lightweight GPU workloads, GPU sharing is essential to improve utilization and lower costs.

With the launch of multi-instance GPUs in GKE, you can now partition a single NVIDIA A100 GPU into up to seven instances that each have their own high-bandwidth memory, cache and compute cores. Each instance can be allocated to one container, for a maximum of seven containers per NVIDIA A100 GPU. Further, multi-instance GPUs provide hardware isolation between containers, and consistent and predictable QoS for all containers running on the GPU.

Add to that the fact that A2 VMs, Google Cloud’s largest GPU-based Compute Engine instances, support up to 16 A100 GPUs per instance. That means you can have up to 112 schedulable GPU instances per node (16 GPUs with seven instances each), where each can run one independent workload. By leveraging GKE’s industry-leading autoscaling and auto-provisioning capabilities, multi-instance GPUs can be automatically scaled up or down, offering superior performance at lower costs.

For CUDA® applications, multi-instance GPUs are largely transparent. Each GPU instance appears as a regular GPU resource, and the programming model remains unchanged, making multi-instance GPUs easy and convenient to use.

What customers are saying

Early adopters of multi-instance GPU nodes are using the technology to turbocharge their use of GKE for demanding workloads. Betterview, a provider of property insight and workflow tools for the insurance sector, uses GKE and NVIDIA GPUs to process aerial imagery.


"The multi-instance GPU architecture with A100s evolves working with GPUs in Kubernetes/GKE. By reducing the number of configuration hoops one has to jump through to attach a GPU to a resource, Google Cloud and NVIDIA have taken a needed leap to lower the barrier to deploying machine learning at scale. Alongside reduced configuration complexity, NVIDIA’s sheer GPU inference performance with the A100 is blazing fast. Partnering with Google Cloud has given us many exceptional options to deploy AI in the way that works best for us." - Jason Janofsky, VP Engineering & CTO, Betterview

Creating multi-instance GPU partitions

The A100 GPU consists of seven compute units and eight memory units, which can be partitioned into GPU instances of varying sizes, providing the flexibility and choice you need to scale your workloads. For example, you can create two GPU instances with 20GB of memory each, three instances with 10GB each, or seven with 5GB each.

GPU partition sizes use the following syntax: [compute]g.[memory]gb. For example, a partition size of 1g.5gb refers to a GPU instance with one compute unit (one-seventh of the streaming multiprocessors on the GPU) and one memory unit (5GB). The partition size for A100 GPUs can be specified through the GKE cluster or node pool API.
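To make the partition arithmetic concrete, here is a minimal Python sketch that parses a [compute]g.[memory]gb string and estimates how many instances of that size fit on one A100. It uses only the figures quoted above (seven compute units, eight memory units, 5GB per memory unit). The function names are hypothetical, and the "whichever resource runs out first" rule is a simplifying assumption for equally sized partitions, not a description of how GKE or the NVIDIA driver actually places instances.

```python
import re
from typing import NamedTuple

# Figures quoted in the text for a single A100 GPU.
COMPUTE_UNITS_PER_GPU = 7
MEMORY_UNITS_PER_GPU = 8
GB_PER_MEMORY_UNIT = 5


class PartitionSize(NamedTuple):
    compute_units: int
    memory_gb: int


def parse_partition_size(size: str) -> PartitionSize:
    """Parse a '[compute]g.[memory]gb' string such as '1g.5gb'."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", size)
    if match is None:
        raise ValueError(f"not a valid partition size: {size!r}")
    part = PartitionSize(int(match.group(1)), int(match.group(2)))
    if part.compute_units < 1 or part.memory_gb < GB_PER_MEMORY_UNIT:
        raise ValueError(f"partition size too small: {size!r}")
    return part


def instances_per_gpu(size: str) -> int:
    """Estimate how many equally sized instances fit on one GPU.

    Simplifying assumption: the count is limited by whichever of the
    compute units or memory units is exhausted first.
    """
    part = parse_partition_size(size)
    memory_units = part.memory_gb // GB_PER_MEMORY_UNIT
    return min(
        COMPUTE_UNITS_PER_GPU // part.compute_units,
        MEMORY_UNITS_PER_GPU // memory_units,
    )


def instances_per_node(size: str, gpus_per_node: int) -> int:
    """Schedulable GPU instances on a node with identical GPUs."""
    return instances_per_gpu(size) * gpus_per_node
```

Under these assumptions, instances_per_gpu returns 7 for "1g.5gb", 3 for "2g.10gb" and 2 for "3g.20gb", matching the examples above, and instances_per_node("1g.5gb", 16) returns 112 for a node with 16 A100 GPUs.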

Deploying containers on a multi-instance GPU node

You can deploy at most one container per GPU instance on a node. With a partition size of 1g.5gb, a node with one A100 GPU has seven GPU partitions available. As a result, you can deploy up to seven containers that request GPUs on this node.

Each node is labeled with the size of its available GPU partitions. This labeling allows workloads to request the right-sized GPU instances through node selectors or node affinity.
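As an illustration of requesting a partition through a node selector, the Python sketch below builds a Pod manifest as a plain dictionary and prints it as JSON, which kubectl accepts alongside YAML. The post does not name the label key or the GPU resource; cloud.google.com/gke-gpu-partition-size and nvidia.com/gpu are assumptions based on GKE's documented conventions, so check them against the current GKE documentation. The function name, Pod name and image are hypothetical placeholders.

```python
import json

# Assumptions, not stated in the post: the node label key and the
# extended resource name follow GKE's documented conventions.
PARTITION_LABEL = "cloud.google.com/gke-gpu-partition-size"
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_pod_manifest(name: str, image: str, partition_size: str) -> dict:
    """Build a Pod manifest that requests one GPU instance of a given size."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Schedule only onto nodes labeled with this partition size.
            "nodeSelector": {PARTITION_LABEL: partition_size},
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # One container maps to one GPU instance.
                    "resources": {"limits": {GPU_RESOURCE: 1}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical Pod name and image, for illustration only.
    manifest = gpu_pod_manifest(
        "inference-demo", "example.com/inference:latest", "1g.5gb"
    )
    print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with kubectl apply -f would be one way to try this; a Deployment with up to seven such replicas would fill a single A100 partitioned as 1g.5gb.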

Getting started 

Now, with multi-instance GPUs on GKE, you can easily match your workload acceleration needs with right-sized resources. Moreover, you can use GKE to automatically scale the infrastructure to serve those needs efficiently, delivering a better user experience while minimizing operational costs. Get started today!
