This page provides instructions on how to increase utilization and reduce costs by running multi-instance GPUs. With this configuration, you partition an NVIDIA A100 or H100 graphics processing unit (GPU) to share a single GPU across multiple containers on Google Kubernetes Engine (GKE).
Before reading this page, ensure that you're familiar with Kubernetes concepts such as Pods, nodes, Deployments, and namespaces, and with GKE concepts such as node pools, autoscaling, and auto-provisioning.
Introduction
Kubernetes allocates one full GPU per container even if the container only needs a fraction of the GPU for its workload, which can lead to wasted resources and cost overruns, especially if you are using the latest generation of powerful GPUs. To improve GPU utilization, multi-instance GPUs let you partition a single supported GPU into up to seven slices. Each slice can be allocated to one container on the node independently, for a maximum of seven containers per GPU. Multi-instance GPUs provide hardware isolation between the workloads, as well as consistent and predictable QoS for all containers running on the GPU.
For CUDA® applications, multi-instance GPUs are largely transparent. Each GPU partition appears as a regular GPU resource, and the programming model remains unchanged.
For more information on multi-instance GPUs, refer to the NVIDIA multi-instance GPU user guide.
Supported GPUs
The following GPU types support multi-instance GPUs:
- NVIDIA A100 (40GB)
- NVIDIA A100 (80GB)
- NVIDIA H100 (80GB)
Multi-instance GPU partitions
The A100 GPU and H100 GPU consist of seven compute units and eight memory units, which can be partitioned into GPU instances of varying sizes. GPU partition sizes use the following syntax: [compute]g.[memory]gb. For example, a GPU partition size of 1g.5gb refers to a GPU instance with one compute unit (1/7th of the streaming multiprocessors on the GPU) and one memory unit (5 GB). You specify the partition size for the GPUs when you deploy an Autopilot workload or when you create a Standard cluster.
The partitioning table in the NVIDIA multi-instance GPU user guide lists all the different GPU partition sizes, along with the amount of compute and memory resources available on each GPU partition. The table also shows the number of GPU instances for each partition size that can be created on the GPU.
The following table lists the partition sizes that GKE supports:
Partition size | GPU instances |
---|---|
GPU: NVIDIA A100 (40GB) (nvidia-tesla-a100) | |
1g.5gb | 7 |
2g.10gb | 3 |
3g.20gb | 2 |
7g.40gb | 1 |
GPU: NVIDIA A100 (80GB) (nvidia-a100-80gb) | |
1g.10gb | 7 |
2g.20gb | 3 |
3g.40gb | 2 |
7g.80gb | 1 |
GPU: NVIDIA H100 (80GB) (nvidia-h100-80gb and nvidia-h100-mega-80gb) | |
1g.10gb | 7 |
1g.20gb | 4 |
2g.20gb | 3 |
3g.40gb | 2 |
7g.80gb | 1 |
Each GPU on each node within a node pool is partitioned the same way. For example, consider a node pool with two nodes, four GPUs on each node, and a partition size of 1g.5gb. GKE creates seven partitions of size 1g.5gb on each GPU. Since there are four GPUs on each node, there are 28 1g.5gb GPU partitions available on each node. Since there are two nodes in the node pool, a total of 56 1g.5gb GPU partitions are available in the entire node pool.
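As a rough sketch, a node pool like the one in this example could be created with a command similar to the following. The node pool name is illustrative, CLUSTER_NAME and ZONE are placeholders, and the a2-highgpu-4g machine type is assumed because it attaches four A100 (40GB) GPUs to each node:

# Each of the 2 nodes gets 4 A100 GPUs, each split into seven 1g.5gb partitions
gcloud container node-pools create a100-1g5gb-pool \
    --cluster=CLUSTER_NAME \
    --zone=ZONE \
    --machine-type=a2-highgpu-4g \
    --num-nodes=2 \
    --accelerator type=nvidia-tesla-a100,count=4,gpu-partition-size=1g.5gb,gpu-driver-version=default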
To create a GKE Standard cluster with more than one type of GPU partition, you must create multiple node pools. For example, if you want nodes with 1g.5gb and 3g.20gb GPU partitions in a cluster, you must create two node pools: one with the GPU partition size set to 1g.5gb, and the other with 3g.20gb.
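Continuing the sketch above, the second node pool could be created the same way, changing only the partition size (and, if needed, the machine type). The pool name is illustrative and CLUSTER_NAME and ZONE are placeholders:

gcloud container node-pools create a100-3g20gb-pool \
    --cluster=CLUSTER_NAME \
    --zone=ZONE \
    --machine-type=a2-highgpu-1g \
    --num-nodes=1 \
    --accelerator type=nvidia-tesla-a100,count=1,gpu-partition-size=3g.20gb,gpu-driver-version=default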
A GKE Autopilot cluster automatically creates nodes with the correct partition configuration when you create workloads that require different partition sizes.
Each node is labeled with the size of GPU partitions that are available on the node. This labeling allows workloads to target nodes with the needed GPU partition size. For example, a node with 1g.5gb GPU instances is labeled as follows:
cloud.google.com/gke-gpu-partition-size=1g.5gb
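To see which nodes carry a given partition-size label, you can filter by that label, for example:

kubectl get nodes -l cloud.google.com/gke-gpu-partition-size=1g.5gb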
How it works
To use multi-instance GPUs, you perform the following tasks:
- Create a cluster with multi-instance GPUs enabled.
- Manually install drivers.
- Verify how many GPU resources are on the node.
- Deploy containers using multi-instance GPUs.
Pricing
Multi-instance GPUs are exclusive to A100 GPUs and H100 GPUs and are subject to the corresponding GPU pricing in addition to any other products used to run your workloads. You can only attach whole GPUs to nodes in your cluster for partitioning. For GPU pricing information, refer to the GPUs pricing page.
Limitations
- Using multi-instance GPU partitions with GKE is not recommended for untrusted workloads.
- Autoscaling and auto-provisioning of GPU partitions are fully supported on GKE version 1.20.7-gke.400 or later. In earlier versions, only node pools with at least one node can be autoscaled based on demand for specific GPU partition sizes from workloads.
- GPU utilization metrics (for example, duty_cycle) are not available for multi-instance GPUs.
- Multi-instance GPUs split a physical GPU into discrete instances, each of which is isolated from the others at the hardware level. A container that uses a multi-instance GPU instance can only access the compute and memory resources available to that instance.
- A Pod can consume at most one multi-instance GPU instance.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- In Autopilot, multi-instance GPUs are supported in GKE version 1.29.3-gke.1093000 and later.
- You must have sufficient NVIDIA A100 GPU quota. See Requesting an increase in quota.
- If you want to use multi-instance GPUs with Autopilot, you can learn more about using GPUs with Autopilot at Deploy GPU workloads in Autopilot.
- GKE assigns the Accelerator compute class to all multi-instance GPU workloads in Autopilot clusters.
Create a cluster with multi-instance GPUs enabled
If you use GKE Standard, you must enable multi-instance GPUs in the cluster. Autopilot clusters that run version 1.29.3-gke.1093000 and later enable multi-instance GPUs by default. To use multi-instance GPUs in Autopilot, see the Deploy containers using multi-instance GPU section of this page.
When you create a Standard cluster with multi-instance GPUs, you must specify gpuPartitionSize along with acceleratorType and acceleratorCount. The acceleratorType must be nvidia-tesla-a100, nvidia-a100-80gb, or nvidia-h100-80gb.
The following example shows how to create a GKE cluster with one
node, and seven GPU partitions of size 1g.5gb
on the node. The other steps in
this page use a GPU partition size of 1g.5gb
, which creates seven partitions
on each GPU. You can also use any of the supported GPU partition sizes mentioned
earlier.
You can use the Google Cloud CLI or Terraform.
gcloud
Create a cluster with multi-instance GPUs enabled:
gcloud container clusters create CLUSTER_NAME \
--project=PROJECT_ID \
--zone ZONE \
--cluster-version=CLUSTER_VERSION \
--accelerator type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb,gpu-driver-version=DRIVER_VERSION \
--machine-type=a2-highgpu-1g \
--num-nodes=1
Replace the following:
- CLUSTER_NAME: the name of your new cluster.
- PROJECT_ID: the ID of your Google Cloud project.
- ZONE: the compute zone for the cluster control plane.
- CLUSTER_VERSION: the version must be 1.19.7-gke.2503 or later.
- DRIVER_VERSION: the NVIDIA driver version to install. Can be one of the following:
  - default: install the default driver version for your GKE version.
  - latest: install the latest available driver version for your GKE version. Available only for nodes that use Container-Optimized OS.
  - disabled: skip automatic driver installation. You must manually install a driver after you create the cluster. If you omit gpu-driver-version, this is the default option.
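For instance, with hypothetical placeholder values filled in (the cluster name, project ID, and zone below are examples only, and the cluster version defaults to your project's current GKE default), the command might look like this:

gcloud container clusters create my-mig-cluster \
    --project=my-project \
    --zone=us-central1-c \
    --accelerator type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb,gpu-driver-version=default \
    --machine-type=a2-highgpu-1g \
    --num-nodes=1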
Terraform
To create a cluster with multi-instance GPUs enabled using Terraform, refer to the following example:
To learn more about using Terraform, see Terraform support for GKE.
Connect to the cluster
Configure kubectl to connect to the newly created cluster:
gcloud container clusters get-credentials CLUSTER_NAME
Install drivers
If you chose to disable automatic driver installation when creating the cluster, or if you're running a GKE version earlier than 1.27.2-gke.1200, you must manually install a compatible NVIDIA driver after creation completes. Multi-instance GPUs require an NVIDIA driver version 450.80.02 or later.
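For manual installation, GKE provides an NVIDIA driver installation DaemonSet that you apply to the cluster. As a sketch, assuming nodes that use Container-Optimized OS (check the GKE documentation for the manifest that matches your node image and GKE version), the command looks similar to the following:

# Applies NVIDIA's driver installer DaemonSet for Container-Optimized OS nodes
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml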
After the driver is installed, multi-instance GPU mode is enabled. If you automatically installed drivers, your nodes reboot when the GPU device plugin starts to create GPU partitions. If you manually installed drivers, your nodes reboot when driver installation completes. The reboot might take a few minutes to complete.
Verify how many GPU resources are on the node
Run the following command to verify that the capacity and allocatable count of nvidia.com/gpu resources is 7:
kubectl describe nodes
Here's the output from the command:
...
Capacity:
...
nvidia.com/gpu: 7
Allocatable:
...
nvidia.com/gpu: 7
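If the node has many resources, you can filter the output down to the GPU lines, for example:

kubectl describe nodes | grep "nvidia.com/gpu"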
Deploy containers using multi-instance GPU
You can deploy up to one container per multi-instance GPU device on the node. In
this example, with a partition size of 1g.5gb
, there are seven multi-instance
GPU partitions available on the node. As a result, you can deploy up to seven
containers that request GPUs on this node.
Here's an example that starts the cuda:11.0.3-base-ubi7 container and runs nvidia-smi to print the UUID of the GPU within the container. In this example, there are seven containers, and each container receives one GPU partition. This example also sets the cloud.google.com/gke-gpu-partition-size node selector to target nodes with 1g.5gb GPU partitions.
Autopilot
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-simple
spec:
  replicas: 7
  selector:
    matchLabels:
      app: cuda-simple
  template:
    metadata:
      labels:
        app: cuda-simple
    spec:
      nodeSelector:
        cloud.google.com/gke-gpu-partition-size: 1g.5gb
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
        cloud.google.com/gke-accelerator-count: "1"
      containers:
      - name: cuda-simple
        image: nvidia/cuda:11.0.3-base-ubi7
        command:
        - bash
        - -c
        - |
          /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
This manifest does the following:
- Requests the nvidia-tesla-a100 GPU type by setting the cloud.google.com/gke-accelerator node selector.
- Splits the GPU into the 1g.5gb partition size.
- Attaches a single GPU to the node by setting the cloud.google.com/gke-accelerator-count node selector.
Standard
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-simple
spec:
  replicas: 7
  selector:
    matchLabels:
      app: cuda-simple
  template:
    metadata:
      labels:
        app: cuda-simple
    spec:
      nodeSelector:
        cloud.google.com/gke-gpu-partition-size: 1g.5gb
      containers:
      - name: cuda-simple
        image: nvidia/cuda:11.0.3-base-ubi7
        command:
        - bash
        - -c
        - |
          /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
This manifest does the following:
- Requests a single GPU with partition size 1g.5gb by setting the cloud.google.com/gke-gpu-partition-size node selector.
Verify that all seven Pods are running:
kubectl get pods
Here's the output from the command:
NAME                           READY   STATUS    RESTARTS   AGE
cuda-simple-849c47f6f6-4twr2   1/1     Running   0          7s
cuda-simple-849c47f6f6-8cjrb   1/1     Running   0          7s
cuda-simple-849c47f6f6-cfp2s   1/1     Running   0          7s
cuda-simple-849c47f6f6-dts6g   1/1     Running   0          7s
cuda-simple-849c47f6f6-fk2bs   1/1     Running   0          7s
cuda-simple-849c47f6f6-kcv52   1/1     Running   0          7s
cuda-simple-849c47f6f6-pjljc   1/1     Running   0          7s
View the logs to see the GPU UUID, using the name of any Pod from the previous command:
kubectl logs cuda-simple-849c47f6f6-4twr2
Here's the output from the command:
GPU 0: A100-SXM4-40GB (UUID: GPU-45eafa61-be49-c331-f8a2-282736687ab1)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-45eafa61-be49-c331-f8a2-282736687ab1/11/0)
What's next
- Learn more about GPUs.
- Learn how to configure time-sharing on GPUs.
- Learn more about cluster multi-tenancy.
- Learn more about best practices for enterprise multi-tenancy.