Running multi-instance GPUs


This page provides instructions on how to partition an NVIDIA® A100 graphics processing unit (GPU) to share a single GPU across multiple containers on Google Kubernetes Engine (GKE).

This page assumes that you are familiar with Kubernetes concepts such as Pods, nodes, Deployments, and namespaces, and with GKE concepts such as node pools, autoscaling, and auto-provisioning.

Introduction

Kubernetes allocates one full GPU per container even if the container only needs a fraction of the GPU for its workload, which might lead to wasted resources and cost overruns, especially if you are using the latest generation of powerful GPUs. To improve GPU utilization, multi-instance GPUs let you partition a single NVIDIA A100 GPU into up to seven slices. Each slice can be allocated to one container on the node independently, for a maximum of seven containers per NVIDIA A100 GPU. Multi-instance GPUs provide hardware isolation between workloads, as well as consistent and predictable quality of service (QoS) for all containers running on the GPU.

For CUDA® applications, multi-instance GPUs are largely transparent. Each GPU partition appears as a regular GPU resource, and the programming model remains unchanged.

For more information on multi-instance GPUs, refer to the NVIDIA multi-instance GPU user guide.

Multi-instance GPU partitions

The A100 GPU consists of seven compute units and eight memory units, which can be partitioned into GPU instances of varying sizes. The GPU partition sizes use the following syntax: [compute]g.[memory]gb. For example, a GPU partition size of 1g.5gb refers to a GPU instance with one compute unit (1/7th of streaming multiprocessors on the GPU), and one memory unit (5 GB). The partition size for A100 GPUs can be specified when you create a cluster. See the Create a cluster with multi-instance GPUs enabled section for an example.

The partitioning table in the NVIDIA multi-instance GPU user guide lists all the different GPU partition sizes, along with the amount of compute and memory resources available on each GPU partition. The table also shows the number of GPU instances for each partition size that can be created on the A100 GPU.

The following table lists the partition sizes that GKE supports:

Partition size | GPU instances | Compute units per instance | Memory units per instance
---------------|---------------|----------------------------|--------------------------
1g.5gb         | 7             | 1                          | 1
2g.10gb        | 3             | 2                          | 2
3g.20gb        | 2             | 3                          | 4
7g.40gb        | 1             | 7                          | 8

Each GPU on each node within a node pool is partitioned the same way. For example, consider a node pool with two nodes, four GPUs on each node, and a partition size of 1g.5gb. GKE creates seven partitions of size 1g.5gb on each GPU. Since there are four GPUs on each node, there will be 28 1g.5gb GPU partitions available on each node. Since there are two nodes in the node pool, a total of 56 1g.5gb GPU partitions are available in the entire node pool.

To create a GKE cluster with more than one type of GPU partition, you must create multiple node pools. For example, if you want nodes with 1g.5gb and 3g.20gb GPU partitions in a cluster, you must create two node pools: one with the GPU partition size set to 1g.5gb, and the other with 3g.20gb.
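
For example, after you create a cluster with a 1g.5gb node pool (as shown later on this page), you could add a second node pool that uses the 3g.20gb partition size with a command along the following lines. This is a sketch: the node pool name is a placeholder, and it assumes the same --accelerator syntax that the cluster creation example below uses.

gcloud container node-pools create POOL_NAME \
    --cluster=CLUSTER_NAME \
    --zone=ZONE \
    --accelerator=type=nvidia-tesla-a100,count=1,gpu-partition-size=3g.20gb \
    --machine-type=a2-highgpu-1g \
    --num-nodes=1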

Each node is labeled with the size of GPU partitions that are available on the node. This labeling allows workloads to target nodes with the needed GPU partition size. For example, on a node with 1g.5gb GPU instances, the node is labeled as:

cloud.google.com/gke-gpu-partition-size=1g.5gb
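
For example, you could use this label to list only the nodes that expose 1g.5gb partitions (a standard kubectl label selector; the label value matches the partition size you chose):

kubectl get nodes -l cloud.google.com/gke-gpu-partition-size=1g.5gb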

How it works

To use multi-instance GPUs, you perform the following tasks:

  1. Create a cluster with multi-instance GPUs enabled.
  2. Install drivers and configure GPU partitions.
  3. Verify how many GPU resources are on the node.
  4. Deploy containers on the node.

Pricing

Multi-instance GPUs are exclusive to A100 GPUs and are subject to A100 GPU pricing in addition to any other products used to run your workloads. You can only attach whole A100 GPUs to nodes in your cluster for partitioning. For GPU pricing information, refer to the GPUs pricing page.

Limitations

  • Using multi-instance GPU partitions with GKE is not recommended for untrusted workloads.
  • Autoscaling and node auto-provisioning of GPU partitions are fully supported on GKE version 1.20.7-gke.400 or later. In earlier versions, only node pools with at least one node can be autoscaled based on workload demand for specific GPU partition sizes.
  • GPU utilization metrics (for example, duty_cycle) are not available for GPU instances.

Before you begin

Before you start, make sure you have performed the following tasks:

Set up default gcloud settings using one of the following methods:

  • Using gcloud init, if you want to be walked through setting defaults.
  • Using gcloud config, to individually set your project ID, zone, and region.

Using gcloud init

If you receive the error One of [--zone, --region] must be supplied: Please specify location, complete this section.

  1. Run gcloud init and follow the directions:

    gcloud init

    If you are using SSH on a remote server, use the --console-only flag to prevent the command from launching a browser:

    gcloud init --console-only
  2. Follow the instructions to authorize gcloud to use your Google Cloud account.
  3. Create a new configuration or select an existing one.
  4. Choose a Google Cloud project.
  5. Choose a default Compute Engine zone for zonal clusters or a region for regional or Autopilot clusters.

Using gcloud config

  • Set your default project ID:
    gcloud config set project PROJECT_ID
  • If you are working with zonal clusters, set your default compute zone:
    gcloud config set compute/zone COMPUTE_ZONE
  • If you are working with Autopilot or regional clusters, set your default compute region:
    gcloud config set compute/region COMPUTE_REGION
  • Update gcloud to the latest version:
    gcloud components update

Also ensure that you meet the following requirements:

  • Multi-instance GPUs are supported on GKE version 1.19.7-gke.2503 or later.
  • You must have sufficient NVIDIA A100 GPU quota. See Requesting an increase in quota.
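
To check these requirements before you create a cluster, you can list the GKE versions available in your zone and inspect your regional A100 quota. This is a minimal sketch: the zone and region are placeholders, and the grep pattern assumes that the A100 quota metric name contains the string A100.

# List the GKE versions available in your zone.
gcloud container get-server-config --zone=COMPUTE_ZONE

# Show regional GPU quotas; the A100 metric name is assumed to contain "A100".
gcloud compute regions describe COMPUTE_REGION | grep -i -B 1 -A 1 a100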

Create a cluster with multi-instance GPUs enabled

When you create a cluster with multi-instance GPUs, you must specify gpuPartitionSize along with acceleratorType and acceleratorCount. Since multi-instance GPUs are supported only on A100 GPUs, the acceleratorType must be nvidia-tesla-a100.

The following example shows how to create a GKE cluster with one node, and seven GPU partitions of size 1g.5gb on the node. The other steps in this page use a GPU partition size of 1g.5gb, which creates seven partitions on each GPU. You can also use any of the supported GPU partition sizes mentioned earlier.

  1. To create a cluster with multi-instance GPUs enabled using the gcloud command-line tool, run the following command:

    gcloud container clusters create CLUSTER_NAME \
        --project=PROJECT_ID \
        --zone=ZONE \
        --cluster-version=CLUSTER_VERSION \
        --accelerator=type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb \
        --machine-type=a2-highgpu-1g \
        --num-nodes=1
    

    Replace the following:

    • CLUSTER_NAME: the name of your new cluster.
    • PROJECT_ID: the ID of your Google Cloud project.
    • ZONE: the compute zone for the cluster control plane.
    • CLUSTER_VERSION: the version must be 1.19.7-gke.2503 or later.
  2. Configure kubectl to connect to the newly created cluster:

    gcloud container clusters get-credentials CLUSTER_NAME
    

    Here's the output from the get-credentials command:

    Fetching cluster endpoint and auth data.
    kubeconfig entry generated for CLUSTER_NAME.
    

Install drivers and configure GPU partitions

After creating the cluster, you must update the device plugin and install NVIDIA's device drivers on the nodes.

  1. Update the device plugin to enable the discovery of GPU instances:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/cmd/nvidia_gpu/device-plugin.yaml
    
  2. Multi-instance GPUs require NVIDIA driver version 450.80.02 or later. After the driver is installed, you must enable multi-instance GPU mode and then reboot the node for the change to take effect. After the GPUs are in multi-instance GPU mode, you must create the required GPU partitions.

    Apply the following DaemonSet manifest, which performs all of these actions:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-nvidia-mig.yaml
    

    This task restarts the node to enable multi-instance GPU mode, so it might take a few minutes to complete.
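
If you want to follow the installation progress, you can watch the workloads that the DaemonSet creates. This is a sketch: it assumes that the manifest above creates a DaemonSet named nvidia-driver-installer in the kube-system namespace; use the first command to check the actual name if the rollout command fails.

# List DaemonSets in kube-system to find the driver installer.
kubectl get daemonsets -n kube-system

# Wait for the installer to finish rolling out (name assumed from the manifest above).
kubectl rollout status daemonset/nvidia-driver-installer -n kube-system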

Verify how many GPU resources are on the node

Run the following command to verify that the capacity and allocatable count of nvidia.com/gpu resources is 7:

kubectl describe nodes

Here's the output from the command:

...
Capacity:
  ...
  nvidia.com/gpu:             7
Allocatable:
  ...
  nvidia.com/gpu:             7
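
If your cluster has more than one node, a more compact way to check the allocatable partition count per node is a custom-columns query. This is a sketch; it assumes the nvidia.com/gpu resource name shown in the output above.

kubectl get nodes -o custom-columns="NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"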

Deploy containers on the node

You can deploy up to one container per multi-instance GPU device on the node. In this example, with a partition size of 1g.5gb, there are seven multi-instance GPU partitions available on the node. As a result, you can deploy up to seven containers that request GPUs on this node.

  1. Here's an example that starts the cuda:11.0-base container and runs nvidia-smi to print the UUID of the GPU within the container. In this example, there are seven containers, and each container receives one GPU partition. This example also sets the node selector to target nodes with 1g.5gb GPU partitions.

    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cuda-simple
    spec:
      replicas: 7
      selector:
        matchLabels:
          app: cuda-simple
      template:
        metadata:
          labels:
            app: cuda-simple
        spec:
          nodeSelector:
            cloud.google.com/gke-gpu-partition-size: 1g.5gb
          containers:
          - name: cuda-simple
            image: nvidia/cuda:11.0-base
            command:
            - bash
            - -c
            - |
              /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
            resources:
              limits:
                nvidia.com/gpu: 1
    EOF
    
  2. Verify that all seven Pods are running:

    kubectl get pods
    

    Here's the output from the command:

    NAME                           READY   STATUS    RESTARTS   AGE
    cuda-simple-849c47f6f6-4twr2   1/1     Running   0          7s
    cuda-simple-849c47f6f6-8cjrb   1/1     Running   0          7s
    cuda-simple-849c47f6f6-cfp2s   1/1     Running   0          7s
    cuda-simple-849c47f6f6-dts6g   1/1     Running   0          7s
    cuda-simple-849c47f6f6-fk2bs   1/1     Running   0          7s
    cuda-simple-849c47f6f6-kcv52   1/1     Running   0          7s
    cuda-simple-849c47f6f6-pjljc   1/1     Running   0          7s
    
  3. View the logs to see the GPU UUID, using the name of a Pod from the previous command:

    kubectl logs cuda-simple-849c47f6f6-4twr2
    

    Here's the output from the command:

    GPU 0: A100-SXM4-40GB (UUID: GPU-45eafa61-be49-c331-f8a2-282736687ab1)
      MIG 1g.5gb Device 0: (UUID: MIG-GPU-45eafa61-be49-c331-f8a2-282736687ab1/11/0)
    
  4. Repeat for any other logs that you want to view:

    kubectl logs cuda-simple-849c47f6f6-8cjrb
    

    Here's the output from the command:

    GPU 0: A100-SXM4-40GB (UUID: GPU-45eafa61-be49-c331-f8a2-282736687ab1)
      MIG 1g.5gb Device 0: (UUID: MIG-GPU-45eafa61-be49-c331-f8a2-282736687ab1/7/0)
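
When you finish experimenting, you can delete the example Deployment so that the GPU partitions become available to other workloads:

kubectl delete deployment cuda-simple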
    

What's next