Share GPUs with multiple workloads using time-sharing


This page shows you how to let multiple workloads get time-shared access to a single NVIDIA® GPU hardware accelerator in your Google Kubernetes Engine (GKE) nodes. To learn more about how time-sharing works, as well as limitations and examples of when you should use time-sharing GPUs, refer to Time-sharing GPUs on GKE.

Who should use this guide

The instructions in this topic apply to you if you are one of the following:

  • Platform administrator: Creates and manages a GKE cluster, plans infrastructure and resourcing requirements, and monitors the cluster's performance.
  • Application developer: Designs and deploys workloads on GKE clusters. If you want instructions for requesting time-shared GPUs, refer to Deploy workloads that use time-shared GPUs.

Requirements and limitations

  • You can enable time-sharing GPUs on GKE Standard clusters and node pools running GKE version 1.23.7-gke.1400 and later.
  • You can't update existing clusters or node pools to enable time-sharing GPUs.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI.
  • Ensure that you have sufficient NVIDIA Tesla GPU quota. If you need more quota, refer to Requesting an increase in quota.
  • Ensure that your Google Cloud CLI components are at version 384.0.0 or later.
  • Plan your time-sharing GPU capacity based on the resource needs of the workloads and the capacity of the underlying GPU.

Enable time-sharing GPUs on GKE clusters and node pools

As a platform administrator, you must enable time-sharing GPUs on a GKE Standard cluster before developers can deploy workloads to use the GPUs. To enable time-sharing, you must do the following:

  1. Enable time-sharing GPUs on a GKE cluster.
  2. Install NVIDIA GPU device drivers.
  3. Verify the GPU resources available on your nodes.

Enable time-sharing GPUs on a GKE cluster

You can enable time-sharing GPUs when you create GKE Standard clusters. The default node pool in the cluster has the feature enabled. You still need to enable time-sharing GPUs when you manually create new node pools in that cluster.

gcloud container clusters create CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --cluster-version=CLUSTER_VERSION \
    --machine-type=MACHINE_TYPE \
    --accelerator=type=GPU_TYPE,count=GPU_QUANTITY,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=CLIENTS_PER_GPU

Replace the following:

  • CLUSTER_NAME: the name of your new cluster.
  • COMPUTE_REGION: the Compute Engine region for your new cluster. For zonal clusters, specify --zone=COMPUTE_ZONE.
  • CLUSTER_VERSION: the GKE version for the cluster control plane and nodes. Use GKE version 1.23.7-gke.1400 or later. Alternatively, specify a release channel with that GKE version by using the --release-channel=RELEASE_CHANNEL flag.
  • MACHINE_TYPE: the Compute Engine machine type for your nodes. For A100 GPUs, use an A2 machine type. For all other GPUs, use an N1 machine type.
  • GPU_TYPE: the GPU type, which must be an NVIDIA Tesla GPU platform such as nvidia-tesla-v100.
  • GPU_QUANTITY: the number of physical GPUs to attach to each node in the default node pool.
  • CLIENTS_PER_GPU: the maximum number of containers that can share each physical GPU.

Enable time-sharing GPUs on a GKE node pool

You can enable time-sharing GPUs when you manually create new node pools in a GKE cluster.

gcloud container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --machine-type=MACHINE_TYPE \
    --region=COMPUTE_REGION \
    --accelerator=type=GPU_TYPE,count=GPU_QUANTITY,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=CLIENTS_PER_GPU

Replace the following:

  • NODEPOOL_NAME: the name of your new node pool.
  • CLUSTER_NAME: the name of your cluster, which must run GKE version 1.23.7-gke.1400 or later.
  • COMPUTE_REGION: the Compute Engine region of your cluster. For zonal clusters, specify --zone=COMPUTE_ZONE.
  • MACHINE_TYPE: the Compute Engine machine type for your nodes. For A100 GPUs, use an A2 machine type. For all other GPUs, use an N1 machine type.
  • GPU_TYPE: the GPU type, which must be an NVIDIA Tesla GPU platform such as nvidia-tesla-v100.
  • GPU_QUANTITY: the number of physical GPUs to attach to each node in the node pool.
  • CLIENTS_PER_GPU: the maximum number of containers that can share each physical GPU.
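The --accelerator value used in the cluster and node pool commands above is a single comma-separated string, so a typo in one key is easy to miss. The following is a minimal sketch of a helper that assembles that string from the three placeholders. The function name accelerator_flag and the sample values are hypothetical and are not part of the gcloud CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the --accelerator value for a time-sharing
# node pool from the GPU type, the number of physical GPUs per node, and
# the maximum number of containers that can share each physical GPU.
accelerator_flag() {
  gpu_type="$1"
  gpu_quantity="$2"
  clients_per_gpu="$3"
  printf 'type=%s,count=%s,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=%s\n' \
    "$gpu_type" "$gpu_quantity" "$clients_per_gpu"
}

# Illustrative values only: one V100 per node, shared by up to 3 containers.
accelerator_flag nvidia-tesla-v100 1 3
```

You could then pass the result to either command, for example --accelerator="$(accelerator_flag nvidia-tesla-v100 1 3)".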

Install NVIDIA GPU device drivers

Before you proceed, connect to your cluster by running the following command:

gcloud container clusters get-credentials CLUSTER_NAME

After you create a new cluster or node pool and enable time-sharing GPUs, you need to install the GPU device drivers from NVIDIA that manage the time-sharing division of the physical GPUs. To install the drivers, you deploy a GKE installation DaemonSet that sets the drivers up.

For instructions, refer to Installing NVIDIA GPU device drivers.

If you plan to use node auto-provisioning in your cluster, you must also configure node auto-provisioning with the scopes that allow GKE to install the GPU device drivers for you. For instructions, refer to Using node auto-provisioning with GPUs.

Verify the GPU resources available on your nodes

To verify that the number of GPUs visible in your nodes matches the number you specified when you enabled time-sharing, describe your nodes:

kubectl describe nodes NODE_NAME

The output is similar to the following:

...
Capacity:
  ...
  nvidia.com/gpu:             3
Allocatable:
  ...
  nvidia.com/gpu:             3

In this example output, the number of GPU resources on the node is 3 because the value that was specified for max-shared-clients-per-gpu was 3 and the count of physical GPUs to attach to the node was 1. As another example, if the count of physical GPUs was 2, the output would show 6 allocatable GPU resources, three on each physical GPU.
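The arithmetic in the preceding paragraph can be sketched as a small shell function. The function names are hypothetical. The second function reads the value a node reports by using a kubectl JSONPath query, needs access to your cluster, and is defined here but not called.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expected nvidia.com/gpu count on a node, computed as
# (physical GPUs per node) x (max-shared-clients-per-gpu).
expected_gpu_resources() {
  echo $(( $1 * $2 ))
}

expected_gpu_resources 1 3   # prints 3
expected_gpu_resources 2 3   # prints 6

# Hypothetical helper: prints the allocatable nvidia.com/gpu value that a
# node reports. Requires kubectl and cluster credentials, so it is only
# defined here. The dot in "nvidia.com" is escaped for JSONPath.
allocatable_gpu_resources() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

If the two values differ for a node, the node pool was probably created with different GPU_QUANTITY or CLIENTS_PER_GPU values than you expected.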

Deploy workloads that use time-shared GPUs

As an application developer who is deploying GPU workloads, you can select time-shared GPU nodes by specifying the appropriate node labels in a nodeSelector in your manifests. When planning your requests, review the request limits to ensure that GKE doesn't reject your deployments.

To deploy a workload to consume time-sharing GPUs, you need to do the following:

  1. Add a nodeSelector to your Pod manifest for the following labels:

    • cloud.google.com/gke-gpu-sharing-strategy: time-sharing: selects nodes that use time-sharing GPUs.
    • cloud.google.com/gke-max-shared-clients-per-gpu: "CLIENTS_PER_GPU": selects nodes that allow a specific number of containers to share the underlying GPU.
  2. Add the nvidia.com/gpu: 1 GPU resource request to your container specification, in spec.containers.resources.limits.

For example, the following steps show you how to deploy three Pods to a time-sharing GPU node pool. GKE allocates a time-shared GPU to each container. The containers print the UUID of the GPU that's attached to that container.

  1. Save the following manifest as gpu-timeshare.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cuda-simple
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: cuda-simple
      template:
        metadata:
          labels:
            app: cuda-simple
        spec:
          nodeSelector:
            cloud.google.com/gke-gpu-sharing-strategy: time-sharing
            cloud.google.com/gke-max-shared-clients-per-gpu: "3"
          containers:
          - name: cuda-simple
            image: nvidia/cuda:11.0.3-base-ubi7
            command:
            - bash
            - -c
            - |
              /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
            resources:
              limits:
                nvidia.com/gpu: 1
    
  2. Apply the manifest:

    kubectl apply -f gpu-timeshare.yaml
    
  3. Check that all Pods are running:

    kubectl get pods -l=app=cuda-simple
    
  4. Check the logs for any Pod to view the UUID of the GPU:

    kubectl logs POD_NAME
    

    The output is similar to the following:

    GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-0771302b-eb3a-6756-7a23-0adcae8efd47)
    
  5. If your nodes have one physical GPU attached, check the logs for any other Pod on the same node to verify that the GPU UUID is the same:

    kubectl logs POD2_NAME
    

    The output is similar to the following:

    GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-0771302b-eb3a-6756-7a23-0adcae8efd47)
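Instead of comparing log output by eye, you can extract the UUID from each Pod's log. The following is a minimal sketch that assumes the log line has the nvidia-smi -L format shown above. The function names are hypothetical, and list_pod_gpu_uuids needs access to your cluster, so it is defined but not called.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `nvidia-smi -L` output on stdin and prints
# only the text that follows "UUID: " inside the parentheses.
extract_gpu_uuid() {
  sed -n 's/.*(UUID: \([^)]*\)).*/\1/p'
}

# Sample line copied from the example output above.
echo 'GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-0771302b-eb3a-6756-7a23-0adcae8efd47)' \
  | extract_gpu_uuid
# prints GPU-0771302b-eb3a-6756-7a23-0adcae8efd47

# Hypothetical helper: prints "<pod> <uuid>" for every Pod of the
# cuda-simple Deployment. Requires kubectl and cluster credentials.
list_pod_gpu_uuids() {
  for pod in $(kubectl get pods -l app=cuda-simple -o name); do
    printf '%s %s\n' "$pod" "$(kubectl logs "$pod" | extract_gpu_uuid)"
  done
}
```

Pods that share one physical GPU on the same node should report the same UUID. Pods scheduled on different nodes report different UUIDs.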
    

Use time-sharing GPUs with multi-instance GPUs

As a platform administrator, you might want to combine multiple GKE GPU features. Time-sharing GPUs works with multi-instance GPUs, which partition a single physical GPU into up to seven slices. These partitions are isolated from each other. You can configure time-sharing GPUs for each multi-instance GPU partition.

For example, if you set the gpu-partition-size to 1g.5gb, the underlying GPU would be split into seven partitions. If you also set max-shared-clients-per-gpu to 3, each partition would support up to three containers, for a total of 21 time-shared GPU devices available to allocate. To learn about how the gpu-partition-size converts to actual partitions, refer to Multi-instance GPU partitions.
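The device count in the preceding example is the product of the number of partitions and the number of clients per partition. The following is a minimal sketch of that calculation. The function name is hypothetical, and the partition count is passed in directly rather than derived from the gpu-partition-size value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: time-shared devices exposed by one physical GPU,
# computed as (multi-instance partitions) x (max-shared-clients-per-gpu).
expected_mig_shared_devices() {
  echo $(( $1 * $2 ))
}

# Seven 1g.5gb partitions, each shared by up to three containers.
expected_mig_shared_devices 7 3   # prints 21
```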

To create a node pool that uses time-shared, multi-instance GPUs, run the following command:

gcloud container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --machine-type=MACHINE_TYPE \
    --region=COMPUTE_REGION \
    --accelerator=type=nvidia-tesla-a100,count=GPU_QUANTITY,gpu-partition-size=PARTITION_SIZE,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=CLIENTS_PER_GPU

Replace PARTITION_SIZE with the multi-instance GPU partition size that you want, such as 1g.5gb.

What's next