Share GPUs with multiple workloads using GPU time-sharing


This page shows you how to let multiple workloads get GPU time-sharing access to a single NVIDIA® GPU hardware accelerator in your Google Kubernetes Engine (GKE) nodes. To learn more about how GPU time-sharing works, as well as limitations and examples of when you should use GPU time-sharing, refer to GPU time-sharing on GKE.

Overview

GPU time-sharing is a GKE feature that lets multiple containers share a single physical GPU attached to a node. Using GPU time-sharing in GKE lets you more efficiently use your attached GPUs and save running costs.

Who should use this guide

The instructions in this guide apply to you if you are one of the following:

  • Platform administrator: Creates and manages a GKE cluster, plans infrastructure and resourcing requirements, and monitors the cluster's performance.
  • Application developer: Designs and deploys workloads on GKE clusters. If you want instructions for requesting GPU time-sharing, refer to Deploy workloads that use GPU time-sharing.

Requirements

  • GKE version: You can enable GPU time-sharing on GKE Standard clusters running GKE version 1.23.7-gke.1400 and later. You can use time-sharing GPUs on GKE Autopilot clusters running GKE version 1.29.3-gke.1093000 and later.
  • GPU type: You can enable GPU time-sharing on nodes that use GPU types later than NVIDIA Tesla® K80.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Enable GPU time-sharing on GKE clusters and node pools

As a platform administrator, you must enable GPU time-sharing on a GKE Standard cluster before developers can deploy workloads to use the GPUs. To enable GPU time-sharing, you must do the following:

  1. Enable GPU time-sharing on a GKE cluster.
  2. Install NVIDIA GPU device drivers (if required).
  3. Verify the GPU resources available on your nodes.

Autopilot clusters that run version 1.29.3-gke.1093000 and later enable time-sharing GPUs by default. Time-sharing on Autopilot clusters is configured in the workload specification. To learn more, see the Deploy workloads that use GPU time-sharing section.

Enable GPU time-sharing on a GKE Standard cluster

You can enable GPU time-sharing when you create GKE Standard clusters; the default node pool in the cluster then has the feature enabled. You still need to enable GPU time-sharing when you manually create new node pools in that cluster. To create a cluster with GPU time-sharing enabled, run the following command:

gcloud container clusters create CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --cluster-version=CLUSTER_VERSION \
    --machine-type=MACHINE_TYPE \
    --accelerator=type=GPU_TYPE,count=GPU_QUANTITY,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=CLIENTS_PER_GPU,gpu-driver-version=DRIVER_VERSION

Replace the following:

  • CLUSTER_NAME: the name of your new cluster.
  • COMPUTE_REGION: the Compute Engine region for your new cluster. For zonal clusters, specify --zone=COMPUTE_ZONE.
  • CLUSTER_VERSION: the GKE version for the cluster control plane and nodes. Use GKE version 1.23.7-gke.1400 or later. Alternatively, specify a release channel with that GKE version by using the --release-channel=RELEASE_CHANNEL flag.
  • MACHINE_TYPE: the Compute Engine machine type for your nodes. We recommend that you select an Accelerator-optimized machine type.
  • GPU_TYPE: the GPU type, which must be an NVIDIA Tesla GPU platform such as nvidia-tesla-v100.
  • GPU_QUANTITY: the number of physical GPUs to attach to each node in the default node pool.
  • CLIENTS_PER_GPU: the maximum number of containers that can share each physical GPU.
  • DRIVER_VERSION: the NVIDIA driver version to install. Can be one of the following:
    • default: Install the default driver version for your GKE version.
    • latest: Install the latest available driver version for your GKE version. Available only for nodes that use Container-Optimized OS.
    • disabled: Skip automatic driver installation. You must manually install a driver after you create the node pool. If you omit gpu-driver-version, this is the default option.
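
For example, the following command sketches a zonal cluster in which each node has one NVIDIA T4 GPU that up to four containers can share. The cluster name, zone, and machine type are placeholders for illustration only; adjust them for your environment and GPU availability.

gcloud container clusters create timeshare-cluster \
    --zone=us-central1-a \
    --release-channel=regular \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=4,gpu-driver-version=default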

Enable GPU time-sharing on a GKE node pool

You can enable GPU time-sharing when you manually create new node pools in a GKE cluster. To create a node pool with GPU time-sharing enabled, run the following command:

gcloud container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --machine-type=MACHINE_TYPE \
    --region=COMPUTE_REGION \
    --accelerator=type=GPU_TYPE,count=GPU_QUANTITY,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=CLIENTS_PER_GPU,gpu-driver-version=DRIVER_VERSION

Replace the following:

  • NODEPOOL_NAME: the name of your new node pool.
  • CLUSTER_NAME: the name of your cluster, which must run GKE version 1.23.7-gke.1400 or later.
  • COMPUTE_REGION: the Compute Engine region of your cluster. For zonal clusters, specify --zone=COMPUTE_ZONE.
  • MACHINE_TYPE: the Compute Engine machine type for your nodes. We recommend that you select an Accelerator-optimized machine type.
  • GPU_TYPE: the GPU type, which must be an NVIDIA Tesla GPU platform such as nvidia-tesla-v100.
  • GPU_QUANTITY: the number of physical GPUs to attach to each node in the node pool.
  • CLIENTS_PER_GPU: the maximum number of containers that can share each physical GPU.
  • DRIVER_VERSION: the NVIDIA driver version to install. Can be one of the following:

    • default: Install the default driver version for your GKE version.
    • latest: Install the latest available driver version for your GKE version. Available only for nodes that use Container-Optimized OS.
    • disabled: Skip automatic driver installation. You must manually install a driver after you create the node pool. If you omit gpu-driver-version, this is the default option.
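
After GKE creates the node pool, it adds node labels that reflect the sharing configuration (the same labels that workloads select on later). As a quick check, assuming you have already fetched cluster credentials (see the next section), you can show those labels as extra columns:

kubectl get nodes \
    -L cloud.google.com/gke-gpu-sharing-strategy \
    -L cloud.google.com/gke-max-shared-clients-per-gpu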

Install NVIDIA GPU device drivers

Before you proceed, connect to your cluster by running the following command:

gcloud container clusters get-credentials CLUSTER_NAME

If you chose to disable automatic driver installation when creating the cluster, or if you use a GKE version earlier than 1.27.2-gke.1200, you must manually install a compatible NVIDIA driver to manage the GPU time-sharing division of the physical GPUs. To install the drivers, you deploy a GKE installation DaemonSet that sets up the drivers.

For instructions, refer to Installing NVIDIA GPU device drivers.
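
For nodes that run Container-Optimized OS, the installation is typically a single kubectl apply of the driver installer DaemonSet, similar to the following; use the manifest from the linked instructions that matches your node image and GKE version:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml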

If you plan to use node auto-provisioning in your cluster, you must also configure node auto-provisioning with the scopes that allow GKE to install the GPU device drivers for you. For instructions, refer to Using node auto-provisioning with GPUs.
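
As a sketch, enabling node auto-provisioning with scopes that include read-only access to Cloud Storage (which lets auto-provisioned nodes pull the driver installer image) might look like the following; the resource limits are placeholder values, and the linked page describes the full set of flags:

gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --min-cpu=1 --max-cpu=64 \
    --min-memory=1 --max-memory=256 \
    --autoprovisioning-scopes=https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring,https://www.googleapis.com/auth/devstorage.read_only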

Verify the GPU resources available on your nodes

To verify that the number of GPUs visible in your nodes matches the number you specified when you enabled GPU time-sharing, describe your nodes:

kubectl describe nodes NODE_NAME

The output is similar to the following:

...
Capacity:
  ...
  nvidia.com/gpu:             3
Allocatable:
  ...
  nvidia.com/gpu:             3

In this example output, the number of GPU resources on the node is 3 because the value that was specified for max-shared-clients-per-gpu was 3 and the count of physical GPUs to attach to the node was 1. As another example, if the count of physical GPUs was 2, the output would show 6 allocatable GPU resources, three on each physical GPU.
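
To list the allocatable GPU count for every node at once instead of describing nodes one at a time, you can use a custom-columns query; the backslash escapes the dots inside the resource name:

kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'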

Deploy workloads that use GPU time-sharing

As an application operator who is deploying GPU workloads, you can select nodes that have GPU time-sharing enabled by specifying the appropriate node labels in a nodeSelector in your manifests. When planning your requests, review the request limits to ensure that GKE doesn't reject your deployments.

To deploy a workload to consume GPU time-sharing, complete the following steps:

  1. Add a nodeSelector to your workload manifest for the following labels:

    • cloud.google.com/gke-gpu-sharing-strategy: time-sharing: selects nodes that use GPU time-sharing.
    • cloud.google.com/gke-max-shared-clients-per-gpu: "CLIENTS_PER_GPU": selects nodes that allow a specific number of containers to share the underlying GPU.
  2. Add the nvidia.com/gpu=1 GPU resource request to your container specification, in spec.containers.resources.limits.

For example, the following steps show you how to deploy three Pods to a GPU time-sharing node pool. GKE allocates each container to the same physical GPU. The containers print the UUID of the GPU that's attached to that container.

  1. Save the following manifest as gpu-timeshare.yaml:

Autopilot

        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: cuda-simple
        spec:
          replicas: 3
          selector:
            matchLabels:
              app: cuda-simple
          template:
            metadata:
              labels:
                app: cuda-simple
            spec:
              nodeSelector:
                cloud.google.com/gke-accelerator: "GPU_TYPE"
                cloud.google.com/gke-gpu-sharing-strategy: "time-sharing"
                cloud.google.com/gke-max-shared-clients-per-gpu: "CLIENTS_PER_GPU"
                cloud.google.com/gke-accelerator-count: "GPU_COUNT"
              containers:
              - name: cuda-simple
                image: nvidia/cuda:11.0.3-base-ubi7
                command:
                - bash
                - -c
                - |
                  /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
                resources:
                  limits:
                    nvidia.com/gpu: 1
      

Replace the following:

  • GPU_TYPE: the GPU type.
  • CLIENTS_PER_GPU: the number of workloads that will use this GPU. For this example, use 3.
  • GPU_COUNT: the number of physical GPUs to attach to the node. For this example, use 1.

Standard

        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: cuda-simple
        spec:
          replicas: 3
          selector:
            matchLabels:
              app: cuda-simple
          template:
            metadata:
              labels:
                app: cuda-simple
            spec:
              nodeSelector:
                cloud.google.com/gke-gpu-sharing-strategy: "SHARING_STRATEGY"
                cloud.google.com/gke-max-shared-clients-per-gpu: "CLIENTS_PER_GPU"
              containers:
              - name: cuda-simple
                image: nvidia/cuda:11.0.3-base-ubi7
                command:
                - bash
                - -c
                - |
                  /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
                resources:
                  limits:
                    nvidia.com/gpu: 1
      

Replace the following:

  • SHARING_STRATEGY: the GPU sharing strategy. Use time-sharing to request GPU time-sharing for your GPU.
  • CLIENTS_PER_GPU: the number of workloads that will use this GPU. For this example, use 3.
  2. Apply the manifest:

    kubectl apply -f gpu-timeshare.yaml
    
  3. Check that all Pods are running:

    kubectl get pods -l=app=cuda-simple
    
  4. Check the logs for any Pod to view the UUID of the GPU:

    kubectl logs POD_NAME
    

    The output is similar to the following:

    GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-0771302b-eb3a-6756-7a23-0adcae8efd47)
    
  5. If your nodes have one physical GPU attached, check the logs for any other Pod on the same node to verify that the GPU UUID is the same. To find Pods that share a node, see the command after these steps.

    kubectl logs POD2_NAME
    

    The output is similar to the following:

    GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-0771302b-eb3a-6756-7a23-0adcae8efd47)
    
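To find Pods that landed on the same node before comparing their logs, list the Pods together with their node assignments:

kubectl get pods -l app=cuda-simple -o wide

The NODE column shows which Pods share a node, and therefore a physical GPU in this example.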

Use GPU time-sharing with multi-instance GPUs

As a platform administrator, you might want to combine multiple GKE GPU features. GPU time-sharing works with multi-instance GPUs, which partition a single physical GPU into up to seven slices. These partitions are isolated from each other. You can configure GPU time-sharing for each multi-instance GPU partition.

For example, if you set the gpu-partition-size to 1g.5gb, the underlying GPU would be split into seven partitions. If you also set max-shared-clients-per-gpu to 3, each partition would support up to three containers, for a total of up to 21 GPU time-sharing devices available to allocate in that physical GPU. To learn about how the gpu-partition-size converts to actual partitions, refer to Multi-instance GPU partitions.

To set up GPU time-sharing with multi-instance GPUs, use one of the following options:

Autopilot

With Autopilot, you can use GPU time-sharing and multi-instance GPUs together by requesting both features with their node selectors in the workload manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-simple
spec:
  replicas: 7
  selector:
    matchLabels:
      app: cuda-simple
  template:
    metadata:
      labels:
        app: cuda-simple
    spec:
      nodeSelector:
        cloud.google.com/gke-gpu-partition-size: 1g.5gb
        cloud.google.com/gke-gpu-sharing-strategy: time-sharing
        cloud.google.com/gke-max-shared-clients-per-gpu: "3"
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
        cloud.google.com/gke-accelerator-count: "1"
      containers:
      - name: cuda-simple
        image: nvidia/cuda:11.0.3-base-ubi7
        command:
        - bash
        - -c
        - |
          /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
        resources:
          limits:
            nvidia.com/gpu: 1

Standard

With Standard, create a node pool that enables both GPU time-sharing and multi-instance GPUs by running the following command:

gcloud container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --machine-type=MACHINE_TYPE \
    --region=COMPUTE_REGION \
    --accelerator=type=nvidia-tesla-a100,count=GPU_QUANTITY,gpu-partition-size=PARTITION_SIZE,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=CLIENTS_PER_GPU,gpu-driver-version=DRIVER_VERSION

Replace PARTITION_SIZE with the multi-instance GPU partition size that you want, such as 1g.5gb.
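
After the nodes are ready, you can confirm the multiplied device count described earlier. With gpu-partition-size=1g.5gb and max-shared-clients-per-gpu=3 on one physical GPU, describing a node should show 21 allocatable nvidia.com/gpu resources:

kubectl describe nodes NODE_NAME | grep nvidia.com/gpu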

Limitations

  • With GPU time-sharing, GKE enforces address space isolation, performance isolation, and error isolation between containers that share a physical GPU. However, memory limits aren't enforced on GPUs. To avoid running into out-of-memory (OOM) issues, set GPU memory limits in your workloads. To avoid security issues, deploy only workloads that are within the same trust boundary on nodes that use GPU time-sharing.
  • To prevent unexpected behavior during capacity allocation, GKE might reject certain GPU time-sharing requests. For details, see GPU requests for GPU time-sharing.
  • The maximum number of containers that can use time-sharing in a single physical GPU is 48. When planning your GPU time-sharing configuration, consider the resource needs of your workloads and the capacity of the underlying physical GPUs to optimize your performance and responsiveness.

What's next