This page shows you how to let multiple workloads get GPU time-sharing access to a single NVIDIA® GPU hardware accelerator in your Google Kubernetes Engine (GKE) nodes. To learn more about how GPU time-sharing works, as well as limitations and examples of when you should use GPU time-sharing, refer to GPU time-sharing on GKE.
Overview
GPU time-sharing is a GKE feature that lets multiple containers share a single physical GPU attached to a node. Using GPU time-sharing in GKE lets you more efficiently use your attached GPUs and save running costs.
Who should use this guide
The instructions in this guide apply to you if you are one of the following:
- Platform administrator: Creates and manages a GKE cluster, plans infrastructure and resourcing requirements, and monitors the cluster's performance.
- Application developer: Designs and deploys workloads on GKE clusters. If you want instructions for requesting GPU time-sharing, refer to Deploy workloads that use GPU time-sharing.
Requirements
- GKE version: You can enable GPU time-sharing on GKE Standard clusters running GKE version 1.23.7-gke.1400 and later. You can use time-sharing GPUs on GKE Autopilot clusters running GKE version 1.29.3-gke.1093000 and later.
- GPU type: You can enable GPU time-sharing on all NVIDIA GPU models.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- Ensure that you have sufficient quota for the NVIDIA GPU models that you plan to use. If you need more quota, refer to Requesting an increase in quota. An example quota check follows this list.
- Plan your GPU capacity based on the resource needs of the workloads and the capacity of the underlying GPU.
- Review the limitations of GPU time-sharing.
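To check the GPU quota that is currently available in a region, you can inspect the region's quota metrics with the gcloud CLI. This is a minimal sketch; us-central1 is an illustrative region, and the grep filter simply narrows the output to NVIDIA GPU quota metrics.
gcloud compute regions describe us-central1 --format=yaml | grep -B1 -A1 NVIDIA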
Enable GPU time-sharing on GKE clusters and node pools
As a platform administrator, you must enable GPU time-sharing on a GKE Standard cluster before developers can deploy workloads to use the GPUs. To enable GPU time-sharing, you must do the following:
- Enable GPU time-sharing on a GKE cluster.
- Install NVIDIA GPU device drivers (if required).
- Verify the GPU resources available on your nodes.
Autopilot clusters that run version 1.29.3-gke.1093000 and later enable time-sharing GPUs by default. Time-sharing on Autopilot clusters is configured in the workload specification. To learn more, see the Deploy workloads that use time-shared GPUs section.
Enable GPU time-sharing on a GKE Standard cluster
You can enable GPU time-sharing when you create GKE Standard clusters. The default node pool in the cluster then has the feature enabled. However, you still need to enable GPU time-sharing when you manually create new node pools in that cluster. To create a cluster with GPU time-sharing enabled, run the following command:
gcloud container clusters create CLUSTER_NAME \
--region=COMPUTE_REGION \
--cluster-version=CLUSTER_VERSION \
--machine-type=MACHINE_TYPE \
--accelerator=type=GPU_TYPE,count=GPU_QUANTITY,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=CLIENTS_PER_GPU,gpu-driver-version=DRIVER_VERSION
Replace the following:
- CLUSTER_NAME: the name of your new cluster.
- COMPUTE_REGION: the Compute Engine region for your new cluster. For zonal clusters, specify --zone=COMPUTE_ZONE.
- CLUSTER_VERSION: the GKE version for the cluster control plane and nodes. Use GKE version 1.23.7-gke.1400 or later. Alternatively, specify a release channel with that GKE version by using the --release-channel=RELEASE_CHANNEL flag.
- MACHINE_TYPE: the Compute Engine machine type for your nodes. We recommend that you select an Accelerator-optimized machine type.
- GPU_TYPE: the GPU type, which must be an NVIDIA GPU platform such as nvidia-tesla-v100.
- GPU_QUANTITY: the number of physical GPUs to attach to each node in the default node pool.
- CLIENTS_PER_GPU: the maximum number of containers that can share each physical GPU.
- DRIVER_VERSION: the NVIDIA driver version to install. Can be one of the following:
  - default: Install the default driver version for your GKE version.
  - latest: Install the latest available driver version for your GKE version. Available only for nodes that use Container-Optimized OS.
  - disabled: Skip automatic driver installation. You must manually install a driver after you create the node pool. If you omit gpu-driver-version, this is the default option.
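For example, the following command creates a cluster whose default node pool attaches one NVIDIA T4 GPU to each node and lets up to three containers share each GPU. The cluster name, region, and machine type are illustrative values only; if you omit --cluster-version, gcloud uses the default version for your release channel, so confirm that it meets the version requirement above.
gcloud container clusters create timeshare-cluster \
    --region=us-central1 \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=3,gpu-driver-version=default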
Enable GPU time-sharing on a GKE node pool
You can enable GPU time-sharing when you manually create new node pools in a GKE cluster. To create a node pool with GPU time-sharing enabled, run the following command:
gcloud container node-pools create NODEPOOL_NAME \
--cluster=CLUSTER_NAME \
--machine-type=MACHINE_TYPE \
--region=COMPUTE_REGION \
--accelerator=type=GPU_TYPE,count=GPU_QUANTITY,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=CLIENTS_PER_GPU,gpu-driver-version=DRIVER_VERSION
Replace the following:
- NODEPOOL_NAME: the name of your new node pool.
- CLUSTER_NAME: the name of your cluster, which must run GKE version 1.23.7-gke.1400 or later.
- COMPUTE_REGION: the Compute Engine region of your cluster. For zonal clusters, specify --zone=COMPUTE_ZONE.
- MACHINE_TYPE: the Compute Engine machine type for your nodes. We recommend that you select an Accelerator-optimized machine type.
- GPU_TYPE: the GPU type, which must be an NVIDIA GPU platform such as nvidia-tesla-v100.
- GPU_QUANTITY: the number of physical GPUs to attach to each node in the node pool.
- CLIENTS_PER_GPU: the maximum number of containers that can share each physical GPU.
- DRIVER_VERSION: the NVIDIA driver version to install. Can be one of the following:
  - default: Install the default driver version for your GKE version.
  - latest: Install the latest available driver version for your GKE version. Available only for nodes that use Container-Optimized OS.
  - disabled: Skip automatic driver installation. You must manually install a driver after you create the node pool. If you omit gpu-driver-version, this is the default option.
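As an illustration, the following command adds a time-sharing node pool to an existing cluster, with one NVIDIA T4 GPU per node shared by up to three containers. The node pool name, cluster name, region, and machine type are placeholder values; substitute values that match your environment.
gcloud container node-pools create timeshare-pool \
    --cluster=timeshare-cluster \
    --machine-type=n1-standard-4 \
    --region=us-central1 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=3,gpu-driver-version=default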
Install NVIDIA GPU device drivers
Before you proceed, connect to your cluster by running the following command:
gcloud container clusters get-credentials CLUSTER_NAME
If you chose to disable automatic driver installation when creating the cluster, or if you use a GKE version earlier than 1.27.2-gke.1200, you must manually install a compatible NVIDIA driver to manage the GPU time-sharing division of the physical GPUs. To install the drivers, you deploy a GKE installation DaemonSet that sets the drivers up.
For instructions, refer to Installing NVIDIA GPU device drivers.
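For nodes that use Container-Optimized OS, the manual installation typically amounts to applying the driver installer DaemonSet that the linked page provides, similar to the following command; check that page for the current manifest URL for your node image and GKE version.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml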
If you plan to use node auto-provisioning in your cluster, you must also configure node auto-provisioning with the scopes that allow GKE to install the GPU device drivers for you. For instructions, refer to Using node auto-provisioning with GPUs.
Verify the GPU resources available on your nodes
To verify that the number of GPUs visible in your nodes matches the number you specified when you enabled GPU time-sharing, describe your nodes:
kubectl describe nodes NODE_NAME
The output is similar to the following:
...
Capacity:
...
nvidia.com/gpu: 3
Allocatable:
...
nvidia.com/gpu: 3
In this example output, the number of GPU resources on the node is 3 because the value that was specified for max-shared-clients-per-gpu was 3 and the count of physical GPUs to attach to the node was 1. As another example, if the count of physical GPUs was 2, the output would show 6 allocatable GPU resources, three on each physical GPU.
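To check the advertised GPU count across all of your time-sharing nodes at once, you can query the allocatable resources directly. This is a sketch that assumes your nodes carry the cloud.google.com/gke-gpu-sharing-strategy node label described later on this page; the backslashes escape the dots in the nvidia.com/gpu resource name.
kubectl get nodes -l cloud.google.com/gke-gpu-sharing-strategy=time-sharing \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'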
Deploy workloads that use GPU time-sharing
As an application operator who is deploying GPU workloads, you can select GPU time-sharing nodes by specifying the appropriate node labels in a nodeSelector in your manifests. When planning your requests, review the request limits to ensure that GKE doesn't reject your deployments.
To deploy a workload to consume GPU time-sharing, complete the following steps:
- Add a nodeSelector to your workload manifest for the following labels:
  - cloud.google.com/gke-gpu-sharing-strategy: time-sharing: selects nodes that use GPU time-sharing.
  - cloud.google.com/gke-max-shared-clients-per-gpu: "CLIENTS_PER_GPU": selects nodes that allow a specific number of containers to share the underlying GPU.
- Add the nvidia.com/gpu=1 GPU resource request to your container specification, in spec.containers.resources.limits.
For example, the following steps show you how to deploy three Pods to a GPU time-sharing node pool. GKE allocates each container to the same physical GPU. The containers print the UUID of the GPU that's attached to that container.
- Save the following manifest as gpu-timeshare.yaml:
Autopilot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-simple
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cuda-simple
  template:
    metadata:
      labels:
        app: cuda-simple
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: "GPU_TYPE"
        cloud.google.com/gke-gpu-sharing-strategy: "time-sharing"
        cloud.google.com/gke-max-shared-clients-per-gpu: "CLIENTS_PER_GPU"
        cloud.google.com/gke-accelerator-count: "GPU_COUNT"
      containers:
      - name: cuda-simple
        image: nvidia/cuda:11.0.3-base-ubi7
        command:
        - bash
        - -c
        - |
          /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
        resources:
          limits:
            nvidia.com/gpu: 1
Replace the following:
- GPU_TYPE: the GPU type.
- CLIENTS_PER_GPU: the number of workloads that will use this GPU. For this example, use 3.
- GPU_COUNT: the number of physical GPUs to attach to the node. For this example, use 1.
Standard
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-simple
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cuda-simple
  template:
    metadata:
      labels:
        app: cuda-simple
    spec:
      nodeSelector:
        cloud.google.com/gke-gpu-sharing-strategy: "SHARING_STRATEGY"
        cloud.google.com/gke-max-shared-clients-per-gpu: "CLIENTS_PER_GPU"
      containers:
      - name: cuda-simple
        image: nvidia/cuda:11.0.3-base-ubi7
        command:
        - bash
        - -c
        - |
          /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
        resources:
          limits:
            nvidia.com/gpu: 1
Replace the following:
- SHARING_STRATEGY: "time-sharing", to request time-sharing for your GPU.
- CLIENTS_PER_GPU: the number of workloads that will use this GPU. For this example, use 3.
Apply the manifest:
kubectl apply -f gpu-timeshare.yaml
Check that all Pods are running:
kubectl get pods -l=app=cuda-simple
Check the logs for any Pod to view the UUID of the GPU:
kubectl logs POD_NAME
The output is similar to the following:
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-0771302b-eb3a-6756-7a23-0adcae8efd47)
If your nodes have one physical GPU attached, check the logs for any other Pod on the same node to verify that the GPU UUID is the same:
kubectl logs POD2_NAME
The output is similar to the following:
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-0771302b-eb3a-6756-7a23-0adcae8efd47)
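To compare the GPU UUID across every Pod in the Deployment in a single pass, you can loop over the Pods. This is a quick sketch rather than part of the official procedure; if all three Pods share one physical GPU, every line of output shows the same UUID.
for pod in $(kubectl get pods -l app=cuda-simple -o name); do
  kubectl logs "$pod"
done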
Use GPU time-sharing with multi-instance GPUs
As a platform administrator, you might want to combine multiple GKE GPU features. GPU time-sharing works with multi-instance GPUs, which partition a single physical GPU into up to seven slices. These partitions are isolated from each other. You can configure GPU time-sharing for each multi-instance GPU partition.
For example, if you set the gpu-partition-size to 1g.5gb, the underlying GPU would be split into seven partitions. If you also set max-shared-clients-per-gpu to 3, each partition would support up to three containers, for a total of up to 21 GPU time-sharing devices available to allocate in that physical GPU. To learn about how the gpu-partition-size converts to actual partitions, refer to Multi-instance GPU partitions.
To enable GPU time-sharing together with multi-instance GPUs, do the following:
Autopilot
With Autopilot, you use GPU time-sharing and multi-instance GPUs together by specifying both sets of node selectors in your workload manifest, as in the following example:
apiVersion: apps/v1
kind: Deployment
metadata:
name: cuda-simple
spec:
replicas: 7
selector:
matchLabels:
app: cuda-simple
template:
metadata:
labels:
app: cuda-simple
spec:
nodeSelector:
cloud.google.com/gke-gpu-partition-size: 1g.5gb
cloud.google.com/gke-gpu-sharing-strategy: time-sharing
cloud.google.com/gke-max-shared-clients-per-gpu: "3"
cloud.google.com/gke-accelerator: nvidia-tesla-a100
cloud.google.com/gke-accelerator-count: "1"
containers:
- name: cuda-simple
image: nvidia/cuda:11.0.3-base-ubi7
command:
- bash
- -c
- |
/usr/local/nvidia/bin/nvidia-smi -L; sleep 300
resources:
limits:
nvidia.com/gpu: 1
Standard
With Standard, you create a node pool that has both GPU time-sharing and multi-instance GPUs enabled by running the following command:
gcloud container node-pools create NODEPOOL_NAME \
--cluster=CLUSTER_NAME \
--machine-type=MACHINE_TYPE \
--region=COMPUTE_REGION \
--accelerator=type=nvidia-tesla-a100,count=GPU_QUANTITY,gpu-partition-size=PARTITION_SIZE,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=CLIENTS_PER_GPU,gpu-driver-version=DRIVER_VERSION
Replace PARTITION_SIZE with the multi-instance GPU partition size that you want, such as 1g.5gb.
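For instance, the following command creates an A100 node pool where each physical GPU is split into 1g.5gb partitions and each partition can be shared by up to three containers, for up to 21 allocatable nvidia.com/gpu devices per physical GPU. The node pool name, cluster name, region, and machine type are illustrative placeholders.
gcloud container node-pools create timeshare-mig-pool \
    --cluster=timeshare-cluster \
    --machine-type=a2-highgpu-1g \
    --region=us-central1 \
    --accelerator=type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=3,gpu-driver-version=default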
Limitations
- With GPU time-sharing, GKE doesn't enforce address space isolation, performance isolation, or error isolation between containers that share a physical GPU. Additionally, memory limits aren't enforced on GPUs. To avoid running into out-of-memory (OOM) issues, set GPU memory limits in your workloads. To avoid security issues, only deploy workloads that are within the same trust boundary with GPU time-sharing.
- To prevent unexpected behavior during capacity allocation, GKE might reject certain GPU time-sharing requests. For details, see GPU requests for GPU time-sharing.
- The maximum number of containers that can use time-sharing in a single physical GPU is 48. When planning your GPU time-sharing configuration, consider the resource needs of your workloads and the capacity of the underlying physical GPUs to optimize your performance and responsiveness.
What's next
- Learn more about GPU sharing strategies available in GKE.
- Learn more about GPUs.
- Learn more about Running multi-instance GPUs.
- For more information about compute preemption for the NVIDIA GPU, refer to the NVIDIA Pascal Tuning Guide.