Time-sharing GPUs on GKE


This page gives you information about time-sharing GPUs in Google Kubernetes Engine (GKE). To learn how to enable and configure time-sharing GPUs in your nodes, refer to Share GPUs with multiple workloads using time-sharing.

This page assumes that you're familiar with Kubernetes concepts such as Pods, nodes, Deployments, and namespaces, and with GKE concepts such as node pools, autoscaling, and node auto-provisioning.

How GPU requests work in Kubernetes

Kubernetes enables applications to precisely request the resource amounts they need to function. While you can request fractional CPU units for applications, you can't request fractional GPU units. Pod manifests must request GPU resources in integers, which means that an entire physical GPU is allocated to one container even if the container only needs a fraction of the resources to function correctly. This is inefficient and can be costly, especially when you're running multiple workloads with similar low GPU requirements.

What is time-sharing?

Time-sharing is a GKE feature that lets multiple containers share a single physical GPU attached to a node. Using GPU time-sharing in GKE lets you more efficiently use your attached GPUs and save running costs.

How GPU time-sharing works

When you create a new node pool with GPUs attached, you can specify the maximum number of containers allowed to share a physical GPU. Every GPU in the node pool is shared based on the setting you specify at the node pool level.

GKE then adds node labels that specify the maximum number of containers allowed to share each GPU on the node, and the sharing strategy to use. For example, if you specified a maximum number of three when you created the node pool, the node labels would look similar to the following:

cloud.google.com/gke-max-shared-clients-per-gpu: "3"
cloud.google.com/gke-gpu-sharing-strategy: time-sharing

The GPU driver presents the physical GPU to the kubelet on your nodes as multiple GPU units. The kubelet allocates those GPU units to containers as requested. If you have multiple containers running on the physical GPU, the GPU hardware and the driver facilitate time-shared access to the GPU resources by context switching between the containers based on usage. The requests in your Pod manifests don't need to change, because the kubelet sees the time-shared GPU as multiple GPUs that are available for use.
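Because the kubelet sees the time-shared GPU as multiple allocatable GPUs, a container requests one in the usual way. The following is a minimal sketch; the Pod name and container image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: timeshared-gpu-example   # placeholder name
spec:
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.0.3-base-ubi8   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # one time-shared GPU unit; the request syntax is unchanged
```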

The number of time-shared GPUs is not a measure of the physical GPU compute power that the workloads get. Context switching lets every container access the full power of the underlying physical GPU, with the driver and GPU hardware handling the logic for when each container can access the GPU resources.

When should you use time-sharing GPUs?

You can configure time-sharing GPUs on any NVIDIA GPU. Time-shared GPUs are ideal for running workloads that don't need to use high amounts of GPU resources all the time, such as the following:

  • Homogeneous workloads with low GPU requests
  • Burstable GPU workloads

Examples of these types of workloads include the following:

  • Rendering
  • Inference
  • Small-scale machine learning model training

GKE also provides multi-instance GPUs, which split a physical GPU into up to seven discrete instances, each of which is isolated from the others at the hardware level. A container that uses a multi-instance GPU instance can only access the compute and memory resources available to that instance. Multi-instance GPUs require NVIDIA Tesla A100 accelerators.

You should use multi-instance GPUs if your workloads have requirements like the following:

  • Hardware isolation from other containers on the same physical GPU.
  • Predictable throughput and latency for parallel workloads, even when other containers saturate resource usage on other instances.

If you want to maximize your GPU utilization, you can configure time-sharing for each multi-instance GPU partition. You can then run multiple containers on each partition, with those containers sharing access to the resources on that partition.

Request time-shared GPUs in workloads

To start using time-shared GPUs, a cluster administrator needs to create a GPU cluster with time-sharing enabled by following the instructions in Share GPUs with multiple workloads using time-sharing. Before you specify a maximum number of containers to share a physical GPU, you should consider the resource needs of the workloads and the capacity of the underlying GPU.

After time-sharing GPUs are enabled on a cluster or node pool, application developers can request GPUs in workloads. Developers can tell GKE to schedule those workloads on time-shared GPU nodes by specifying a nodeSelector with the following labels:

  • cloud.google.com/gke-max-shared-clients-per-gpu: Select nodes that allow a specific number of clients to share the underlying GPU.
  • cloud.google.com/gke-gpu-sharing-strategy: Select nodes that use the time-sharing strategy for GPUs.

For instructions, refer to Deploy workloads that use time-shared GPUs.
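For example, a Pod that targets nodes where up to three clients share each GPU might use a nodeSelector similar to the following sketch; the Pod name and container image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: timesharing-selector-example   # placeholder name
spec:
  nodeSelector:
    cloud.google.com/gke-max-shared-clients-per-gpu: "3"
    cloud.google.com/gke-gpu-sharing-strategy: time-sharing
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.0.3-base-ubi8   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```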

Scheduling behavior with time-shared GPUs

If you're a developer deploying workloads on time-shared GPUs, you should familiarize yourself with how scheduling behavior changes based on the nodeSelector labels and GPU requests in your manifests.

The following list describes how scheduling behavior changes based on the node labels that you specify in your manifests:

  • Both cloud.google.com/gke-max-shared-clients-per-gpu and cloud.google.com/gke-gpu-sharing-strategy: GKE schedules workloads onto available nodes that match both labels. If there are no available nodes, GKE uses autoscaling and node auto-provisioning to create new nodes or node pools that match both labels.
  • cloud.google.com/gke-max-shared-clients-per-gpu only: GKE schedules workloads onto available nodes that match the label. If there are no available nodes, GKE uses autoscaling and node auto-provisioning to create new nodes or node pools that match the label. By default, auto-provisioned nodes are also given the cloud.google.com/gke-gpu-sharing-strategy=time-sharing label.
  • cloud.google.com/gke-gpu-sharing-strategy only: GKE schedules workloads onto available nodes that use time-sharing. If there are multiple time-sharing node pools with different values for cloud.google.com/gke-max-shared-clients-per-gpu, the workload can be scheduled on any available node. If there are no available nodes in any node pool, the cluster autoscaler scales up the node pool with the lowest value for cloud.google.com/gke-max-shared-clients-per-gpu. If all node pools are at capacity, node auto-provisioning creates a new node pool with a default value of cloud.google.com/gke-max-shared-clients-per-gpu=2.

Request limits for time-shared GPUs

If you're developing GPU applications that will run on time-shared GPU nodes, you can request only one time-shared GPU for each container. GKE rejects requests for more than one time-shared GPU in a container to avoid unexpected behavior, and because the number of time-shared GPUs requested is not a measure of the compute power available to the container.

The following list shows what to expect when you request specific quantities of GPUs:

  • 1 time-shared GPU per container: GKE allows the request.
  • More than 1 time-shared GPU per container: GKE rejects the request. This behavior also applies to requests for more than one multi-instance GPU instance in a container, because each GPU instance is considered a discrete physical GPU.

If GKE rejects the workload, you'll see an error message similar to the following:

status:
  message: 'Pod Allocate failed due to rpc error: code = Unknown desc = [invalid request
    for sharing GPU (time-sharing), at most 1 nvidia.com/gpu can be requested on GPU nodes], which is unexpected'
  phase: Failed
  reason: UnexpectedAdmissionError
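For example, a resources section similar to the following sketch would be rejected, because it requests more than one time-shared GPU (the container name is a placeholder):

```yaml
# Rejected on time-shared GPU nodes: more than one time-shared GPU per container
containers:
- name: cuda-app   # placeholder name
  resources:
    limits:
      nvidia.com/gpu: 2
```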

Monitor time-shared GPU nodes

Use Cloud Monitoring to monitor the performance of your time-shared GPUs. GKE sends metrics for each time-shared GPU device to Cloud Monitoring. These metrics are different from the metrics for regular GPUs that are not time-shared.

You can check the following metrics for each time-shared GPU device in Cloud Monitoring:

  • Duty cycle (node/accelerator/duty_cycle): Percentage of time over the last sample period (10 seconds) during which the time-shared GPU device was actively processing. Ranges from 1% to 100%.
  • Memory usage (node/accelerator/memory_used): Amount of accelerator memory allocated in bytes for each time-shared GPU device.
  • Memory capacity (node/accelerator/memory_total): Total accelerator memory in bytes for each time-shared GPU device.

The main difference between the time-shared GPU metrics and metrics for regular physical GPUs is that the time-shared metrics apply to the node level, and the metrics for regular physical GPUs apply at the container level.

Limitations

  • GKE enforces memory (address space) isolation, performance isolation, and fault isolation between containers that share a physical GPU. However, memory limits aren't enforced on time-shared GPUs. To avoid running into out-of-memory (OOM) issues, set GPU memory limits in your applications. To avoid security issues, only deploy workloads that are in the same trust boundary to time-shared GPUs.
  • GKE might reject certain time-shared GPU requests to prevent unexpected behavior during capacity allocation. For details, see Request limits for time-shared GPUs.
  • The maximum number of containers that can share a single physical GPU is 48. When planning your time-sharing configuration, consider the resource needs of your workloads and the capacity of the underlying physical GPUs to optimize your performance and responsiveness.

What's next