Deploy GPUs for batch workloads with ProvisioningRequest


This page shows you how to optimize GPU obtainability by using the ProvisioningRequest API. We recommend this feature for large-scale batch workloads that can run during off-peak hours under defined GPU capacity management conditions. Examples of these workloads include deep learning model training or simulations that need large numbers of GPUs with an atomic provisioning model, meaning that all resources are created at the same time.

To run GPU workloads in Google Kubernetes Engine (GKE) without the ProvisioningRequest API, see Run GPUs in GKE Standard node pools.

When to use ProvisioningRequest

We recommend that you use ProvisioningRequest if your workloads meet all of the following conditions:

  • You request GPUs to run your workloads.
  • You have limited or no reserved GPU capacity and you want to improve obtainability of GPU resources.
  • Your workload is time-flexible and your use case can afford to wait to get all the requested capacity, for example when GKE allocates the GPU resources outside of the busiest hours.
  • Your workload requires multiple nodes and can't start running until all GPU nodes are provisioned and ready at the same time (for example, distributed machine learning training).

To learn more details about ProvisioningRequest, see How ProvisioningRequest works.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Use node pools with ProvisioningRequest

You can use any of the following three methods to indicate that ProvisioningRequest can work with specific node pools in your cluster:

Create a node pool

Create a node pool with ProvisioningRequest enabled using the gcloud CLI:

gcloud beta container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --enable-queued-provisioning \
    --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes TOTAL_MAX_NODES \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair

Replace the following:

  • NODEPOOL_NAME: The name you choose for the node pool.
  • CLUSTER_NAME: The name of the cluster.
  • LOCATION: The cluster's Compute Engine region, such as us-central1.
  • GPU_TYPE: The GPU type.
  • AMOUNT: The number of GPUs to attach to nodes in the node pool.
  • DRIVER_VERSION: The NVIDIA driver version to install. Can be one of the following:
    • default: Install the default driver version for your GKE version.
    • latest: Install the latest available driver version for your GKE version. Available only for nodes that use Container-Optimized OS.
  • TOTAL_MAX_NODES: The maximum number of nodes to automatically scale for the entire node pool.

Optionally, you can use the following flags; a combined example appears at the end of this section:

  • --no-enable-autoupgrade: Recommended. Disables node auto-upgrades. Supported only in GKE clusters not currently enrolled in a release channel. To learn more, see Disable node auto-upgrades for an existing node pool.
  • --node-locations=COMPUTE_ZONES: The comma-separated list of one or more zones where GKE creates the GPU nodes. The zones must be in the same region as the cluster. Choose zones that have available GPUs.
  • --machine-type=MACHINE_TYPE: The Compute Engine machine type for the nodes. Required if GPU_TYPE is nvidia-tesla-a100 or nvidia-a100-80gb, which can only use an A2 machine type, or if GPU_TYPE is nvidia-l4, which can only use a G2 machine type. For all other GPUs, this flag is optional.
  • --enable-gvnic: This flag enables gVNIC on the GPU node pools to increase network traffic speed.

This command creates a node pool with the following configuration:

  • GKE enables queued provisioning and cluster autoscaling.
  • The node pool initially has zero nodes.
  • The --enable-queued-provisioning flag enables ProvisioningRequests and adds the cloud.google.com/gke-queued taint to the node pool.
  • The --no-enable-autorepair and --no-enable-autoupgrade flags disable automatic repair and upgrade of nodes, which could disrupt workloads running on repaired or upgraded nodes. You can only disable node auto-upgrade on clusters that are not enrolled in a release channel.
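
For example, the following is a minimal sketch that combines the base create command with several of the optional flags described earlier. The machine type, GPU type, zones, and node limit are illustrative assumptions; substitute values that match your project and region:

gcloud beta container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=us-central1 \
    --node-locations=us-central1-a,us-central1-b \
    --machine-type=a2-highgpu-1g \
    --accelerator type=nvidia-tesla-a100,count=1,gpu-driver-version=default \
    --enable-queued-provisioning \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes=10 \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair \
    --no-enable-autoupgrade \
    --enable-gvnic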

Update existing node pool and enable ProvisioningRequests

You can enable ProvisioningRequests for an existing node pool. Before you do, review the following prerequisites to make sure that the node pool is configured correctly.

Prerequisites

  • Ensure that you create a node pool with the --reservation-affinity=none flag. This flag is required for enabling ProvisioningRequests later, as you can't change the reservation affinity after node pool creation.

  • Ensure that you maintain at least one node pool without ProvisioningRequest handling enabled for the cluster to function correctly.

  • Ensure that the node pool is empty. You can resize the node pool so that it has zero nodes, as shown in the sketch after this list.

  • Ensure that autoscaling is enabled and correctly configured.

  • Ensure that auto-repairs are disabled.
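
The following is a minimal sketch of how you might resize an existing node pool to zero nodes to satisfy the empty node pool prerequisite. It assumes that your gcloud version accepts the --location flag for this command; older versions use --zone or --region instead:

gcloud container clusters resize CLUSTER_NAME \
    --node-pool=NODEPOOL_NAME \
    --num-nodes=0 \
    --location=LOCATION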

Enable ProvisioningRequests for existing node pool

You can enable ProvisioningRequest for an existing node pool using the gcloud CLI:

gcloud beta container node-pools update NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --enable-queued-provisioning

Replace the following:

  • NODEPOOL_NAME: The name of the node pool.
  • CLUSTER_NAME: The name of the cluster.
  • LOCATION: The cluster's Compute Engine region, such as us-central1.

This node pool update command results in the following configuration changes:

  • The --enable-queued-provisioning flag enables ProvisioningRequests and adds the cloud.google.com/gke-queued taint to the node pool.

Optionally, you can also update the following node pool settings (example commands follow this list):

  • Disable node auto-upgrades: We recommend that you disable node auto-upgrades as node pool upgrades are not supported when using ProvisioningRequests. To disable node auto-upgrades, ensure your GKE cluster is not enrolled in a release channel.
  • Enable gVNIC on the GPU node pools: Google Virtual NIC (gVNIC) increases network traffic speed for GPU nodes.
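
For example, the following sketches show how these optional updates might look with the gcloud CLI. They assume that your gcloud version supports the --no-enable-autoupgrade and --enable-gvnic flags on the node-pools update command; run each update as a separate command:

# Disable node auto-upgrades (the cluster must not be enrolled in a release channel).
gcloud container node-pools update NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --no-enable-autoupgrade

# Enable gVNIC on the GPU node pool.
gcloud container node-pools update NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --enable-gvnic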

Enable node auto-provisioning to create node pools for ProvisioningRequests

You can use node auto-provisioning to manage node pools for ProvisioningRequests for clusters running version 1.29.2-gke.1553000 or later. When you enable node auto-provisioning and create the ProvisioningRequest, GKE creates node pools with the required resources for the associated workload.

To enable node auto-provisioning, consider the following settings and complete the steps in Configure GPU limits (a command-line sketch follows this list):

  • Specify the required resources for your ProvisioningRequests when enabling the feature. To list the available resourceTypes, run gcloud compute accelerator-types list.
  • We recommend that you use the --no-enable-autoprovisioning-autoupgrade and --no-enable-autoprovisioning-autorepair flags to disable node auto-upgrades and node auto-repair. To learn more, see Configure disruption settings for node pools with workloads using ProvisioningRequest.
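
The following is a minimal sketch of enabling node auto-provisioning with GPU limits and the recommended flags. The CPU, memory, and accelerator limits and the accelerator type are illustrative assumptions, and the sketch assumes that your gcloud version accepts the --location flag; adjust the values to your workload:

gcloud container clusters update CLUSTER_NAME \
    --location=LOCATION \
    --enable-autoprovisioning \
    --min-cpu=0 --max-cpu=100 \
    --min-memory=0 --max-memory=1000 \
    --min-accelerator type=nvidia-tesla-a100,count=0 \
    --max-accelerator type=nvidia-tesla-a100,count=8 \
    --no-enable-autoprovisioning-autoupgrade \
    --no-enable-autoprovisioning-autorepair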

Run your batch workloads with ProvisioningRequest

To use ProvisioningRequest, we recommend that you use Kueue. Kueue implements Job queueing, deciding when Jobs should wait and when they should start, based on quotas and a hierarchy for sharing resources fairly among teams. This simplifies the setup needed to use queued VMs.

You can use ProvisioningRequest without Kueue when you use your own internal batch scheduling tools or platform. To configure ProvisioningRequests for Jobs without Kueue, see ProvisioningRequests for Jobs without Kueue.

ProvisioningRequests for Jobs with Kueue

The following section shows you how to configure ProvisioningRequests for Jobs with Kueue. This section uses the samples in the dws-examples repository, which are published under the Apache 2.0 license.

Prepare your environment

  1. In Cloud Shell, run the following command:

    git clone https://github.com/GoogleCloudPlatform/ai-on-gke
    cd ai-on-gke/tutorials-and-examples/workflow-orchestration/dws-examples
    
  2. Install Kueue in your cluster with the necessary configuration to enable ProvisioningRequest integration:

    kubectl apply --server-side -f ./kueue-manifests.yaml
    

To learn more about Kueue installation, see Installation.

Create the Kueue resources

With the following manifest, you create a cluster-level queue named dws-cluster-queue and a LocalQueue named dws-local-queue in the default namespace. Jobs submitted to the dws-local-queue queue in this namespace use ProvisioningRequests to get the GPU resources.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io
  managedResources:
  - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "dws-cluster-queue"
spec:
  namespaceSelector: {} 
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 10000  # Infinite quota.
      - name: "memory"
        nominalQuota: 10000Gi # Infinite quota.
      - name: "nvidia.com/gpu"
        nominalQuota: 10000  # Infinite quota.
  admissionChecks:
  - dws-prov
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "dws-local-queue"
spec:
  clusterQueue: "dws-cluster-queue"
---

Deploy the LocalQueue:

kubectl create -f ./dws-queues.yaml

The output is similar to the following:

resourceflavor.kueue.x-k8s.io/default-flavor created
admissioncheck.kueue.x-k8s.io/dws-prov created
provisioningrequestconfig.kueue.x-k8s.io/dws-config created
clusterqueue.kueue.x-k8s.io/dws-cluster-queue created
localqueue.kueue.x-k8s.io/dws-local-queue created

If you want to run Jobs that use ProvisioningRequests in other namespaces, you can create additional LocalQueues using the preceding template.
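
For example, a LocalQueue for a hypothetical team-a namespace (the namespace name is an assumption for illustration) that points at the same ClusterQueue might look like the following:

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "team-a"
  name: "dws-local-queue"
spec:
  clusterQueue: "dws-cluster-queue"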

Run your Job

In the following manifest, the sample Job uses ProvisioningRequest:

apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue
  annotations:
    provreq.kueue.x-k8s.io/maxRunDurationSeconds: "600"
spec:
  parallelism: 1
  completions: 1
  suspend: true
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: NODEPOOL_NAME
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.0.3
        args: ["120s"]
        resources:
          requests:
            cpu: "100m"
            memory: "100Mi"
            nvidia.com/gpu: 1
          limits:
            cpu: "100m"
            memory: "100Mi"
            nvidia.com/gpu: 1
      restartPolicy: Never

This manifest includes the following fields that are relevant for the ProvisioningRequest configuration:

  • The kueue.x-k8s.io/queue-name: dws-local-queue label tells GKE that Kueue is responsible for orchestrating that Job. This label also defines the queue where the Job is queued.
  • The suspend: true field tells GKE to create the Job resource but not to schedule the Pods yet. Kueue changes that field to false when the nodes are ready for Job execution.
  • nodeSelector tells GKE to schedule the Job only on the specified node pool. The value should match NODEPOOL_NAME, the name of the node pool with queued provisioning enabled.
  1. Run your Job:

    kubectl create -f ./job.yaml
    

    The output is similar to the following:

    job.batch/sample-job created
    
  2. Check the status of your Job:

    kubectl describe job sample-job
    

    The output is similar to the following:

    Events:
      Type    Reason            Age    From                        Message
      ----    ------            ----   ----                        -------
      Normal  Suspended         5m17s  job-controller              Job suspended
      Normal  CreatedWorkload   5m17s  batch/job-kueue-controller  Created Workload: default/job-sample-job-7f173
      Normal  Started           3m27s  batch/job-kueue-controller  Admitted by clusterQueue dws-cluster-queue
      Normal  SuccessfulCreate  3m27s  job-controller              Created pod: sample-job-9qsfd
      Normal  Resumed           3m27s  job-controller              Job resumed
      Normal  Completed         12s    job-controller              Job completed
    

The ProvisioningRequest with Kueue integration also supports other workload types available in the open source ecosystem, like the following:

  • RayJob
  • JobSet
  • Kubeflow MPIJob, TFJob, and PyTorchJob
  • Kubernetes Pods that are frequently used by workflow orchestrators
  • Flux mini cluster

To learn more about this support, see Kueue's batch user documentation.
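
For example, the following is a minimal sketch of a plain Pod managed by Kueue, assuming that the Pod integration is enabled in your Kueue configuration. The container image and values are the same placeholders used earlier on this page:

apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: NODEPOOL_NAME
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: dummy
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.0.3
    args: ["120s"]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never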

ProvisioningRequests for Jobs without Kueue

Create a request through the ProvisioningRequest API for each Job. A ProvisioningRequest doesn't start the Pods; it only provisions the nodes.

  1. Create the following provisioning-request.yaml manifest:

    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: POD_TEMPLATE_NAME
      namespace: NAMESPACE_NAME
    template:
      spec:
        nodeSelector:
            cloud.google.com/gke-nodepool: NODEPOOL_NAME
        tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
        containers:
            - name: pi
              image: perl
              command: ["/bin/sh"]
              resources:
                limits:
                  cpu: "700m"
                  nvidia.com/gpu: 1
                requests:
                  cpu: "700m"
                  nvidia.com/gpu: 1
        restartPolicy: Never
    ---
    apiVersion: autoscaling.x-k8s.io/v1beta1
    kind: ProvisioningRequest
    metadata:
      name: PROVISIONING_REQUEST_NAME
      namespace: NAMESPACE_NAME
    spec:
      provisioningClassName: queued-provisioning.gke.io
      parameters:
        maxRunDurationSeconds: "MAX_RUN_DURATION_SECONDS"
      podSets:
      - count: COUNT
        podTemplateRef:
          name: POD_TEMPLATE_NAME
    

    Replace the following:

    • NAMESPACE_NAME: The name of your Kubernetes namespace. The namespace must be the same as the namespace of the Pods.
    • PROVISIONING_REQUEST_NAME: The name of the ProvisioningRequest. You'll refer to this name in the Pod annotation.
    • MAX_RUN_DURATION_SECONDS: Optionally, the maximum runtime of a node in seconds, up to the default of seven days. To learn more, see How ProvisioningRequest works. You can't change this value after creation of the request. This field is available in Preview in GKE version 1.28.5-gke.1355000 or later.
    • COUNT: Number of Pods requested. The nodes are scheduled atomically in one zone.
    • POD_TEMPLATE_NAME: The name of the PodTemplate. GKE references this value in the podTemplateRef of the ProvisioningRequest PodSet.
    • NODEPOOL_NAME: The name of the node pool with queued provisioning enabled.
  2. Apply the manifest:

    kubectl apply -f provisioning-request.yaml
    

Configure the Pods

In the Job spec, link the Pods to the ProvisioningRequest using the following annotations:

apiVersion: batch/v1
kind: Job
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/consume-provisioning-request: PROVISIONING_REQUEST_NAME
        cluster-autoscaler.kubernetes.io/provisioning-class-name: "queued-provisioning.gke.io"
    spec:
      ...

The Pod annotation key cluster-autoscaler.kubernetes.io/consume-provisioning-request defines which ProvisioningRequest to consume. GKE uses the consume-provisioning-request and provisioning-class-name annotations to do the following:

  • To schedule the Pods only in the nodes provisioned by ProvisioningRequest.
  • To avoid double counting of resource requests between Pods and ProvisioningRequests in the cluster autoscaler.
  • To inject the safe-to-evict: false annotation, which prevents the cluster autoscaler from moving Pods between nodes and interrupting batch computations. You can change this behavior by specifying safe-to-evict: true in the Pod annotations, as shown in the sketch after this list.
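
A minimal sketch of that override, assuming that you accept that the cluster autoscaler can then evict these Pods:

apiVersion: batch/v1
kind: Job
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/consume-provisioning-request: PROVISIONING_REQUEST_NAME
        cluster-autoscaler.kubernetes.io/provisioning-class-name: "queued-provisioning.gke.io"
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"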

Observe the status of ProvisioningRequest

The status of a ProvisioningRequest defines whether a Pod can be scheduled. You can use Kubernetes watches to observe changes efficiently, or other tooling that you already use for tracking the status of Kubernetes objects. The following list describes the possible statuses of a ProvisioningRequest and each possible outcome (a watch command sketch follows the list):

  • Pending: The request hasn't been seen and processed yet. After processing, the request transitions to the Accepted or Failed state.
  • Accepted=true: The request is accepted and is waiting for resources to be available. The request transitions to the Provisioned state if resources are found and nodes are provisioned, or to the Failed state if that isn't possible.
  • Provisioned=true: The nodes are ready. You have 10 minutes to start the Pods to consume the provisioned resources. After this time, the cluster autoscaler considers the nodes as not needed and removes them.
  • Failed=true: The nodes can't be provisioned due to errors. Failed=true is a terminal state. Troubleshoot the condition based on the information in the Reason and Message fields of the condition. Create and retry a new ProvisioningRequest.
  • Provisioned=false: The nodes haven't been provisioned yet.

    If Reason=NotProvisioned, this is a temporary state before all resources are available.

    If Reason=QuotaExceeded, troubleshoot the condition based on this reason and the information in the Message field of the condition. You might need to request more quota. For more details, see the Check if ProvisioningRequest is limited by quota section. This Reason is only available with GKE version 1.29.2-gke.1181000 or later.
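
For example, to observe status changes as they happen, you can watch the ProvisioningRequest object. The following is a minimal sketch that uses the provreq short name used elsewhere on this page:

kubectl get provreq PROVISIONING_REQUEST_NAME \
    --namespace NAMESPACE \
    --output yaml \
    --watch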

Start the Pods

When the ProvisioningRequest reaches the Provisioned=true status, you can run your Job to start the Pods. This avoids proliferation of unschedulable Pods for pending or failed requests, which can impact kube-scheduler and cluster autoscaler performance.

Alternatively, if you don't care about having unschedulable Pods, you can create Pods in parallel with ProvisioningRequest.
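
The following is a minimal sketch of checking for the Provisioned=true condition before creating the Job; the jsonpath expression is one possible way to extract the condition status:

kubectl get provreq PROVISIONING_REQUEST_NAME \
    --namespace NAMESPACE \
    --output jsonpath='{.status.conditions[?(@.type=="Provisioned")].status}'

# When the command prints "True", create your Job (job.yaml is a placeholder name):
kubectl create -f job.yaml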

Cancel ProvisioningRequest request

To cancel the request before it's provisioned, you can delete the ProvisioningRequest:

kubectl delete provreq PROVISIONING_REQUEST_NAME -n NAMESPACE

In most cases, deleting ProvisioningRequest stops nodes from being created. However, depending on timing, for example if nodes were already being provisioned, the nodes might still end up created. In these cases, the cluster autoscaler removes the nodes after 10 minutes if no Pods are created.

How ProvisioningRequest works

With the ProvisioningRequest API, the following steps happen:

  1. You tell GKE that your workload can wait, for an indeterminate amount of time, until all the required nodes are ready to use at once.
  2. The cluster autoscaler accepts your request and calculates the number of necessary nodes, treating them as a single unit.
  3. The request waits until all needed resources are available in a single zone.
  4. The cluster autoscaler provisions the necessary nodes when available, all at once.
  5. All Pods of the workload are able to run together on newly provisioned nodes.
  6. The provisioned nodes are limited to seven days of runtime, or less if you set the maxRunDurationSeconds parameter to indicate that the workloads need less time to run. To learn more, see Limit the runtime of a VM (Preview). This capability is available with GKE version 1.28.5-gke.1355000 or later. After this time, the nodes and the Pods running on them are preempted. If the Pods finish sooner and the nodes aren't utilized, the cluster autoscaler removes them according to the autoscaling profile.
  7. The nodes aren't reused between ProvisioningRequests. Each ProvisioningRequest triggers the creation of new nodes with a new seven-day maximum runtime.

Quota

The number of ProvisioningRequests that are in the Accepted state is limited by a dedicated quota, configured per project, independently for each region.

Check quota in the Google Cloud console

To check the name of the quota limit and current usage in the Google Cloud console, follow these steps:

  1. Go to the Quotas page in the Google Cloud console:

    Go to Quotas

  2. In the Filter box, select the Metric property, enter active_resize_requests, and press Enter.

The default value is 100. To increase the quota, follow the steps in the Request a higher quota limit guide.

Check if ProvisioningRequest is limited by quota

If your ProvisioningRequest request is taking longer than expected to be fulfilled, check that the request isn't limited by quota. You might need to request more quota.

For clusters running version 1.29.2-gke.1181000 or later, check whether specific quota limitations are preventing your request from being fulfilled:

kubectl describe provreq PROVISIONING_REQUEST_NAME \
    --namespace NAMESPACE

The output is similar to the following:

…
Last Transition Time:  2024-01-03T13:56:08Z
    Message:               Quota 'NVIDIA_P4_GPUS' exceeded. Limit: 1.0 in region europe-west4.
    Observed Generation:   1
    Reason:                QuotaExceeded
    Status:                False
    Type:                  Provisioned
…

In this example, GKE can't deploy nodes because there isn't enough quota in the europe-west4 region.

Configure disruption settings for node pools with workloads using ProvisioningRequest

Workloads requiring the availability of all nodes, or most nodes, in a node pool are sensitive to evictions. Automatic repair or upgrade of a node provisioned using the ProvisioningRequest API isn't supported because these operations evict all workloads running on that node and make the workloads unschedulable.

To minimize disruption to running workloads using ProvisioningRequest, we recommend the following measures:

  • Depending on your cluster's release channel enrollment, prevent node auto-upgrades from disrupting your workloads. For example, if your cluster isn't enrolled in a release channel, disable node auto-upgrades.
  • Disable node auto-repair.
  • Use maintenance windows and exclusions to minimize disruption to running workloads, while still ensuring that there is a window of time in which GKE can perform automatic maintenance on the node pool. Because these maintenance tools require you to set a specific window of time in which GKE can disrupt the node pool, we recommend that you set this window for when there are no running workloads. A sketch of a maintenance window configuration follows this list.
  • To ensure that your node pool remains up-to-date, we recommend that you manually upgrade your node pool when there are no active ProvisioningRequest requests and the node pool is empty.
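
The following is a minimal sketch of configuring a recurring maintenance window with the gcloud CLI. The start time, end time, and recurrence rule are illustrative assumptions, and the sketch assumes that your gcloud version accepts the --location flag; choose a window for when you don't expect ProvisioningRequest workloads to run:

gcloud container clusters update CLUSTER_NAME \
    --location=LOCATION \
    --maintenance-window-start=2024-01-06T22:00:00Z \
    --maintenance-window-end=2024-01-07T02:00:00Z \
    --maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=SA,SU"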

Limitations

  • Inter-pod anti-affinity is not supported. The cluster autoscaler doesn't consider inter-pod anti-affinity rules during node provisioning, which might lead to unschedulable workloads. This can happen when nodes for two or more ProvisioningRequest objects are provisioned in the same node pool.
  • Only GPU nodes are supported.
  • Reservations aren't supported with ProvisioningRequest nodes. You have to specify the --reservation-affinity=none flag when creating the node pool. ProvisioningRequest requires and supports only the ANY location policy for cluster autoscaling.
  • A single ProvisioningRequest can create up to 1000 VMs, which is the maximum number of nodes per zone for a single node pool.
  • GKE uses the Compute Engine ACTIVE_RESIZE_REQUESTS quota to control the number of ProvisioningRequests pending in a queue. By default, this quota has a limit of 100 per Google Cloud project. If you attempt to create more ProvisioningRequests than this quota allows, the new request fails.
  • Node pools using ProvisioningRequest are sensitive to disruption because the nodes are provisioned together. To learn more, see Configure disruption settings for node pools with workloads using ProvisioningRequest.
  • You might see additional short-lived VMs listed in the Google Cloud console. This behavior is intended because Compute Engine might create and promptly remove VMs until the capacity to provision all of the required machines is available.
  • The ProvisioningRequest integration supports only one PodSet. If you want to mix different Pod templates, use the one with the most resources requested. Mixing different machine types, such as VMs with different GPU types, is not supported.

What's next