This page shows you how to optimize GPU obtainability for large-scale batch and AI workloads by using Dynamic Workload Scheduler and ProvisioningRequest.
Use Dynamic Workload Scheduler for large-scale batch and AI workloads that can run during off-peak hours with defined GPU capacity management conditions. This can include workloads such as deep learning model training or a simulation where a large number of GPUs is created at the same time.
To run GPU workloads in Google Kubernetes Engine (GKE) without the Dynamic Workload Scheduler, see Run GPUs in GKE Standard node pools.
When to use Dynamic Workload Scheduler
We recommend that you use Dynamic Workload Scheduler if your workloads meet all of the following conditions:
- You request GPUs to run your workloads.
- You have limited or no reserved GPU capacity and you want to improve obtainability of GPU resources.
- Your workload is time-flexible and your use case can afford to wait to get all the requested capacity, for example when GKE allocates the GPU resources outside of the busiest hours.
- Your workload requires multiple nodes and can't start running until all GPU nodes are provisioned and ready at the same time (for example, distributed machine learning training).
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- Ensure that you have one of the following:
  - An existing Standard cluster in version 1.28.3-gke.1098000 or later.
  - An existing Autopilot cluster in version 1.30.3-gke.1451000 or later.
- Ensure that you configure disruption settings for node pools with workloads using Dynamic Workload Scheduler to prevent workload disruption. When using Dynamic Workload Scheduler, we recommend that you disable node auto-upgrades permanently.
- Ensure that you're familiar with limitations of Dynamic Workload Scheduler.
- When using a Standard cluster, ensure that you maintain at least one node pool without Dynamic Workload Scheduler enabled for the cluster to function correctly.
Use node pools with Dynamic Workload Scheduler
This section applies to Standard clusters only.
You can use any of the following three methods to designate that Dynamic Workload Scheduler can work with specific node pools in your cluster:
- Create a node pool.
- Update an existing node pool.
- Configure node auto-provisioning to create node pools with Dynamic Workload Scheduler enabled.
Create a node pool
Create a node pool with Dynamic Workload Scheduler enabled using the gcloud CLI:
gcloud container node-pools create NODEPOOL_NAME \
--cluster=CLUSTER_NAME \
--location=LOCATION \
--enable-queued-provisioning \
--accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
--machine-type=MACHINE_TYPE \
--enable-autoscaling \
--num-nodes=0 \
--total-max-nodes TOTAL_MAX_NODES \
--location-policy=ANY \
--reservation-affinity=none \
--no-enable-autorepair
Replace the following:
- NODEPOOL_NAME: The name you choose for the node pool.
- CLUSTER_NAME: The name of the cluster.
- LOCATION: The cluster's Compute Engine region, such as us-central1.
- GPU_TYPE: The GPU type.
- AMOUNT: The number of GPUs to attach to nodes in the node pool.
- DRIVER_VERSION: The NVIDIA driver version to install. Can be one of the following:
  - default: Install the default driver version for your GKE version.
  - latest: Install the latest available driver version for your GKE version. Available only for nodes that use Container-Optimized OS.
- TOTAL_MAX_NODES: The maximum number of nodes to automatically scale for the entire node pool.
- MACHINE_TYPE: The Compute Engine machine type for your nodes.
Best practice: Use an accelerator-optimized machine type to improve performance and efficiency for AI/ML workloads.
Optionally, you can use the following flags:
- --no-enable-autoupgrade: Recommended. Disables node auto-upgrades. Supported only in GKE clusters not enrolled in a release channel. To learn more, see Disable node auto-upgrades for an existing node pool.
- --node-locations=COMPUTE_ZONES: The comma-separated list of one or more zones where GKE creates the GPU nodes. The zones must be in the same region as the cluster. Choose zones that have available GPUs.
- --enable-gvnic: This flag enables gVNIC on the GPU node pools to increase network traffic speed.
This command creates a node pool with the following configuration:
- GKE enables queued provisioning and cluster autoscaling.
- The node pool initially has zero nodes.
- The --enable-queued-provisioning flag enables Dynamic Workload Scheduler and adds the cloud.google.com/gke-queued taint to the node pool.
- The --no-enable-autorepair and --no-enable-autoupgrade flags disable automatic repair and upgrade of nodes, which could disrupt workloads running on repaired or upgraded nodes. You can only disable node auto-upgrade on clusters that are not enrolled in a release channel.
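As an illustration, the following sketch wraps the create command in a shell function with example values filled in. The node pool name, cluster name, region, GPU type, machine type, and node limit are assumptions chosen for the example, not recommendations. Setting DRY_RUN=1 prints the command instead of running it, so you can review it first.

```shell
# Example values only (assumptions for illustration); replace with your own.
NODEPOOL_NAME="dws-pool"
CLUSTER_NAME="my-cluster"
LOCATION="us-central1"
GPU_TYPE="nvidia-l4"
MACHINE_TYPE="g2-standard-12"
TOTAL_MAX_NODES="4"

create_dws_nodepool() {
  # With DRY_RUN=1 the command is prefixed with "echo" and only printed.
  ${DRY_RUN:+echo} gcloud container node-pools create "$NODEPOOL_NAME" \
    --cluster="$CLUSTER_NAME" \
    --location="$LOCATION" \
    --enable-queued-provisioning \
    --accelerator "type=$GPU_TYPE,count=1,gpu-driver-version=default" \
    --machine-type="$MACHINE_TYPE" \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes "$TOTAL_MAX_NODES" \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair
}
```

To preview the command, run DRY_RUN=1 create_dws_nodepool. To create the node pool, run create_dws_nodepool without DRY_RUN set.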
Update existing node pool and enable Dynamic Workload Scheduler
Enable Dynamic Workload Scheduler for an existing node pool. Review the prerequisites to configure the node pool correctly.
Prerequisites
- Ensure that you created the node pool with the --reservation-affinity=none flag. This flag is required for enabling Dynamic Workload Scheduler later, because you can't change the reservation affinity after node pool creation.
- Ensure that you maintain at least one node pool without Dynamic Workload Scheduler handling enabled for the cluster to function correctly.
- Ensure that the node pool is empty. You can resize the node pool so that it has zero nodes.
- Ensure that autoscaling is enabled and correctly configured.
- Ensure that auto-repairs are disabled.
Enable Dynamic Workload Scheduler for existing node pool
You can enable Dynamic Workload Scheduler for an existing node pool using the gcloud CLI:
gcloud container node-pools update NODEPOOL_NAME \
--cluster=CLUSTER_NAME \
--location=LOCATION \
--enable-queued-provisioning
Replace the following:
- NODEPOOL_NAME: The name of the chosen node pool.
- CLUSTER_NAME: The name of the cluster.
- LOCATION: The cluster's Compute Engine region, such as us-central1.
This node pool update command results in the following configuration change:
- The --enable-queued-provisioning flag enables Dynamic Workload Scheduler and adds the cloud.google.com/gke-queued taint to the node pool.
Optionally, when using Dynamic Workload Scheduler, you can also update the following node pool settings:
- Disable node auto-upgrades. Node pool upgrades are not supported by Dynamic Workload Scheduler.
- Enable Google Virtual NIC (gVNIC) on the GPU node pools. gVNIC increases network traffic speed for GPU nodes.
Enable node auto-provisioning to create node pools for Dynamic Workload Scheduler
You can use node auto-provisioning to manage node pools for Dynamic Workload Scheduler for clusters running version 1.29.2-gke.1553000 or later. When you enable node auto-provisioning and enable Dynamic Workload Scheduler, GKE creates node pools with the required resources for the associated workload.
To enable node auto-provisioning, consider the following settings and complete the steps in Configure GPU limits:
- Specify the required resources for Dynamic Workload Scheduler when enabling the feature. To list the available resourceTypes, run gcloud compute accelerator-types list.
- Use the --no-enable-autoprovisioning-autoupgrade and --no-enable-autoprovisioning-autorepair flags to disable node auto-upgrades and node auto-repair. To learn more, see Configure disruption settings for node pools with workloads using Dynamic Workload Scheduler.
- Let GKE automatically install GPU drivers in auto-provisioned GPU nodes. To learn more, see Installing drivers using node auto-provisioning with GPUs.
Run your batch and AI workloads with Dynamic Workload Scheduler
To run batch workloads with Dynamic Workload Scheduler, use any of the following configurations:
- Dynamic Workload Scheduler for Jobs with Kueue: You can use Dynamic Workload Scheduler with Kueue to automate the lifecycle of the Dynamic Workload Scheduler requests. Kueue implements Job queueing and observes the status of the Dynamic Workload Scheduler. Kueue decides when Jobs should wait and when they should start, based on quotas and a hierarchy for sharing resources fairly among teams.
- Dynamic Workload Scheduler for Jobs without Kueue: You can use Dynamic Workload Scheduler without Kueue when you use your own internal batch scheduling tools or platform. You manually observe and cancel the Dynamic Workload Scheduler requests.
Use Kueue to run your batch and AI workloads with Dynamic Workload Scheduler.
Dynamic Workload Scheduler for Jobs with Kueue
The following sections show you how to configure the Dynamic Workload Scheduler for Jobs with Kueue. You can configure the following common node pool setups:
- Dynamic Workload Scheduler node pool setup.
- Reservation and Dynamic Workload Scheduler node pool setup.
This section uses the samples in the dws-examples directory from the ai-on-gke repository. We have published the samples in the dws-examples directory under the Apache2 license.
You need to have administrator permissions to install Kueue. To gain them, make sure you are granted the IAM role roles/container.admin. To find out more about GKE IAM roles, see the Create IAM allow policies guide.
Prepare your environment
In Cloud Shell, run the following command:
git clone https://github.com/GoogleCloudPlatform/ai-on-gke
cd ai-on-gke/tutorials-and-examples/workflow-orchestration/dws-examples
Install the latest Kueue version in your cluster:
VERSION=v0.7.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml
If you use a Kueue version earlier than 0.7.0, change the Kueue feature gate configuration by setting the ProvisioningACC feature gate to true. See Kueue's feature gates for a more detailed explanation and default gate values. To learn more about Kueue installation, see Installation.
Create the Kueue resources for the Dynamic Workload Scheduler node pool only setup
With the following manifest, you create a cluster-level queue named dws-cluster-queue and a namespaced LocalQueue named dws-local-queue. Jobs that refer to the dws-local-queue queue in this namespace use Dynamic Workload Scheduler to get the GPU resources.
This cluster's queue has high quota limits and only the Dynamic Workload Scheduler integration is enabled. To learn more about Kueue APIs and how to set up limits, see Kueue concepts.
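The manifest itself isn't reproduced on this page. As a rough sketch only, the following shell function prints Kueue resources with the names described in this section. The quota numbers, the covered resources, and the default namespace are assumptions made for illustration; the dws-queues.yaml file in the repository is the authoritative version and may differ.

```shell
# Prints a sketch of the Kueue resources to stdout. Quota values, covered
# resources, and the "default" namespace are assumptions for illustration.
print_dws_queues() {
  cat <<'EOF'
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io
  managedResources:
  - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: dws-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 10000
      - name: memory
        nominalQuota: 10000Gi
      - name: nvidia.com/gpu
        nominalQuota: 10000
  admissionChecks:
  - dws-prov
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: dws-local-queue
spec:
  clusterQueue: dws-cluster-queue
EOF
}
```

Run print_dws_queues to review the output. The AdmissionCheck ties the ClusterQueue to the ProvisioningRequestConfig, which is what routes admitted Jobs through the queued-provisioning.gke.io provisioning class.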
Deploy the LocalQueue:
kubectl create -f ./dws-queues.yaml
The output is similar to the following:
resourceflavor.kueue.x-k8s.io/default-flavor created
admissioncheck.kueue.x-k8s.io/dws-prov created
provisioningrequestconfig.kueue.x-k8s.io/dws-config created
clusterqueue.kueue.x-k8s.io/dws-cluster-queue created
localqueue.kueue.x-k8s.io/dws-local-queue created
If you want to run Jobs that use Dynamic Workload Scheduler in other namespaces,
you can create additional LocalQueues
using the preceding template.
Run your Job
In the following manifest, the sample Job uses Dynamic Workload Scheduler:
This manifest includes the following fields that are relevant for the Dynamic Workload Scheduler configuration:
- The kueue.x-k8s.io/queue-name: dws-local-queue label tells GKE that Kueue is responsible for orchestrating that Job. This label also defines the queue where the Job is queued.
- The flag suspend: true tells GKE to create the Job resource but to not schedule the Pods yet. Kueue changes that flag to false when the nodes are ready for the Job execution.
- nodeSelector tells GKE to schedule the Job only on the specified node pool. The value should match NODEPOOL_NAME, the name of the node pool with queued provisioning enabled.
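To make the three fields concrete, the following sketch prints a Job manifest that sets them. The Job name matches the output shown in this section, but the container image, command, and resource sizes are assumptions for illustration; the job.yaml file in the repository may differ.

```shell
# Prints a sample Job manifest for the node pool named in $1.
# The image, command, and resource sizes are illustrative assumptions.
print_sample_job() {
  nodepool_name="$1"
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue
spec:
  parallelism: 1
  completions: 1
  suspend: true
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: ${nodepool_name}
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: sample
        image: busybox
        command: ["sleep", "120"]
        resources:
          requests:
            cpu: "100m"
            memory: "100Mi"
            nvidia.com/gpu: 1
          limits:
            cpu: "100m"
            memory: "100Mi"
            nvidia.com/gpu: 1
      restartPolicy: Never
EOF
}
```

For example, print_sample_job dws-pool prints the manifest with the node selector set to a hypothetical node pool named dws-pool.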
Run your Job:
kubectl create -f ./job.yaml
The output is similar to the following:
job.batch/sample-job created
Check the status of your Job:
kubectl describe job sample-job
The output is similar to the following:
Events:
  Type    Reason            Age    From                        Message
  ----    ------            ----   ----                        -------
  Normal  Suspended         5m17s  job-controller              Job suspended
  Normal  CreatedWorkload   5m17s  batch/job-kueue-controller  Created Workload: default/job-sample-job-7f173
  Normal  Started           3m27s  batch/job-kueue-controller  Admitted by clusterQueue dws-cluster-queue
  Normal  SuccessfulCreate  3m27s  job-controller              Created pod: sample-job-9qsfd
  Normal  Resumed           3m27s  job-controller              Job resumed
  Normal  Completed         12s    job-controller              Job completed
The Dynamic Workload Scheduler with Kueue integration also supports other workload types available in the open source ecosystem, like the following:
- RayJob
- JobSet v0.5.2 or later
- Kubeflow MPIJob, TFJob, PyTorchJob.
- Kubernetes Pods that are frequently used by workflow orchestrators
- Flux mini cluster
To learn more about this support, see Kueue's batch user.
Create the Kueue resources for Reservation and Dynamic Workload Scheduler node pool setup
With the following manifest, you create two ResourceFlavors tied to two different node pools: reservation-nodepool and dws-nodepool. The names of these node pools are examples only. Modify these names according to your node pool configuration.
Additionally, with the ClusterQueue configuration, incoming Jobs try to use reservation-nodepool, and if there is no capacity, these Jobs use Dynamic Workload Scheduler to get the GPU resources.
This cluster's queue has high quota limits and only the Dynamic Workload Scheduler integration is enabled. To learn more about Kueue APIs and how to set up limits, see Kueue concepts.
Deploy the manifest using the following command:
kubectl create -f ./dws_and_reservation.yaml
The output is similar to the following:
resourceflavor.kueue.x-k8s.io/reservation created
resourceflavor.kueue.x-k8s.io/dws created
clusterqueue.kueue.x-k8s.io/cluster-queue created
localqueue.kueue.x-k8s.io/user-queue created
admissioncheck.kueue.x-k8s.io/dws-prov created
provisioningrequestconfig.kueue.x-k8s.io/dws-config created
Run your Job
Unlike the preceding setup, this manifest does not include the nodeSelector field, because Kueue fills it in depending on the free capacity in the ClusterQueue.
Run your Job:
kubectl create -f ./job-without-node-selector.yaml
The output is similar to the following:
job.batch/sample-job-v8xwm created
To find out which node pool your Job uses, you need to find out what ResourceFlavor your Job uses.
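One possible way to look this up is to read the flavor assignment from the Kueue Workload object that was created for the Job. The sketch below assumes that the assignment is exposed under status.admission.podSetAssignments of the Workload, and that you take the Workload name from the CreatedWorkload event of the Job; the helper name is hypothetical.

```shell
# Prints the ResourceFlavor assignments recorded on a Kueue Workload.
# $1 = Workload name, $2 = namespace (assumed default: "default").
workload_flavors() {
  kubectl get workload "$1" --namespace "${2:-default}" \
    -o jsonpath='{.status.admission.podSetAssignments[*].flavors}'
}
```

To list the Workload objects in a namespace first, you can run kubectl get workloads --namespace default. The flavor name that the helper reports corresponds to one of the ResourceFlavors created earlier, which in turn is tied to a node pool.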
Troubleshooting
To learn more about troubleshooting Kueue, see Troubleshooting Provisioning Request in Kueue.
Dynamic Workload Scheduler for Jobs without Kueue
Define a ProvisioningRequest object
Create a request through the ProvisioningRequest API for each Job. Dynamic Workload Scheduler doesn't start the Pods; it only provisions the nodes.
Create the following provisioning-request.yaml manifest:
Standard
apiVersion: v1
kind: PodTemplate
metadata:
  name: POD_TEMPLATE_NAME
  namespace: NAMESPACE_NAME
  labels:
    cloud.google.com/apply-warden-policies: "true"
template:
  spec:
    nodeSelector:
      cloud.google.com/gke-nodepool: NODEPOOL_NAME
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
    containers:
    - name: pi
      image: perl
      command: ["/bin/sh"]
      resources:
        limits:
          cpu: "700m"
          nvidia.com/gpu: 1
        requests:
          cpu: "700m"
          nvidia.com/gpu: 1
    restartPolicy: Never
---
apiVersion: autoscaling.x-k8s.io/API_VERSION
kind: ProvisioningRequest
metadata:
  name: PROVISIONING_REQUEST_NAME
  namespace: NAMESPACE_NAME
spec:
  provisioningClassName: queued-provisioning.gke.io
  parameters:
    maxRunDurationSeconds: "MAX_RUN_DURATION_SECONDS"
  podSets:
  - count: COUNT
    podTemplateRef:
      name: POD_TEMPLATE_NAME
Replace the following:
- API_VERSION: The version of the API, either v1 or v1beta1. For GKE version 1.31.1-gke.1678000 and later, we recommend using v1 for stability and access to the latest features.
- NAMESPACE_NAME: The name of your Kubernetes namespace. The namespace must be the same as the namespace of the Pods.
- PROVISIONING_REQUEST_NAME: The name of the ProvisioningRequest. You'll refer to this name in the Pod annotation.
- MAX_RUN_DURATION_SECONDS: Optionally, the maximum runtime of a node in seconds, up to the default of seven days. To learn more, see How Dynamic Workload Scheduler works. You can't change this value after creation of the request. This field is available in GKE version 1.28.5-gke.1355000 or later.
- COUNT: Number of Pods requested. The nodes are scheduled atomically in one zone.
- POD_TEMPLATE_NAME: The name of the PodTemplate.
- NODEPOOL_NAME: The name you choose for the node pool. Remove if you want to use an auto-provisioned node pool.
GKE might apply validations and mutations to Pods during their creation. The cloud.google.com/apply-warden-policies label allows GKE to apply the same validations and mutations to PodTemplate objects. This label is necessary for GKE to calculate node resource requirements for your Pods.
Node auto-provisioning
apiVersion: v1
kind: PodTemplate
metadata:
  name: POD_TEMPLATE_NAME
  namespace: NAMESPACE_NAME
  labels:
    cloud.google.com/apply-warden-policies: "true"
template:
  spec:
    nodeSelector:
      cloud.google.com/gke-accelerator: GPU_TYPE
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
    containers:
    - name: pi
      image: perl
      command: ["/bin/sh"]
      resources:
        limits:
          cpu: "700m"
          nvidia.com/gpu: 1
        requests:
          cpu: "700m"
          nvidia.com/gpu: 1
    restartPolicy: Never
---
apiVersion: autoscaling.x-k8s.io/API_VERSION
kind: ProvisioningRequest
metadata:
  name: PROVISIONING_REQUEST_NAME
  namespace: NAMESPACE_NAME
spec:
  provisioningClassName: queued-provisioning.gke.io
  parameters:
    maxRunDurationSeconds: "MAX_RUN_DURATION_SECONDS"
  podSets:
  - count: COUNT
    podTemplateRef:
      name: POD_TEMPLATE_NAME
Replace the following:
- API_VERSION: The version of the API, either v1 or v1beta1. For GKE version 1.31.1-gke.1678000 and later, we recommend using v1 for stability and access to the latest features.
- NAMESPACE_NAME: The name of your Kubernetes namespace. The namespace must be the same as the namespace of the Pods.
- PROVISIONING_REQUEST_NAME: The name of the ProvisioningRequest. You'll refer to this name in the Pod annotation.
- MAX_RUN_DURATION_SECONDS: Optionally, the maximum runtime of a node in seconds, up to the default of seven days. To learn more, see How Dynamic Workload Scheduler works. You can't change this value after creation of the request. This field is available in GKE version 1.28.5-gke.1355000 or later.
- COUNT: Number of Pods requested. The nodes are scheduled atomically in one zone.
- POD_TEMPLATE_NAME: The name of the PodTemplate.
- GPU_TYPE: The type of GPU hardware.
GKE might apply validations and mutations to Pods during their creation. The cloud.google.com/apply-warden-policies label allows GKE to apply the same validations and mutations to PodTemplate objects. This label is necessary for GKE to calculate node resource requirements for your Pods.
Apply the manifest:
kubectl apply -f provisioning-request.yaml
Configure the Pods
This section uses Kubernetes Jobs to configure the Pods. However, you can also use a Kubernetes JobSet or any other framework like Kubeflow, Ray, or custom controllers. In the Job spec, link the Pods to the ProvisioningRequest using the following annotations:
apiVersion: batch/v1
kind: Job
spec:
  template:
    metadata:
      annotations:
        autoscaling.x-k8s.io/consume-provisioning-request: PROVISIONING_REQUEST_NAME
        autoscaling.x-k8s.io/provisioning-class-name: "queued-provisioning.gke.io"
    spec:
      ...
Prior to GKE version 1.30.3-gke.1854000, you must use the following legacy annotations:
annotations:
  cluster-autoscaler.kubernetes.io/consume-provisioning-request: PROVISIONING_REQUEST_NAME
  cluster-autoscaler.kubernetes.io/provisioning-class-name: "queued-provisioning.gke.io"
Note that starting with GKE version 1.31.1-gke.1678000, the cluster-autoscaler.kubernetes.io/consume-provisioning-request and cluster-autoscaler.kubernetes.io/provisioning-class-name annotations are deprecated.
The Pod annotation key consume-provisioning-request defines which ProvisioningRequest to consume. GKE uses the consume-provisioning-request and provisioning-class-name annotations to do the following:
- To schedule the Pods only in the nodes provisioned by Dynamic Workload Scheduler.
- To avoid double counting of resource requests between Pods and Dynamic Workload Scheduler in the cluster autoscaler.
- To inject the safe-to-evict: false annotation, to prevent the cluster autoscaler from moving Pods between nodes and interrupting batch computations. You can change this behavior by specifying safe-to-evict: true in the Pod annotations.
Observe the status of Dynamic Workload Scheduler
The status of a Dynamic Workload Scheduler request defines whether a Pod can be scheduled. You can use Kubernetes watches to observe changes efficiently, or other tooling that you already use for tracking statuses of Kubernetes objects. The following table describes the possible statuses of a Dynamic Workload Scheduler request and each possible outcome:
Dynamic Workload Scheduler status | Description | Possible outcome |
---|---|---|
Pending | The request was not seen and processed yet. | After processing, the request transitions to Accepted or Failed state. |
Accepted=true | The request is accepted and is waiting for resources to be available. | The request should transition to Provisioned state if resources were found and nodes were provisioned, or to Failed state if that was not possible. |
Provisioned=true | The nodes are ready. | You have 10 minutes to start the Pods to consume provisioned resources. After this time, the cluster autoscaler considers the nodes as not needed and removes them. |
Failed=true | The nodes can't be provisioned due to errors. Failed=true is a terminal state. | Troubleshoot the condition based on the information in the Reason and Message fields of the condition. Create and retry a new Dynamic Workload Scheduler request. |
Provisioned=false | The nodes haven't been provisioned yet. | The outcome depends on the Reason and Message fields of the condition. |
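The table can be turned into a small decision helper. The following is a minimal sketch, not a complete controller: the function names and the action labels are hypothetical, and it assumes that the conditions are exposed under status.conditions of the ProvisioningRequest with status values reported as True or False.

```shell
# Reads the status ("True", "False", or empty) of one condition of a
# ProvisioningRequest. $1 = request name, $2 = namespace, $3 = condition type.
provreq_condition() {
  kubectl get provreq "$1" --namespace "$2" \
    -o jsonpath="{.status.conditions[?(@.type==\"$3\")].status}"
}

# Maps a condition type ($1) and its status ($2) to a suggested action.
next_action() {
  case "$1=$2" in
    Provisioned=True) echo "start-pods" ;;          # nodes are ready
    Failed=True)      echo "create-new-request" ;;  # terminal state
    *)                echo "wait" ;;                # pending or accepted
  esac
}
```

For example, next_action Provisioned "$(provreq_condition my-request default Provisioned)" prints start-pods once the nodes for a hypothetical request named my-request are ready.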
Start the Pods
When the Dynamic Workload Scheduler request reaches the Provisioned=true status, you can run your Job to start the Pods. This avoids proliferation of unschedulable Pods for pending or failed requests, which can impact kube-scheduler and cluster autoscaler performance.
Alternatively, if you don't care about having unschedulable Pods, you can create Pods in parallel with Dynamic Workload Scheduler.
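If you prefer to start the Pods only after provisioning succeeds, one way to sketch that sequence is with kubectl wait. The function name, the argument order, and the 24h default timeout are assumptions for illustration; choose a timeout that matches how long your workload can afford to wait.

```shell
# Waits for the Provisioned condition, then creates the Job.
# $1 = ProvisioningRequest name, $2 = namespace, $3 = Job manifest path,
# $4 = timeout for kubectl wait (assumed default: 24h).
start_job_when_provisioned() {
  kubectl wait "provreq/$1" --namespace "$2" \
    --for=condition=Provisioned --timeout="${4:-24h}" &&
    kubectl create -f "$3" --namespace "$2"
}
```

For example, start_job_when_provisioned my-request default ./job.yaml creates the Job only if the wait succeeds. Creating the Job right after the condition becomes true helps you stay within the 10-minute window before the cluster autoscaler removes unused nodes.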
Cancel Dynamic Workload Scheduler request
To cancel the request before it's provisioned, you can delete the ProvisioningRequest:
kubectl delete provreq PROVISIONING_REQUEST_NAME -n NAMESPACE
In most cases, deleting the ProvisioningRequest stops nodes from being created. However, depending on timing, for example if nodes were already being provisioned, the nodes might still be created. In these cases, the cluster autoscaler removes the nodes after 10 minutes if no Pods are created.
How Dynamic Workload Scheduler works
With the ProvisioningRequest API, Dynamic Workload Scheduler works as follows:
- You tell GKE that your workload can wait, for an indeterminate amount of time, until all the required nodes are ready to use at once.
- The cluster autoscaler accepts your request and calculates the number of necessary nodes, treating them as a single unit.
- The request waits until all needed resources are available in a single zone. For clusters running 1.29.1-gke.1708000 and later, this zone is chosen using information about available capacity to ensure lower wait times. For clusters running earlier versions, the zone was chosen without this information, which can result in queueing in zones where the wait times are much longer.
- The cluster autoscaler provisions the necessary nodes when available, all at once.
- All Pods of the workload are able to run together on newly provisioned nodes.
- The provisioned nodes are limited to seven days of runtime, or less if you set the maxRunDurationSeconds parameter to indicate that the workloads need less time to run. To learn more, see Limit the runtime of a VM. This capability is available with GKE version 1.28.5-gke.1355000 or later. After this time, the nodes and the Pods running on them are preempted. If the Pods finish sooner and the nodes aren't utilized, the cluster autoscaler removes them according to the autoscaling profile.
- The nodes aren't reused between Dynamic Workload Scheduler requests. Each ProvisioningRequest orders creation of new nodes with a new seven-day runtime.
GKE measures runtime on a node level. The time available for running Pods might be slightly smaller due to delays during startup. Pod retries share this runtime, which means that there is less time available for Pods after retry. GKE counts the runtime for each Dynamic Workload Scheduler request separately.
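Because maxRunDurationSeconds is expressed in seconds and capped at seven days (7 × 24 × 60 × 60 = 604800 seconds), a small helper can make the conversion less error-prone. The following sketch uses a hypothetical function name and accepts whole hours only.

```shell
MAX_RUN_SECONDS=$((7 * 24 * 60 * 60))   # seven days = 604800 seconds

# Converts a runtime in whole hours ($1) to a maxRunDurationSeconds value.
# Fails if $1 is not a positive whole number or exceeds seven days.
hours_to_max_run_duration() {
  case "$1" in
    ''|0*|*[!0-9]*)
      echo "hours must be a positive whole number" >&2
      return 1 ;;
  esac
  seconds=$(( $1 * 3600 ))
  if [ "$seconds" -gt "$MAX_RUN_SECONDS" ]; then
    echo "runtime must not exceed 7 days" >&2
    return 1
  fi
  echo "$seconds"
}
```

For example, hours_to_max_run_duration 24 prints 86400. Because Pod retries share the node runtime, it can be worth leaving some headroom for startup delays when you pick the value.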
Quota
All VMs provisioned by the Dynamic Workload Scheduler requests use preemptible quotas.
The number of ProvisioningRequests that are in Accepted state is limited by a dedicated quota. You configure the quota for each project, one quota configuration per region.
Check quota in the Google Cloud console
To check the name of the quota limit and current usage in the Google Cloud console, follow these steps:
Go to the Quotas page in the Google Cloud console:
In the Filter box, select the Metric property, enter active_resize_requests, and press Enter.
The default value is 100. To increase the quota, follow the steps listed in Request a higher quota limit guide.
Check if Dynamic Workload Scheduler is limited by quota
If your Dynamic Workload Scheduler request is taking longer than expected to be fulfilled, check that the request isn't limited by quota. You might need to request more quota.
For clusters running version 1.29.2-gke.1181000 or later, check whether specific quota limitations are preventing your request from being fulfilled:
kubectl describe provreq PROVISIONING_REQUEST_NAME \
--namespace NAMESPACE
The output is similar to the following:
…
Last Transition Time: 2024-01-03T13:56:08Z
Message: Quota 'NVIDIA_P4_GPUS' exceeded. Limit: 1.0 in region europe-west4.
Observed Generation: 1
Reason: QuotaExceeded
Status: False
Type: Provisioned
…
In this example, GKE can't deploy nodes because there isn't enough quota in the europe-west4 region.
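If you check many requests, you might want to extract just the quota message. The following sketch filters the text printed by kubectl describe; the function name is hypothetical, and the filter assumes that the Message line precedes the Reason line of the same condition, as in the sample output.

```shell
# Reads "kubectl describe provreq" output on stdin and prints the Message of
# any condition whose Reason is QuotaExceeded. Exits non-zero if none is found.
quota_message() {
  awk '
    /^[ \t]*Message:/ { sub(/^[ \t]*Message:[ \t]*/, ""); msg = $0 }
    /^[ \t]*Reason:[ \t]*QuotaExceeded/ { print msg; found = 1 }
    END { if (!found) exit 1 }
  '
}
```

For example, kubectl describe provreq PROVISIONING_REQUEST_NAME --namespace NAMESPACE | quota_message prints the quota message when the request is limited by quota, and prints nothing otherwise.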
Manage disruption in workloads using Dynamic Workload Scheduler
Workloads requiring the availability of all nodes, or most nodes, in a node pool are sensitive to evictions. Automatic repair or upgrade of a node provisioned using the ProvisioningRequest API isn't supported because these operations evict all workloads running on that node and make the workloads unschedulable.
Best practices to minimize workload disruption
To minimize disruption to running workloads using Dynamic Workload Scheduler, perform the following tasks:
- Depending on your cluster's release channel enrollment, use the following best practices to prevent node auto-upgrades from disrupting your workloads:
  - If your cluster isn't enrolled in a release channel, disable node auto-upgrades.
  - If your cluster is enrolled in a release channel, use maintenance windows and exclusions to prevent GKE from automatically upgrading your nodes while your workload is running.
- Disable node auto-repair.
- Use maintenance windows and exclusions to minimize disruption for running workloads, while ensuring that GKE still has time available to do automatic maintenance. Be sure to specify that time for when there are no running workloads.
- To ensure that your node pool remains up-to-date, manually upgrade your node pool when there are no active Dynamic Workload Scheduler requests and the node pool is empty.
Limitations
- Inter-pod anti-affinity is not supported. Cluster autoscaler doesn't consider inter-pod anti-affinity rules during node provisioning which might lead to unschedulable workloads. This may happen when nodes for two or more Dynamic Workload Scheduler objects were provisioned in the same node pool.
- Only GPU nodes are supported.
- Reservations aren't supported with Dynamic Workload Scheduler. You have to specify --reservation-affinity=none when creating the node pool. Dynamic Workload Scheduler requires and supports only the ANY location policy for cluster autoscaling.
- A single Dynamic Workload Scheduler request can create up to 1000 VMs, which is the maximum number of nodes per zone for a single node pool.
- GKE uses the Compute Engine ACTIVE_RESIZE_REQUESTS quota to control the number of Dynamic Workload Scheduler requests pending in a queue. By default, this quota has a limit of 100 on a Google Cloud project level. If you attempt to create a Dynamic Workload Scheduler request that exceeds this quota, the new request fails.
- Node pools using Dynamic Workload Scheduler are sensitive to disruption as the nodes are provisioned together. To learn more, see configure disruption settings for node pools with workloads using Dynamic Workload Scheduler.
- You might see additional short-lived VMs listed in the Google Cloud console. This behavior is intended because Compute Engine might create and promptly remove VMs until the capacity to provision all of the required machines is available.
- The Dynamic Workload Scheduler integration supports only one PodSet. If you want to mix different Pod templates, use the one with the most resources requested. Mixing different machine types, such as VMs with different GPU types, is not supported.
What's next
- Learn more about GPUs in GKE.
- Learn how to Deploy GPU workloads in Autopilot.