This page shows you how to create your own AI-optimized Google Kubernetes Engine (GKE) cluster that uses A4X, A4, and A3 Ultra virtual machines (VMs) to support your AI and ML workloads.
The A4X, A4, and A3 Ultra machine series are designed to enable you to run large-scale AI/ML clusters with features such as targeted workload placement, advanced cluster maintenance controls, and topology-aware scheduling. For more information, see Cluster management overview.
GKE provides a single platform surface to run a diverse set of workloads for your organization's needs. This includes high performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services. GKE reduces the operational burden of managing multiple platforms.
Choose how to create an AI-optimized GKE cluster
The following options for cluster creation each provide varying degrees of ease and flexibility in cluster configuration and workload scheduling:
- Create clusters with the default configuration for compute, storage, and networking resources, and with GPUDirect RDMA over Converged Ethernet (RoCE) enabled:
  - Use Cluster Toolkit to quickly create production-ready GKE clusters.
  - Use Accelerated Processing Kit (XPK) to quickly create GKE clusters for proofs of concept and testing.
 
- Alternatively, you can create your GKE cluster manually for precise customization or expansion of existing production GKE environments. To create an AI-optimized GKE cluster manually, see one of the following pages: 
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the `gcloud components update` command. Earlier gcloud CLI versions might not support running the commands in this document.
- Ensure that you have the required permissions for creating and managing the GKE cluster and associated service accounts (a sample grant command follows this list):
- Kubernetes Engine Admin (roles/container.admin)
- Compute Admin (roles/compute.admin)
- Storage Admin (roles/storage.admin)
- Project IAM Admin (roles/resourcemanager.projectIamAdmin)
- Service Account Admin (roles/iam.serviceAccountAdmin)
- Service Account User (roles/iam.serviceAccountUser)
- Service Usage Consumer (roles/serviceusage.serviceUsageConsumer)
 
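If you're missing any of these roles, a project administrator can grant them with the Google Cloud CLI. The following is a minimal sketch that grants one of the listed roles; USER_EMAIL is a placeholder, and you repeat the command for each role that you need:

```
# Grant the Kubernetes Engine Admin role to a user account (USER_EMAIL is a
# placeholder). Repeat for the other roles in the preceding list, for example
# roles/compute.admin or roles/iam.serviceAccountAdmin.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/container.admin"
```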
Choose a consumption option and obtain capacity
- Choose a consumption option. Make your choice based on how you want to get and use GPU resources. To learn more, see Choose a consumption option. For GKE, consider the following additional information when choosing a consumption option:
  - A4X VMs can't be provisioned by flex-start.
  - For more information about flex-start (Preview) and GKE, see About GPU obtainability with flex-start.
  - Flex-start uses best-effort compact placement. To examine your topology, see View the physical topology of nodes in your GKE cluster. A sample kubectl command follows this list.
  - You can only get topology information when using Spot VMs if you configure compact placement.
- Obtain capacity. The process to obtain capacity differs for each consumption option. To learn about the process for your chosen consumption option, see Capacity overview.
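To examine placement after nodes are provisioned, you can read the topology labels that GKE attaches to nodes. The following is a minimal sketch; the label keys shown are assumptions that can vary by GKE version, so confirm them in View the physical topology of nodes in your GKE cluster:

```
# Show each node together with its physical topology labels.
# The label keys below are assumptions; verify them for your GKE version.
kubectl get nodes \
    -L cloud.google.com/gce-topology-block \
    -L cloud.google.com/gce-topology-subblock \
    -L cloud.google.com/gce-topology-host
```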
Requirements
The following requirements apply to an AI-optimized GKE cluster:
- For A4X, use GKE version 1.33.4-gke.1036000 or later (for minor version 1.33), or GKE version 1.32.8-gke.1108000 or later (for 1.32); a command to check the versions available in your zone follows this list. These versions ensure that A4X uses the following:
  - R580, the minimum GPU driver version for A4X.
  - Coherent Driver-based Memory Management (CDMM), which is enabled by default. NVIDIA recommends that Kubernetes clusters enable this mode to resolve memory over-reporting. CDMM allows GPU memory to be managed through the driver instead of the operating system (OS). This approach avoids OS onlining of GPU memory and exposes the GPU memory to the OS as a Non-Uniform Memory Access (NUMA) node. Multi-instance GPUs aren't supported when CDMM is enabled. For more information about CDMM, see Hardware and Software Support.
  - GPUDirect RDMA, which is recommended so that A4X node pools can use the networking capabilities of A4X.
- Ensure that you use the minimum GPU driver version, depending on the machine type:
  - A4X: The GB200 GPUs in A4X VMs require at least the R580 GPU driver version. See the previously mentioned version requirements.
  - A4: The B200 GPUs in A4 VMs require at least the R570 GPU driver version. By default, GKE automatically installs this driver version on all A4 nodes that run the required minimum version for A4, 1.32.1-gke.1729000 or later.
  - A3 Ultra: The H200 GPUs in A3 Ultra VMs require at least the R550 GPU driver version, which is available in GKE 1.31 as the `latest` driver version. For A3 Ultra, you must set `gpu-driver-version=latest` with GKE 1.31. For GKE version 1.31.5-gke.1169000 or later, GKE automatically installs R550 GPU driver versions on A3 Ultra nodes by default.
- For A3 Ultra node pools, you must set the disk type to `hyperdisk-balanced`.
- To use GPUDirect RDMA, use the following minimum versions depending on the machine type:
  - A4X: See the previously mentioned version requirements.
  - A4: Use 1.32.2-gke.1475000 or later.
  - A3 Ultra: Use 1.31.4-gke.1183000 or later.
- To use GPUDirect RDMA, the GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported. 
- You must use the reservation-bound provisioning model to create clusters with A4X. Other provisioning models are not supported. 
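To confirm that your zone offers a GKE version that meets these minimums, you can query the server configuration with the gcloud CLI, for example:

```
# List the GKE control plane and node versions available in a zone so that you
# can pick a version that meets the minimums listed previously.
gcloud container get-server-config \
    --zone=COMPUTE_ZONE \
    --format="yaml(validMasterVersions, validNodeVersions)"
```

Replace COMPUTE_ZONE with the zone where you plan to create the cluster.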
Create a cluster
Use the following instructions to create a cluster either using Cluster Toolkit or XPK.
Create a cluster using Cluster Toolkit
This section guides you through the cluster creation process, ensuring that your project follows best practices and meets the requirements for an AI-optimized GKE cluster.
A4X
- Launch Cloud Shell. You can use a different environment; however, we recommend Cloud Shell because the dependencies are already pre-installed for Cluster Toolkit. If you don't want to use Cloud Shell, follow the instructions to install dependencies to prepare a different environment.
- Clone the Cluster Toolkit from the git repository:

  ```
  cd ~
  git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
  ```
- Install the Cluster Toolkit:

  ```
  cd cluster-toolkit && git checkout main && make
  ```
- Create a Cloud Storage bucket to store the state of the Terraform deployment:

  ```
  gcloud storage buckets create gs://BUCKET_NAME \
      --default-storage-class=STANDARD \
      --project=PROJECT_ID \
      --location=COMPUTE_REGION_TERRAFORM_STATE \
      --uniform-bucket-level-access
  gcloud storage buckets update gs://BUCKET_NAME --versioning
  ```

  Replace the following variables:
- BUCKET_NAME: the name of the new Cloud Storage bucket.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
 
- In the examples/gke-a4x/gke-a4x-deployment.yaml blueprint from the GitHub repo, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment (see the example deployment file after these steps):
- DEPLOYMENT_NAME: a unique name for the deployment, which must be between 6 and 30 characters in length. If the deployment name isn't unique within a project, cluster creation fails. The default value is gke-a4x.
- BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION: the compute region for the cluster.
- COMPUTE_ZONE: the compute zone for the node pool of A4X machines. Note that this zone should match the zone where machines are available in your reservation.
- NODE_COUNT: the number of A4X nodes in your cluster's node pool, which must be 18 nodes or fewer. We recommend using 18 nodes to obtain the GPU topology of 1x72 in one subblock using an NVLink domain.
- IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that you want to use to call Terraform. For more information, see How authorized networks work.
- For the extended_reservation field, use one of the following, depending on whether you want to target specific blocks in a reservation when provisioning the node pool:
  - To place the node pool anywhere in the reservation, provide the name of your reservation (RESERVATION_NAME).
  - To target a specific block within your reservation, use the reservation and block names in the following format: RESERVATION_NAME/reservationBlocks/BLOCK_NAME

  If you don't know which blocks are available in your reservation, see View a reservation topology.
- Set the boot disk sizes for each node of the system and A4X node pools. The disk size that you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of pulling an image repeatedly, you can set a larger disk size to accommodate your framework, model, or container image:
  - SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 200.
  - A4X_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A4X node pool. The smallest allowed disk size is 10. The default value is 100.
 
  To modify advanced settings, edit the examples/gke-a4x/gke-a4x.yaml file.
- Optionally, you can enable Cluster Health Scanner (CHS) on the cluster. CHS checks the health of your GPU clusters by running tests to verify that the clusters are ready to run your workloads. To enable CHS, make the following changes in the examples/gke-a4x/gke-a4x-deployment.yaml file:
  - In the vars block, set the enable_periodic_health_checks field to true.
  - By default, the health checks run every Sunday at 12:00 AM PST. If you want to change this setting, in the vars block, set the health_check_schedule field to a suitable value, in cron format:

    ```
    * * * * *
    # | | | | |
    # | | | | day of the week (0-6) (Sunday to Saturday)
    # | | | month (1-12)
    # | | day of the month (1-31)
    # | hour (0-23)
    # minute (0-59)
    ```
 
- Generate Application Default Credentials (ADC) to provide access to Terraform. If you're using Cloud Shell, you can run the following command:

  ```
  gcloud auth application-default login
  ```
- Deploy the blueprint to provision the GKE infrastructure using A4X machine types:

  ```
  cd ~/cluster-toolkit
  ./gcluster deploy -d \
      examples/gke-a4x/gke-a4x-deployment.yaml \
      examples/gke-a4x/gke-a4x.yaml
  ```
- When prompted, select (A)pply to deploy the blueprint.
  - The blueprint creates VPC networks, a GPU RDMA VPC network, service accounts, a cluster, and a node pool.
  - To support the fio-bench-job-template job template in the blueprint, the blueprint also creates Cloud Storage buckets, network storage, and persistent volume resources.
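For reference, a filled-in examples/gke-a4x/gke-a4x-deployment.yaml might look similar to the following sketch. The key names shown here are assumptions based on the settings described previously and can differ between Cluster Toolkit releases; keep the keys that already exist in the blueprint and change only the placeholder values:

```
# Minimal sketch of a filled-in gke-a4x-deployment.yaml (key names assumed;
# keep the keys that ship with the blueprint you cloned).
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME              # Cloud Storage bucket created earlier

vars:
  deployment_name: gke-a4x           # DEPLOYMENT_NAME, 6-30 characters
  project_id: PROJECT_ID
  region: COMPUTE_REGION
  zone: COMPUTE_ZONE                 # must match your reservation's zone
  static_node_count: 18              # NODE_COUNT, 18 or fewer
  extended_reservation: RESERVATION_NAME
  authorized_cidr: IP_ADDRESS/SUFFIX
  system_node_pool_disk_size_gb: 200
  a4x_node_pool_disk_size_gb: 100
```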
 
A4
- Launch Cloud Shell. You can use a different environment; however, we recommend Cloud Shell because the dependencies are already pre-installed for Cluster Toolkit. If you don't want to use Cloud Shell, follow the instructions to install dependencies to prepare a different environment.
- Clone the Cluster Toolkit from the git repository:

  ```
  cd ~
  git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
  ```
- Install the Cluster Toolkit:

  ```
  cd cluster-toolkit && git checkout main && make
  ```
- Create a Cloud Storage bucket to store the state of the Terraform deployment:

  ```
  gcloud storage buckets create gs://BUCKET_NAME \
      --default-storage-class=STANDARD \
      --project=PROJECT_ID \
      --location=COMPUTE_REGION_TERRAFORM_STATE \
      --uniform-bucket-level-access
  gcloud storage buckets update gs://BUCKET_NAME --versioning
  ```

  Replace the following variables:
- BUCKET_NAME: the name of the new Cloud Storage bucket.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
 
- The files that you need to edit to create a cluster depend on the consumption option that you're using for your deployment. Select the tab that corresponds to your consumption option's provisioning model.

  Reservation-bound

  In the examples/gke-a4/gke-a4-deployment.yaml blueprint from the GitHub repo, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
- DEPLOYMENT_NAME: a unique name for the deployment, which must be between 6 and 30 characters in length. If the deployment name isn't unique within a project, cluster creation fails. The default value is gke-a4.
- BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION: the compute region for the cluster.
- COMPUTE_ZONE: the compute zone for the node pool of A4 machines. Note that this zone should match the zone where machines are available in your reservation.
- NODE_COUNT: the number of A4 nodes in your cluster.
- IP_ADDRESS/SUFFIX: The IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that you want to use to call Terraform. For more information, see How authorized networks work.
- For the reservation field, use one of the following, depending on whether you want to target specific blocks in a reservation when provisioning the node pool:
  - To place the node pool anywhere in the reservation, provide the name of your reservation (RESERVATION_NAME).
  - To target a specific block within your reservation, use the reservation and block names in the following format: RESERVATION_NAME/reservationBlocks/BLOCK_NAME

  If you don't know which blocks are available in your reservation, see View a reservation topology.
- Set the boot disk sizes for each node of the system and A4 node pools. The disk size that you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of pulling an image repeatedly, you can set a larger disk size to accommodate your framework, model, or container image:
  - SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 100.
  - A4_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A4 node pool. The smallest allowed disk size is 10. The default value is 100.
 
  To modify advanced settings, edit examples/gke-a4/gke-a4.yaml.

  Flex-start

  In the examples/gke-a4/gke-a4-deployment.yaml blueprint from the GitHub repo, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
- DEPLOYMENT_NAME: a unique name for the deployment, which must be between 6 and 30 characters in length. If the deployment name isn't unique within a project, cluster creation fails. The default value is gke-a4.
- BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION: the compute region for the cluster.
- COMPUTE_ZONE: the compute zone for the node pool of A4 machines.
- Remove static_node_count.
- IP_ADDRESS/SUFFIX: The IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that you want to use to call Terraform. For more information, see How authorized networks work.
- Remove the reservation field and replace it with enable_flex_start: true. On the next line, add enable_queued_provisioning: true if you also want to use queued provisioning. For more information, see Use node pools with flex-start with queued provisioning.
- Set the boot disk sizes for each node of the system and A4 node pools. The disk size that you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of pulling an image repeatedly, you can set a larger disk size to accommodate your framework, model, or container image:
  - SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 100.
  - A4_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A4 node pool. The smallest allowed disk size is 10. The default value is 100.
 
 
- In the examples/gke-a4/gke-a4.yaml blueprint from the GitHub repo, make the following changes:
  - In the vars block, remove static_node_count.
  - In the vars block, make sure the version_prefix number is "1.32." or higher. To use flex-start in GKE, your cluster must use version 1.32.2-gke.1652000 or later.
  - In the vars block, replace the entire reservation block (including the reservation line itself) with enable_flex_start: true and, optionally, enable_queued_provisioning: true.
  - In the vars block, if you don't require queued provisioning, remove the following line: kueue_configuration_path: $(ghpc_stage("./kueue-configuration.yaml.tftpl")).
  - Under id: a4-pool, remove the following line: static_node_count: $(vars.static_node_count).
  - Under id: a4-pool, remove the reservation_affinity block. Replace this block with the following lines:
    - enable_flex_start: $(vars.enable_flex_start)
    - auto_repair: false
  - If you want to enable queued provisioning, add the following additional lines:
    - enable_queued_provisioning: $(vars.enable_queued_provisioning)
    - autoscaling_total_min_nodes: 0
 
 
- Under id: workload-manager-install, remove the following block:

  ```
  kueue:
    install: true
    config_path: $(vars.kueue_configuration_path)
    config_template_vars:
      num_gpus: $(a4-pool.static_gpu_count)
      accelerator_type: $(vars.accelerator_type)
  ```

  For flex-start with queued provisioning, do the following (see the example Job after these steps):
  - Add gpu_nominal_quota: NOMINAL_QUOTA to the vars block. The gpu_nominal_quota value is used to set the nominalQuota of GPUs in the ClusterQueue spec (see the ClusterQueue in the kueue-configuration.yaml.tftpl content in a following step). In this example, the ClusterQueue only admits workloads if the sum of the GPU requests is less than or equal to the NOMINAL_QUOTA value. For more information, see the Kueue documentation for ClusterQueue.
  - Update the kueue block to the following:

    ```
    kueue:
      install: true
      config_path: $(vars.kueue_configuration_path)
      config_template_vars:
        num_gpus: $(vars.gpu_nominal_quota)
    ```
  - Replace the content of the kueue-configuration.yaml.tftpl file with the following:

    ```
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: "default-flavor"
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: AdmissionCheck
    metadata:
      name: dws-prov
    spec:
      controllerName: kueue.x-k8s.io/provisioning-request
      parameters:
        apiGroup: kueue.x-k8s.io
        kind: ProvisioningRequestConfig
        name: dws-config
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ProvisioningRequestConfig
    metadata:
      name: dws-config
    spec:
      provisioningClassName: queued-provisioning.gke.io
      managedResources:
      - nvidia.com/gpu
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: "dws-cluster-queue"
    spec:
      namespaceSelector: {}
      resourceGroups:
      - coveredResources: ["nvidia.com/gpu"]
        flavors:
        - name: "default-flavor"
          resources:
          - name: "nvidia.com/gpu"
            nominalQuota: ${num_gpus}
      admissionChecks:
      - dws-prov
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      namespace: "default"
      name: "dws-local-queue"
    spec:
      clusterQueue: "dws-cluster-queue"
    ---
    ```
 
 
- Under id: job-template, replace the node_count variable with 2.
 
  Spot

  In the examples/gke-a4/gke-a4-deployment.yaml blueprint from the GitHub repo, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
- DEPLOYMENT_NAME: a unique name for the deployment, which must be between 6 and 30 characters in length. If the deployment name isn't unique within a project, cluster creation fails. The default value is gke-a4.
- BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION: the compute region for the cluster.
- COMPUTE_ZONE: the compute zone for the node pool of A4 machines.
- STATIC_NODE_COUNT: the number of A4 nodes in your cluster.
- IP_ADDRESS/SUFFIX: The IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that you want to use to call Terraform. For more information, see How authorized networks work.
- Replace the entire reservation block (including the reservation line itself) with spot: true.
- Set the boot disk sizes for each node of the system and A4 node pools. The disk size that you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of pulling an image repeatedly, you can set a larger disk size to accommodate your framework, model, or container image:
  - SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 100.
  - A4_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A4 node pool. The smallest allowed disk size is 10. The default value is 100.
 
 
- In the examples/gke-a4/gke-a4.yaml blueprint from the GitHub repo, make the following changes:
  - In the vars block, replace the entire reservation block (including the reservation line itself) with spot: true.
  - Under id: a4-pool, remove the reservation_affinity block. Replace this block with the following line: spot: $(vars.spot)
 
 
 
- Optionally, you can enable Cluster Health Scanner (CHS) on the cluster. CHS checks the health of your GPU clusters by running tests to verify that the clusters are ready to run your workloads. To enable CHS, make the following changes in the examples/gke-a4/gke-a4-deployment.yaml file:
  - In the vars block, set the enable_periodic_health_checks field to true.
  - By default, the health checks run every Sunday at 12:00 AM PST. If you want to change this setting, in the vars block, set the health_check_schedule field to a suitable value, in cron format:

    ```
    * * * * *
    # | | | | |
    # | | | | day of the week (0-6) (Sunday to Saturday)
    # | | | month (1-12)
    # | | day of the month (1-31)
    # | hour (0-23)
    # minute (0-59)
    ```
 
- Generate Application Default Credentials (ADC) to provide access to Terraform. If you're using Cloud Shell, you can run the following command:

  ```
  gcloud auth application-default login
  ```
- Deploy the blueprint to provision the GKE infrastructure using A4 machine types:

  ```
  cd ~/cluster-toolkit
  ./gcluster deploy -d \
      examples/gke-a4/gke-a4-deployment.yaml \
      examples/gke-a4/gke-a4.yaml
  ```
- When prompted, select (A)pply to deploy the blueprint.
  - The blueprint creates VPC networks, a GPU RDMA VPC network, service accounts, a cluster, and a node pool.
  - To support the fio-bench-job-template job template in the blueprint, the blueprint also creates Cloud Storage buckets, network storage, and persistent volume resources.
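If you enabled flex-start with queued provisioning, workloads are admitted through the dws-local-queue LocalQueue defined in the Kueue configuration. The following is a minimal sketch of a Job that requests GPUs through that queue; the Job name, container image, and GPU count are placeholders, and your workload might need additional scheduling settings:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-gpu-job                         # placeholder name
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue # LocalQueue created earlier
spec:
  suspend: true                                # Kueue unsuspends the Job when it is admitted
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: gpu-test                         # placeholder container
        image: nvidia/cuda:12.4.1-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 8                  # counted against the ClusterQueue nominalQuota
```

Apply the manifest with kubectl apply; Kueue admits the Job after the requested capacity is provisioned.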
 
A3 Ultra
- Launch Cloud Shell. You can use a different environment; however, we recommend Cloud Shell because the dependencies are already pre-installed for Cluster Toolkit. If you don't want to use Cloud Shell, follow the instructions to install dependencies to prepare a different environment.
- Clone the Cluster Toolkit from the git repository:

  ```
  cd ~
  git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
  ```
- Install the Cluster Toolkit:

  ```
  cd cluster-toolkit && git checkout main && make
  ```
- Create a Cloud Storage bucket to store the state of the Terraform deployment:

  ```
  gcloud storage buckets create gs://BUCKET_NAME \
      --default-storage-class=STANDARD \
      --project=PROJECT_ID \
      --location=COMPUTE_REGION_TERRAFORM_STATE \
      --uniform-bucket-level-access
  gcloud storage buckets update gs://BUCKET_NAME --versioning
  ```

  Replace the following variables:
- BUCKET_NAME: the name of the new Cloud Storage bucket.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
 
- The files that you need to edit to create a cluster depend on the consumption option that you're using for your deployment. Select the tab that corresponds to your consumption option's provisioning model.

  Reservation-bound

  In the examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml blueprint from the GitHub repo, replace the following variables in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
- DEPLOYMENT_NAME: a unique name for the deployment, which must be between 6 and 30 characters in length. If the deployment name isn't unique within a project, cluster creation fails.
- BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION: the compute region for the cluster.
- COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines. Note that this zone should match the zone where machines are available in your reservation.
- NODE_COUNT: the number of A3 Ultra nodes in your cluster.
- IP_ADDRESS/SUFFIX: The IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that you want to use to call Terraform. For more information, see How authorized networks work.
- For the reservation field, use one of the following, depending on whether you want to target specific blocks in a reservation when provisioning the node pool:
  - To place the node pool anywhere in the reservation, provide the name of your reservation (RESERVATION_NAME).
  - To target a specific block within your reservation, use the reservation and block names in the following format: RESERVATION_NAME/reservationBlocks/BLOCK_NAME

  If you don't know which blocks are available in your reservation, see View a reservation topology.
- Set the boot disk sizes for each node of the system and A3 Ultra node pools. The disk size that you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of pulling an image repeatedly, you can set a larger disk size to accommodate your framework, model, or container image:
  - SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 100.
  - A3ULTRA_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A3 Ultra node pool. The smallest allowed disk size is 10. The default value is 100.
 
  To modify advanced settings, edit examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml.

  Flex-start

  In the examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml blueprint from the GitHub repo, replace the following variables in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
- DEPLOYMENT_NAME: a unique name for the deployment, which must be between 6 and 30 characters in length. If the deployment name isn't unique within a project, cluster creation fails.
- BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION: the compute region for the cluster.
- COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines.
- Remove static_node_count.
- IP_ADDRESS/SUFFIX: The IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that you want to use to call Terraform. For more information, see How authorized networks work.
- Remove the reservation field and replace it with enable_flex_start: true. On the next line, add enable_queued_provisioning: true if you also want to use queued provisioning. For more information, see Use node pools with flex-start with queued provisioning.
- Set the boot disk sizes for each node of the system and A3 Ultra node pools. The disk size that you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of pulling an image repeatedly, you can set a larger disk size to accommodate your framework, model, or container image:
  - SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 100.
  - A3ULTRA_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A3 Ultra node pool. The smallest allowed disk size is 10. The default value is 100.
 
 
- In the examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml blueprint from the GitHub repo, make the following changes:
  - In the vars block, remove static_node_count.
  - In the vars block, update the version_prefix number to "1.32." or higher. To use flex-start in GKE, your cluster must use version 1.32.2-gke.1652000 or later.
  - In the vars block, replace the entire reservation block (including the reservation line itself) with enable_flex_start: true and, optionally, enable_queued_provisioning: true.
  - In the vars block, if you don't require queued provisioning, remove the following line: kueue_configuration_path: $(ghpc_stage("./kueue-configuration.yaml.tftpl")).
  - Under id: a3-ultragpu-pool, remove the following line: static_node_count: $(vars.static_node_count).
  - Under id: a3-ultragpu-pool, remove the reservation_affinity block. Replace this block with the following lines:
    - enable_flex_start: $(vars.enable_flex_start)
    - auto_repair: false
  - If you want to enable queued provisioning, add the following additional lines:
    - enable_queued_provisioning: $(vars.enable_queued_provisioning)
    - autoscaling_total_min_nodes: 0
 
 
- Under id: workload-manager-install, remove the following block:

  ```
  kueue:
    install: true
    config_path: $(vars.kueue_configuration_path)
    config_template_vars:
      num_gpus: $(a3-ultragpu-pool.static_gpu_count)
      accelerator_type: $(vars.accelerator_type)
  ```

  For flex-start with queued provisioning, follow these three steps:
  - Add gpu_nominal_quota: NOMINAL_QUOTA to the vars block. The gpu_nominal_quota value is used to set the nominalQuota of GPUs in the ClusterQueue specification. In this example, the ClusterQueue only admits workloads if the sum of the GPU requests is less than or equal to the NOMINAL_QUOTA value. For more information, see the Kueue documentation for ClusterQueue.
  - Update the kueue block to the following:

    ```
    kueue:
      install: true
      config_path: $(vars.kueue_configuration_path)
      config_template_vars:
        num_gpus: $(vars.gpu_nominal_quota)
    ```
  - Replace the content of the kueue-configuration.yaml.tftpl file with the following:

    ```
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: "default-flavor"
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: AdmissionCheck
    metadata:
      name: dws-prov
    spec:
      controllerName: kueue.x-k8s.io/provisioning-request
      parameters:
        apiGroup: kueue.x-k8s.io
        kind: ProvisioningRequestConfig
        name: dws-config
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ProvisioningRequestConfig
    metadata:
      name: dws-config
    spec:
      provisioningClassName: queued-provisioning.gke.io
      managedResources:
      - nvidia.com/gpu
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: "dws-cluster-queue"
    spec:
      namespaceSelector: {}
      resourceGroups:
      - coveredResources: ["nvidia.com/gpu"]
        flavors:
        - name: "default-flavor"
          resources:
          - name: "nvidia.com/gpu"
            nominalQuota: ${num_gpus}
      admissionChecks:
      - dws-prov
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      namespace: "default"
      name: "dws-local-queue"
    spec:
      clusterQueue: "dws-cluster-queue"
    ---
    ```
 
- Under id: job-template, replace the node_count variable with 2.
 
 
  Spot

  In the examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml blueprint from the GitHub repo, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
- DEPLOYMENT_NAME: a unique name for the deployment, which must be between 6 and 30 characters in length. If the deployment name isn't unique within a project, cluster creation fails.
- BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION: the compute region for the cluster.
- COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines.
- STATIC_NODE_COUNT: the number of A3 Ultra nodes in your cluster.
- IP_ADDRESS/SUFFIX: The IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that you want to use to call Terraform. For more information, see How authorized networks work.
- Replace the entire reservation block (including the reservation line itself) with spot: true.
- Set the boot disk sizes for each node of the system and A3 Ultra node pools. The disk size that you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of pulling an image repeatedly, you can set a larger disk size to accommodate your framework, model, or container image:
  - SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 100.
  - A3ULTRA_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A3 Ultra node pool. The smallest allowed disk size is 10. The default value is 100.
 
 
- In the examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml blueprint from the GitHub repo, make the following changes:
  - In the vars block, replace the entire reservation block (including the reservation line itself) with spot: true.
  - Under id: a3-ultragpu-pool, remove the reservation_affinity block. Replace this block with the following line: spot: $(vars.spot)
 
 
 
- Optionally, you can enable Cluster Health Scanner (CHS) on the cluster. CHS checks the health of your GPU clusters by running tests to verify that the clusters are ready to run your workloads. To enable CHS, make the following changes in the examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml file:
  - In the vars block, set the enable_periodic_health_checks field to true.
  - By default, the health checks run every Sunday at 12:00 AM PST. If you want to change this setting, in the vars block, set the health_check_schedule field to a suitable value, in cron format:

    ```
    * * * * *
    # | | | | |
    # | | | | day of the week (0-6) (Sunday to Saturday)
    # | | | month (1-12)
    # | | day of the month (1-31)
    # | hour (0-23)
    # minute (0-59)
    ```
 
- Generate Application Default Credentials (ADC) to provide access to Terraform. If you're using Cloud Shell, you can run the following command:

  ```
  gcloud auth application-default login
  ```
- Deploy the blueprint to provision the GKE infrastructure using A3 Ultra machine types:

  ```
  cd ~/cluster-toolkit
  ./gcluster deploy -d \
      examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml \
      examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml
  ```
- When prompted, select (A)pply to deploy the blueprint.
  - The blueprint creates VPC networks, a GPU RDMA VPC network, service accounts, a cluster, and a node pool.
  - To support the fio-bench-job-template job template in the blueprint, the blueprint also creates Cloud Storage buckets, network storage, and persistent volume resources.
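After the deployment finishes, you can point kubectl at the new cluster and confirm that the GPU nodes registered. This is a minimal sketch; the cluster name is derived from your DEPLOYMENT_NAME, so adjust the placeholders to match your deployment:

```
# Fetch credentials for the new cluster (names are placeholders).
gcloud container clusters get-credentials CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --project=PROJECT_ID

# Confirm that the expected number of nodes are Ready.
kubectl get nodes

# Confirm that GPUs are exposed as allocatable resources on the GPU nodes.
kubectl describe nodes | grep "nvidia.com/gpu"
```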
 
Create a cluster and run workloads using XPK
Accelerated Processing Kit (XPK) lets you quickly provision and utilize clusters. XPK generates preconfigured, training-optimized infrastructure, ideal for when workload execution is your primary focus.
Create a cluster and run workloads with A3 Ultra VMs using XPK:
- Install the required tools to meet the XPK prerequisites.
- Copy the version number of the latest tagged release of XPK, for example, "v0.8.0". In the following command, replace XPK_TAG with that version number.
- Open a shell window on a Linux machine, and enter the following commands to clone XPK from the Git repository and install the required packages:

  ```
  ## Setup virtual environment.
  VENV_DIR=~/venvp3
  python3 -m venv $VENV_DIR
  source $VENV_DIR/bin/activate

  ## Clone the repository.
  git clone --branch XPK_TAG https://github.com/google/xpk.git
  cd xpk

  ## Install required packages
  make install && export PATH=$PATH:$PWD/bin
  ```
- Create a Standard cluster using A3 Ultra VMs. You can provision the cluster's nodes using reserved capacity:

  ```
  python3 xpk.py cluster create \
      --cluster=CLUSTER_NAME \
      --device-type=h200-141gb-8 \
      --zone=COMPUTE_ZONE \
      --project=PROJECT_ID \
      --num-nodes=NUM_NODES \
      --reservation=RESERVATION_NAME
  ```

  Replace the following variables:
- CLUSTER_NAME: a name for the cluster.
- COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines. To use reserved capacity, ensure that you use the zone where you reserved the capacity. We also generally recommend choosing a zone near your users to minimize latency.
- PROJECT_ID: your Google Cloud project ID.
- NUM_NODES: the number of worker nodes in the node pool.
- RESERVATION_NAME: the name of your reservation.

  XPK offers additional arguments for cluster creation, including those for creating private clusters, creating Vertex AI TensorBoards, and using node auto-provisioning. For more information, refer to the cluster creation guide for XPK.
 
- Verify that the cluster was created successfully:

  ```
  python3 xpk.py cluster list --zone=COMPUTE_ZONE --project=PROJECT_ID
  ```
- Optional: Run a workload to test the cluster environment (commands to list and delete workloads follow these steps):

  ```
  python3 xpk.py workload create \
      --workload WORKLOAD_NAME --command "echo goodbye" \
      --cluster CLUSTER_NAME \
      --device-type=h200-141gb-8 \
      --num-nodes=WORKLOAD_NUM_NODES \
      --zone=COMPUTE_ZONE \
      --project=PROJECT_ID
  ```

  Replace the following variables:
- WORKLOAD_NAME: the name of your workload.
- CLUSTER_NAME: the name of the cluster.
- WORKLOAD_NUM_NODES: number of worker nodes used for workload execution.
- COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines.
- PROJECT_ID: your Google Cloud project ID.
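To check on or clean up test workloads, XPK also provides workload list and workload delete subcommands. The following is a minimal sketch; flags can vary between XPK releases, so confirm them with python3 xpk.py --help:

```
# List the workloads running on the cluster.
python3 xpk.py workload list \
    --cluster CLUSTER_NAME \
    --zone=COMPUTE_ZONE \
    --project=PROJECT_ID

# Delete the test workload when you no longer need it.
python3 xpk.py workload delete \
    --workload WORKLOAD_NAME \
    --cluster CLUSTER_NAME \
    --zone=COMPUTE_ZONE \
    --project=PROJECT_ID
```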
 
Test network performance
We recommend that you validate the functionality of provisioned clusters. To do so, use NCCL/gIB tests, which are NVIDIA Collective Communications Library (NCCL) tests that are optimized for the Google environment.
Run reproducible benchmarks
You can reproduce pre-training benchmarks for large machine learning open models on A4 and A3 Ultra VMs on GKE.
Each recipe provides you with the instructions to complete the following tasks:
- Prepare your environment.
- Run the benchmark.
- Analyze the benchmark results, which include detailed logs for further analysis.
To view all the recipes available, see the GPU recipes GitHub repository.
| Models | Framework | Recipe | 
|---|---|---|
| Llama-3.1-70B | MaxText | 32 node workload | 
| Llama-3.1-70B | NeMo | 32 node workload | 
| Mixtral-8-7B | NeMo | 32 node workload | 
Clean up resources created by Cluster Toolkit
To avoid recurring charges for the resources used on this page, clean up the resources provisioned by Cluster Toolkit, including the VPC networks and GKE cluster:
   ```
   cd ~/cluster-toolkit
   ./gcluster destroy CLUSTER_NAME/
   ```

   Replace CLUSTER_NAME with the name of your cluster. For clusters created with Cluster Toolkit, the cluster name is based on the DEPLOYMENT_NAME.
What's next
- To learn about scheduling workloads on your GKE clusters using TAS and Kueue, see Schedule GKE workloads with Topology Aware Scheduling.
- To learn about managing common events relevant to GKE clusters and AI workloads, see Manage AI-optimized GKE clusters.
- For information about testing your environment for proper setup and optimization, see Cluster networking optimization overview.