Deploy an A3 Mega GKE cluster for ML training

This document describes how to provision an A3 Mega (a3-megagpu-8g) Google Kubernetes Engine (GKE) cluster suited to large-scale artificial intelligence (AI) and machine learning (ML) training workloads.

Before you begin

  1. In the Google Cloud console, activate Cloud Shell.

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  2. To identify the regions and zones where the a3-megagpu-8g machine type is available, run the following command:

    gcloud compute machine-types list --filter="name=a3-megagpu-8g"
    
  3. Ensure that you have enough GPU quota. Each a3-megagpu-8g machine has 8 H100 80GB GPUs attached, so you need quota for at least 8 NVIDIA H100 80GB GPUs per machine in your selected region.

    1. To view quotas, see View the quotas for your project. In the Filter field, select Dimensions (for example, location) and specify gpu_family:NVIDIA_H100_MEGA.
    2. If you don't have enough quota, request a higher quota.
  4. Ensure that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity. If you don't have enough quota, request a quota increase.
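The GPU quota requirement grows with the size of the cluster. As a rough sketch, the following shell function multiplies the 8 GPUs attached to each a3-megagpu-8g machine by the number of VMs that you plan to deploy. The function name required_gpus is hypothetical and is used only for illustration.

```shell
# Hypothetical helper: GPU quota needed for a given number of a3-megagpu-8g VMs.
GPUS_PER_VM=8  # each a3-megagpu-8g machine has 8 H100 80GB GPUs attached

required_gpus() {
  vm_count="$1"
  # Accept only a positive integer without leading zeros.
  case "$vm_count" in
    ''|*[!0-9]*|0*)
      echo "VM count must be a positive integer" >&2
      return 1
      ;;
  esac
  echo $(( vm_count * GPUS_PER_VM ))
}
```

For example, required_gpus 4 prints 32, which is the minimum H100 quota to request for a four-VM cluster.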

Overview

To deploy the cluster, you must complete the following:

  1. Install Cluster Toolkit
  2. Switch to the Cluster Toolkit directory
  3. Create a reservation or get a reservation name from your Technical Account Manager (TAM)
  4. Get the IP address for your host machine
  5. Update the blueprint file
  6. Build the Cluster Toolkit binary
  7. Provision the GKE cluster

Install Cluster Toolkit

From the CLI, complete the following steps:

  1. Install dependencies.

  2. Set up Cluster Toolkit.

Switch to the Cluster Toolkit directory

After you have installed the Cluster Toolkit, ensure that you are in the Cluster Toolkit directory.

To go to the main Cluster Toolkit blueprint's working directory, run the following command:

cd cluster-toolkit

Create a reservation

If you don't have a reservation provided by a Technical Account Manager (TAM), we recommend creating a reservation. For more information, see Reservations.

Reservations incur ongoing costs even after the GKE cluster is destroyed. To manage your costs, delete any reservation that you no longer need.

To create a reservation, run the gcloud compute reservations create command and ensure that you specify the --require-specific-reservation flag.

gcloud compute reservations create RESERVATION_NAME \
    --require-specific-reservation \
    --project=PROJECT_ID \
    --machine-type=a3-megagpu-8g \
    --vm-count=NUMBER_OF_VMS \
    --zone=ZONE

Replace the following:

  • RESERVATION_NAME: a name for your reservation.
  • PROJECT_ID: your project ID.
  • NUMBER_OF_VMS: the number of VMs needed for the cluster.
  • ZONE: a zone that has a3-megagpu-8g machine types.
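As an illustration, the placeholders can be kept in shell variables so that the same values are reused in later steps. The values below (a3mega-reservation, my-project, 2 VMs, us-central1-a) are made-up examples, not recommendations, and the function name print_reservation_cmd is hypothetical. The function only prints the command so that you can review it before running it.

```shell
# Hypothetical example values; replace them with your own.
RESERVATION_NAME="a3mega-reservation"
PROJECT_ID="my-project"
NUMBER_OF_VMS=2
ZONE="us-central1-a"  # assumption: confirm that this zone offers a3-megagpu-8g

# Print the reservation command on one line instead of running it.
print_reservation_cmd() {
  printf '%s\n' "gcloud compute reservations create ${RESERVATION_NAME} \
--require-specific-reservation \
--project=${PROJECT_ID} \
--machine-type=a3-megagpu-8g \
--vm-count=${NUMBER_OF_VMS} \
--zone=${ZONE}"
}
```

After checking the printed command, copy it into Cloud Shell to create the reservation.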

Get the IP address for your host machine

To get the IP address, run the following command:

curl ifconfig.me
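The blueprint expects this address in CIDR form with a /32 suffix. The following is a minimal sketch of a check that the returned value looks like an IPv4 address before the suffix is appended. It assumes that your host has an IPv4 address, and the function name to_cidr is hypothetical.

```shell
# Hypothetical helper: validate an IPv4 address and append /32.
to_cidr() {
  ip="$1"
  # Require four dot-separated groups of one to three digits.
  if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
  # Split on dots and check that each group is in the range 0-255.
  old_ifs="$IFS"
  IFS=.
  set -- $ip
  IFS="$old_ifs"
  for octet in "$@"; do
    if [ "$octet" -gt 255 ]; then
      echo "octet out of range: $octet" >&2
      return 1
    fi
  done
  echo "${ip}/32"
}
```

One possible use is to_cidr "$(curl -s ifconfig.me)", which prints the address followed by /32 or fails with an error message.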

Update the blueprint file

Update the gke-a3-megagpu.yaml blueprint and specify the deployment variables, such as project_id, region, and zone.

    vars:
      project_id: PROJECT_ID
      region: REGION
      zone: ZONE
      authorized_cidr: IP_ADDRESS/32

Replace the following:

  • PROJECT_ID: your project ID.
  • REGION: a region that has a3-megagpu-8g machine types.
  • ZONE: a zone that has a3-megagpu-8g machine types.
  • IP_ADDRESS: the IP address of your host machine.
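To show what the filled-in variables might look like, the sketch below prints a vars fragment from concrete values and rejects a zone that does not belong to the given region. It covers only the four variables named in this section; the blueprint may define other variables, which this sketch does not touch. The function name print_vars_block and the sample values are hypothetical.

```shell
# Hypothetical helper: print the vars fragment with concrete values.
print_vars_block() {
  project="$1"
  region="$2"
  zone="$3"
  ip="$4"
  # A zone name is its region name followed by a suffix such as -a.
  case "$zone" in
    "$region"-?*) ;;
    *)
      echo "zone $zone is not in region $region" >&2
      return 1
      ;;
  esac
  cat <<EOF
vars:
  project_id: ${project}
  region: ${region}
  zone: ${zone}
  authorized_cidr: ${ip}/32
EOF
}
```

For example, print_vars_block my-project us-central1 us-central1-a 203.0.113.7 prints a fragment whose last line is authorized_cidr: 203.0.113.7/32.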

Build the Cluster Toolkit binary

  1. To build the Cluster Toolkit binary from source, from Cloud Shell run the following command:

    make

  2. To verify the build, from Cloud Shell run the following command:

    ./gcluster --version

    The output shows you the version of the Cluster Toolkit that you are using.

Provision the GKE cluster

You can now run the gke-a3-megagpu.yaml blueprint to provision the cluster. This step takes approximately 5-10 minutes.

To provision the cluster, run the following command from the Cluster Toolkit directory:

    ./gcluster deploy examples/gke-a3-megagpu.yaml --vars project_id=PROJECT_ID

Replace PROJECT_ID with your project ID.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, delete the Google Cloud project with the resources.

Destroy the GKE cluster

To destroy all resources associated with creating the GKE cluster, from Cloud Shell run the following command:

./gcluster destroy gke-a3-megagpu.yaml
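Because a reservation keeps incurring costs after the cluster is destroyed, you may also want to delete a reservation that you created for this cluster. The following sketch prints a gcloud compute reservations delete command for review rather than running it. The function name print_reservation_delete_cmd and the sample values are hypothetical.

```shell
# Hypothetical helper: print the command that deletes a reservation.
# Arguments: reservation name, project ID, zone.
print_reservation_delete_cmd() {
  printf 'gcloud compute reservations delete %s --project=%s --zone=%s\n' \
    "$1" "$2" "$3"
}
```

For example, print_reservation_delete_cmd a3mega-reservation my-project us-central1-a prints the command to run in Cloud Shell once you have confirmed that the reservation is no longer needed.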