This document outlines the deployment steps for provisioning an A3 Mega
(a3-megagpu-8g) Google Kubernetes Engine (GKE) cluster that is ideal for
running large-scale artificial intelligence (AI) and machine learning (ML)
training workloads.
Before you begin
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
To identify the regions and zones where the a3-megagpu-8g machine type is available, run the following command:
gcloud compute machine-types list --filter="name=a3-megagpu-8g"
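The command prints one row per zone. As a minimal sketch, assuming that a zone name is its region name plus one final hyphen-separated suffix (for example, us-central1-a belongs to us-central1), the hypothetical helper zones_to_regions below collapses a list of zones into unique regions. The zone names in the example are illustrative input only and say nothing about where a3-megagpu-8g is actually available.

```shell
#!/bin/sh
# zones_to_regions (hypothetical helper, not part of gcloud):
# reads zone names on stdin, one per line, and prints the unique regions.
# Assumption: a zone name is REGION plus a final "-suffix".
zones_to_regions() {
  sed 's/-[^-]*$//' | sort -u
}

# Illustrative sample input, not real availability data.
printf '%s\n' us-central1-a us-central1-b europe-west4-b | zones_to_regions

# With real data, the pipeline might look like this:
#   gcloud compute machine-types list --filter="name=a3-megagpu-8g" \
#     --format="value(zone)" | zones_to_regions
```

The resulting region list is a convenient starting point for the quota checks that follow, because quotas are granted per region.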
Ensure that you have enough GPU quota. Each a3-megagpu-8g machine has 8 H100 80GB GPUs attached, so you'll need at least 8 NVIDIA H100 80GB GPUs in your selected region.
- To view quotas, see View the quotas for your project. In the Filter field, select Dimensions (e.g. location) and specify gpu_family:NVIDIA_H100_MEGA.
- If you don't have enough quota, request a higher quota.
Ensure that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity. If you don't have enough quota, request a quota increase.
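The GPU quota requirement grows with the number of machines: at 8 GPUs per a3-megagpu-8g machine, a cluster of N machines needs 8 × N GPUs of quota. The following sketch does that arithmetic; required_h100_gpus is a hypothetical helper name, and the machine count in the example is arbitrary.

```shell
#!/bin/sh
# GPUs attached to one a3-megagpu-8g machine, as stated above.
GPUS_PER_VM=8

# required_h100_gpus NUMBER_OF_VMS (hypothetical helper)
# Prints the minimum H100 GPU quota needed for NUMBER_OF_VMS machines.
required_h100_gpus() {
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: required_h100_gpus NUMBER_OF_VMS" >&2
      return 1 ;;
  esac
  echo $(( $1 * $GPUS_PER_VM ))
}

# Example: quota needed for a 4-machine cluster.
required_h100_gpus 4
```

Compare the printed number with the limit shown for gpu_family:NVIDIA_H100_MEGA in your selected region.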
Overview
To deploy the cluster, you must complete the following:
- Install Cluster Toolkit
- Switch to the Cluster Toolkit directory
- Create a reservation or get a reservation name from your Technical Account Manager (TAM)
- Get the IP address for your host machine
- Update the blueprint file
- Build the Cluster Toolkit binary
- Provision the GKE cluster
Install Cluster Toolkit
From the CLI, complete the following steps:
- Install dependencies.
- Set up Cluster Toolkit.
Switch to the Cluster Toolkit directory
After you have installed the Cluster Toolkit, ensure that you are in the Cluster Toolkit directory.
To go to the main Cluster Toolkit blueprint's working directory, run the following command.
cd cluster-toolkit
Create a reservation
If you don't have a reservation provided by a Technical Account Manager (TAM), we recommend creating a reservation. For more information, see Reservations.
Reservations incur ongoing costs even after the GKE cluster is destroyed. To manage your costs, we recommend the following options:
- Track spending by using budget alerts.
- Delete reservations when you're done with them. To delete a reservation, see delete your reservation.
To create a reservation, run the gcloud compute reservations create command and ensure that you specify the --require-specific-reservation flag.
gcloud compute reservations create RESERVATION_NAME \
    --require-specific-reservation \
    --project=PROJECT_ID \
    --machine-type=a3-megagpu-8g \
    --vm-count=NUMBER_OF_VMS \
    --zone=ZONE
Replace the following:
- RESERVATION_NAME: a name for your reservation.
- PROJECT_ID: your project ID.
- NUMBER_OF_VMS: the number of VMs needed for the cluster.
- ZONE: a zone that has a3-megagpu-8g machine types.
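Because a reservation incurs ongoing costs, it can help to review the fully substituted command before you run it. The following sketch, built around a hypothetical helper named print_reservation_cmd, validates its arguments and prints the command without calling gcloud. The example values are placeholders, not recommendations.

```shell
#!/bin/sh
# print_reservation_cmd NAME PROJECT_ID NUMBER_OF_VMS ZONE
# Hypothetical helper: validates its arguments and prints the reservation
# command from the text above. It does not call gcloud itself.
print_reservation_cmd() {
  if [ "$#" -ne 4 ]; then
    echo "usage: print_reservation_cmd NAME PROJECT_ID NUMBER_OF_VMS ZONE" >&2
    return 1
  fi
  # NUMBER_OF_VMS must consist of digits only...
  case "$3" in
    ''|*[!0-9]*)
      echo "NUMBER_OF_VMS must be a positive integer" >&2
      return 1 ;;
  esac
  # ...and be at least 1.
  if [ "$3" -lt 1 ]; then
    echo "NUMBER_OF_VMS must be a positive integer" >&2
    return 1
  fi
  printf 'gcloud compute reservations create %s \\\n' "$1"
  printf '    --require-specific-reservation \\\n'
  printf '    --project=%s \\\n' "$2"
  printf '    --machine-type=a3-megagpu-8g \\\n'
  printf '    --vm-count=%s \\\n' "$3"
  printf '    --zone=%s\n' "$4"
}

# Placeholder values for illustration only.
print_reservation_cmd my-reservation my-project 2 us-central1-a
```

After you have checked the printed command, you can paste it into Cloud Shell to create the reservation.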
Get the IP address for your host machine
To get the IP address, run the following command.
curl ifconfig.me
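The blueprint expects this address in CIDR form with a /32 suffix. As a minimal sketch, the hypothetical helper to_authorized_cidr below checks that a value is a dotted-quad IPv4 address and appends the suffix. It assumes that the address you want to authorize is IPv4, and it rejects anything else, including IPv6 addresses. The address in the example comes from a range reserved for documentation.

```shell
#!/bin/sh
# to_authorized_cidr IP (hypothetical helper)
# Checks that IP is a dotted-quad IPv4 address and prints it with a /32
# suffix, the form used for authorized_cidr in the blueprint.
to_authorized_cidr() {
  cidr_ip=$1
  # Reject empty input, characters other than digits and dots,
  # and leading, trailing, or doubled dots.
  case "$cidr_ip" in
    ''|*[!0-9.]*|.*|*.|*..*)
      echo "not an IPv4 address: $cidr_ip" >&2
      return 1 ;;
  esac
  # Split on dots into this function's positional parameters.
  cidr_old_ifs=$IFS
  IFS=.
  set -- $cidr_ip
  IFS=$cidr_old_ifs
  if [ "$#" -ne 4 ]; then
    echo "not an IPv4 address: $cidr_ip" >&2
    return 1
  fi
  # Each of the four parts must be in the range 0-255.
  for cidr_octet in "$@"; do
    if [ "${#cidr_octet}" -gt 3 ] || [ "$cidr_octet" -gt 255 ]; then
      echo "not an IPv4 address: $cidr_ip" >&2
      return 1
    fi
  done
  echo "$cidr_ip/32"
}

# Example with a documentation-range address.
to_authorized_cidr 203.0.113.7

# With the lookup above, usage might look like this:
#   to_authorized_cidr "$(curl -s ifconfig.me)"
```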
Update the blueprint file
Update the gke-a3-megagpu.yaml blueprint and specify the deployment variables, such as project_id, region, and zone.
vars:
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  authorized_cidr: IP_ADDRESS/32
Replace the following:
- PROJECT_ID: your project ID.
- REGION: a region that has a3-megagpu-8g machine types.
- ZONE: a zone that has a3-megagpu-8g machine types.
- IP_ADDRESS: the IP address of your host machine.
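A common mistake at this step is pairing a zone with a region that it doesn't belong to. The following sketch prints the vars block from your values and refuses to do so if the zone and region don't match. The helper name render_vars is hypothetical, the check assumes that a zone name is its region plus one final hyphen-separated suffix, and the example values are placeholders.

```shell
#!/bin/sh
# render_vars PROJECT_ID REGION ZONE IP_ADDRESS (hypothetical helper)
# Prints the vars block shown above with the given values filled in.
render_vars() {
  if [ "$#" -ne 4 ]; then
    echo "usage: render_vars PROJECT_ID REGION ZONE IP_ADDRESS" >&2
    return 1
  fi
  # Assumption: a zone is named REGION plus a final "-suffix", so
  # stripping the last hyphen-separated part must give the region.
  if [ "${3%-*}" != "$2" ]; then
    echo "zone $3 is not in region $2" >&2
    return 1
  fi
  printf 'vars:\n'
  printf '  project_id: %s\n' "$1"
  printf '  region: %s\n' "$2"
  printf '  zone: %s\n' "$3"
  printf '  authorized_cidr: %s/32\n' "$4"
}

# Placeholder values for illustration only.
render_vars my-project us-central1 us-central1-a 203.0.113.7
```

The printed block is meant to be compared with, or copied into, the vars section of the blueprint. The sketch doesn't edit the file for you.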
Build the Cluster Toolkit binary
To build the Cluster Toolkit binary from source, from Cloud Shell run the following command:
make
To verify the build, from Cloud Shell run the following command:
./gcluster --version
The output shows you the version of the Cluster Toolkit that you are using.
Provision the GKE cluster
You can now run the gke-a3-megagpu.yaml blueprint to provision the cluster. This step takes approximately 5-10 minutes.
To provision the cluster, run the following command from the Cluster Toolkit directory.
./gcluster deploy examples/gke-a3-megagpu.yaml --vars project_id=PROJECT_ID
Replace PROJECT_ID with your project ID.
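Before you deploy, you might want to confirm that no placeholder from this page is left in the blueprint. The following sketch is a heuristic, not a Cluster Toolkit feature: the hypothetical helper check_blueprint searches a file for the whole words PROJECT_ID, REGION, ZONE, and IP_ADDRESS. It can report a false positive if the blueprint uses one of those words for another purpose, for example in a comment.

```shell
#!/bin/sh
# check_blueprint FILE (hypothetical pre-flight check)
# Fails if FILE still contains one of the placeholder tokens used on this
# page, and prints the matching lines with their line numbers.
check_blueprint() {
  if [ ! -f "$1" ]; then
    echo "no such file: $1" >&2
    return 1
  fi
  # -w matches whole words only, so "project_id" or "zone" do not match.
  if grep -n -w -E 'PROJECT_ID|REGION|ZONE|IP_ADDRESS' "$1" >&2; then
    echo "replace the placeholders listed above before deploying" >&2
    return 1
  fi
  return 0
}

# Example usage, with my-project standing in for your project ID:
#   check_blueprint examples/gke-a3-megagpu.yaml && \
#     ./gcluster deploy examples/gke-a3-megagpu.yaml --vars project_id=my-project
```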
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, delete the Google Cloud project with the resources.
Destroy the GKE cluster
To destroy all resources associated with creating the GKE cluster, from Cloud Shell run the following command:
./gcluster destroy gke-a3-megagpu.yaml
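As noted in the reservation section, a reservation keeps incurring costs after the GKE cluster is destroyed. If you created one for this deployment and no longer need it, the following sketch prints the matching delete command for review. The helper name print_reservation_delete_cmd is hypothetical, the example values are placeholders, and nothing is deleted until you run the printed command yourself.

```shell
#!/bin/sh
# print_reservation_delete_cmd NAME PROJECT_ID ZONE (hypothetical helper)
# Prints the gcloud command that deletes a reservation.
# It does not call gcloud itself.
print_reservation_delete_cmd() {
  if [ "$#" -ne 3 ]; then
    echo "usage: print_reservation_delete_cmd NAME PROJECT_ID ZONE" >&2
    return 1
  fi
  printf 'gcloud compute reservations delete %s \\\n' "$1"
  printf '    --project=%s \\\n' "$2"
  printf '    --zone=%s\n' "$3"
}

# Placeholder values for illustration only.
print_reservation_delete_cmd my-reservation my-project us-central1-a
```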