This document outlines the deployment steps for provisioning A3 Ultra (a3-ultragpu-8g) VMs that run on Hypercompute Cluster and use Slurm as an orchestrator.
For more information about Hypercompute Cluster, see
Hypercompute Cluster.
Before you begin
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Ensure that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity. If you don't have enough quota, request a quota increase.
Ensure that you have reserved blocks of A3 Ultra machines. To get these reserved blocks of resources, see Request capacity.
To provision Slurm clusters, you must use Cluster Toolkit version v1.44.1 or later. To install Cluster Toolkit, see Set up Cluster Toolkit.
Overview
To deploy the cluster, you must complete the following:
- Set up a Cloud Storage bucket. See Set up Cloud Storage bucket.
- Switch to the Cluster Toolkit directory. See Switch to the Cluster Toolkit directory.
- Create a deployment file. See Create a deployment file.
- Provision the cluster. See Provision the cluster.
- Connect to the Slurm cluster. See Connect to the Slurm cluster.
Set up a Cloud Storage bucket
Cluster blueprints use Terraform modules to provision Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in a location that has versioning enabled. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.
To create this bucket from the CLI, run the following command:
gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD --location=REGION \
    --uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following:
- BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
- PROJECT_ID: your project ID.
- REGION: a region that has a3-ultragpu-8g machine types.
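Before you create the bucket, it can help to sanity-check the name locally. The following is a minimal sketch, not the full set of bucket naming requirements: it assumes only the common rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). The function name is_plausible_bucket_name is hypothetical.

```shell
# Simplified local pre-check for a Cloud Storage bucket name.
# This covers only a subset of the naming rules (assumption):
#   - 3 to 63 characters
#   - lowercase letters, digits, dashes, underscores, and dots only
#   - starts and ends with a letter or digit
is_plausible_bucket_name() {
  # LC_ALL=C makes [a-z] match ASCII lowercase letters only.
  printf '%s\n' "$1" |
    LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Example usage (the bucket name is a placeholder you choose):
#   if is_plausible_bucket_name "BUCKET_NAME"; then
#     echo "name looks plausible"
#   else
#     echo "name breaks a basic naming rule" >&2
#   fi
```

A name that passes this check can still be rejected by Cloud Storage, for example if it is already taken or breaks a rule that the sketch does not cover.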
Switch to the Cluster Toolkit directory
After you have installed Cluster Toolkit, ensure that you are in the Cluster Toolkit directory. To go to the main Cluster Toolkit directory, run the following command:
cd cluster-toolkit
This cluster deployment requires Cluster Toolkit v1.44.1 or later. To check your version, run the ./gcluster --version command.
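If you want to script the version requirement, one possible approach is to compare version strings with sort -V (available in GNU coreutils). This is a sketch only: the helper name version_at_least is hypothetical, and because the exact output format of ./gcluster --version is not described here, the sketch expects you to pass in the version string yourself.

```shell
# Succeeds when the installed version is equal to or newer than the
# required version. A leading "v" is accepted on either argument.
version_at_least() {
  local have
  local want
  have="${1#v}"
  want="${2#v}"
  # sort -V orders version strings; if the smaller of the two is the
  # required version, the installed version is equal or newer.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Example usage (copy the version string from the ./gcluster --version output):
#   version_at_least "v1.45.0" "v1.44.1" && echo "Cluster Toolkit is recent enough"
```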
Create a deployment file
Create a deployment file that you can use to specify the Cloud Storage bucket and to set deployment variables such as project_id, region, and zone.
To create your deployment file, use a text editor to create a YAML file named a3ultra-slurm-deployment.yaml and add the following content.
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a3u_reservation_name: RESERVATION_NAME
  a3u_cluster_size: NUMBER_OF_VMS
Replace the following:
- DEPLOYMENT_NAME: a name for your deployment, for example a3ultra-slurm.
- BUCKET_NAME: the name of your Cloud Storage bucket, created in the previous section.
- PROJECT_ID: your project ID.
- REGION: a region that has a3-ultragpu-8g machine types.
- ZONE: a zone that has a3-ultragpu-8g machine types.
- RESERVATION_NAME: the name of your reservation provided by your TAM when you requested capacity.
- NUMBER_OF_VMS: the number of VMs needed for the cluster.
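A placeholder left in the deployment file is an easy mistake to make. The following sketch scans a file for the placeholder names listed above; the helper name find_unreplaced_placeholders is hypothetical, and the check is purely textual (it does not validate the YAML or the values).

```shell
# Prints every line (with its line number) that still contains one of the
# documented placeholders. Succeeds if at least one placeholder is found.
find_unreplaced_placeholders() {
  grep -nwE 'BUCKET_NAME|DEPLOYMENT_NAME|PROJECT_ID|REGION|ZONE|RESERVATION_NAME|NUMBER_OF_VMS' "$1"
}

# Example usage:
#   if find_unreplaced_placeholders a3ultra-slurm-deployment.yaml; then
#     echo "replace the placeholders listed above before deploying" >&2
#   fi
```

The -w flag matches whole words only, and the match is case-sensitive, so lowercase keys such as project_id are not reported.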
Provision the cluster
To provision the cluster, run the following command from the Cluster Toolkit directory. This step takes approximately 5-10 minutes and uses the a3ultra-slurm-blueprint.yaml blueprint.
./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml
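Because the deploy command uses relative paths, it fails if you run it from the wrong directory. As a rough sketch, a pre-flight check like the following confirms that the files named in the command exist. The function name preflight_check is hypothetical; the paths are the ones used in the command above.

```shell
# Checks that the gcluster binary, the blueprint, and the deployment file
# exist relative to the given directory (default: current directory).
preflight_check() {
  local dir
  local status
  dir="${1:-.}"
  status=0
  [ -x "$dir/gcluster" ] ||
    { echo "missing executable: gcluster" >&2; status=1; }
  [ -f "$dir/examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml" ] ||
    { echo "missing blueprint: a3ultra-slurm-blueprint.yaml" >&2; status=1; }
  [ -f "$dir/a3ultra-slurm-deployment.yaml" ] ||
    { echo "missing deployment file: a3ultra-slurm-deployment.yaml" >&2; status=1; }
  return "$status"
}

# Example usage, from the Cluster Toolkit directory:
#   preflight_check && echo "ready to run ./gcluster deploy"
```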
Connect to the Slurm cluster
To access your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.
Console
- Go to the Compute Engine > VM instances page.
- Locate the login node. It should have a name similar to a3ultra-login-001.
- From the Connect column of the login node, click SSH.
gcloud
To connect to the login node, use the gcloud compute ssh command.
gcloud compute ssh $(gcloud compute instances list --filter "name ~ login" --format "value(name)") \
    --tunnel-through-iap \
    --zone ZONE
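The nested command above assumes that exactly one instance in your project has a name that matches login. If several instances match, or none do, the ssh command receives the wrong arguments. A minimal sketch of a guard follows; the helper name select_login_node is hypothetical, and it only filters a list of names that you pipe into it.

```shell
# Reads instance names on stdin (one per line) and prints the login node
# name only if exactly one name contains "login".
select_login_node() {
  local names
  local count
  names="$(grep -E 'login' || true)"
  count="$(printf '%s' "$names" | grep -c . || true)"
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one login node, found $count" >&2
    return 1
  fi
  printf '%s\n' "$names"
}

# Example usage:
#   node="$(gcloud compute instances list --filter "name ~ login" \
#       --format "value(name)" | select_login_node)" &&
#     gcloud compute ssh "$node" --tunnel-through-iap --zone ZONE
```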
Redeploy the Slurm cluster
If you need to increase the number of compute nodes or add new partitions to your cluster, you might need to update the configuration of your Slurm cluster by redeploying. You can speed up redeployment by reusing an existing image from a previous deployment. To avoid creating new images during a redeploy, specify the --only flag.
To redeploy the cluster using an existing image, run:
./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml --only cluster-env,cluster
Use this command only for redeployments where an image already exists. It redeploys only the cluster and its infrastructure.
Test GPUDirect-RDMA performance on the cluster
To test NCCL communication using GPUDirect-RDMA, complete the following steps, which are described in the next sections:
- Create an enroot container and build the NCCL tests. For performing ML training on Slurm clusters, we recommend using enroot and pyxis, which together can be used to create and run containers with Slurm. See Create an enroot container and build NCCL tests.
- Set the GPUDirect-RDMA environment variables and run the NCCL test. See Run NCCL test.
Create an enroot container and build NCCL tests
From the shared directory of the login node, complete the following steps. The shared directory is usually located at ${HOME}.
Download the script needed to build the NCCL test.
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:
sbatch build-nccl-tests.sh
This script runs on one of your a3-ultragpu-8g nodes, which have more CPU and memory than the login node, so enroot can import the container more quickly. When the import completes, you should have a file named nvidia+pytorch+24.09-py3.sqsh in the directory where you ran the command, along with a directory named nccl-tests. The preceding script uses --container-mounts to mount your current working directory ($PWD) into the /nccl directory inside the container.
When the job submitted by the sbatch command finishes, check that the nccl-tests/build folder contains several binaries, including all_gather_perf.
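The check for the build output can be scripted. The following sketch tests only for the all_gather_perf binary named above; the function name check_nccl_build is hypothetical, and the default path assumes that you run it from the directory where you submitted the build job.

```shell
# Succeeds if an executable all_gather_perf exists in the build directory
# (default: nccl-tests/build relative to the current directory).
check_nccl_build() {
  local build_dir
  build_dir="${1:-nccl-tests/build}"
  if [ -x "$build_dir/all_gather_perf" ]; then
    echo "found $build_dir/all_gather_perf"
  else
    echo "all_gather_perf not found in $build_dir" >&2
    return 1
  fi
}

# Example usage:
#   check_nccl_build || echo "check the Slurm output file of the build job" >&2
```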
Run NCCL test
Download the following script to the shared directory of the login node:
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh
To run any job on an A3 Ultra cluster, several environment variables must be set to enable high-performance networking with GPUDirect-RDMA. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment rather than in the host environment. You can inspect these variables in the run-nccl-tests.sh script that you just downloaded.
From the same directory, run the following command:
sbatch run-nccl-tests.sh
This results in a slurm-XX.out file that contains the result of the NCCL all_gather_perf benchmark.
The output is similar to the following:
#
#                                                        out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   268435456       4194304     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
   536870912       8388608     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
  1073741824      16777216     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
  2147483648      33554432     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
  4294967296      67108864     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
  8589934592     134217728     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : XXX.XX
#
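To pull the summary figure out of the job output, you can filter for the Avg bus bandwidth line. This is a small sketch that assumes the summary line has the form shown in the sample output, with the value as the last field; the function name avg_bus_bandwidth is hypothetical.

```shell
# Prints the value from the "# Avg bus bandwidth : <value>" summary line
# of an NCCL test output file.
avg_bus_bandwidth() {
  awk '/Avg bus bandwidth/ { print $NF }' "$1"
}

# Example usage (replace XX with the job ID in your output file name):
#   avg_bus_bandwidth slurm-XX.out
```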
What's next
- View VMs topology
- Learn how to manage host events
- Monitor VMs in your Slurm cluster
- Report faulty host