Create a Slurm cluster

This document outlines the deployment steps for provisioning A3 Ultra (a3-ultragpu-8g) VMs that run on Hypercompute Cluster and use Slurm as an orchestrator. For more information about Hypercompute Cluster, see Hypercompute Cluster.

Before you begin

  1. In the Google Cloud console, activate Cloud Shell.

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  2. Ensure that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity. If you don't have enough quota, request a quota increase.

  3. Ensure that you have reserved blocks of A3 Ultra machines. To get these reserved blocks of resources, see Request capacity. To verify that the reservation is available to your project, see the optional check after this list.

  4. To provision Slurm clusters, you must use Cluster Toolkit version v1.44.1 or later. To install Cluster Toolkit, see Set up Cluster Toolkit.
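
To optionally confirm that a reservation is visible to your project before you deploy, you can describe it with the Google Cloud CLI. In the following sketch, RESERVATION_NAME and ZONE are the values provided when you requested capacity, and PROJECT_ID is your project ID.

gcloud compute reservations describe RESERVATION_NAME \
    --project=PROJECT_ID \
    --zone=ZONE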

Overview

To deploy the cluster, you must complete the following:

  1. Set up a Cloud Storage bucket. See Set up Cloud Storage bucket.
  2. Switch to the Cluster Toolkit directory. See Switch to the Cluster Toolkit directory.
  3. Create a deployment file. See Create a deployment file.
  4. Provision the cluster. See Provision the cluster.
  5. Connect to the Slurm cluster. See Connect to the Slurm cluster.

Set up a Cloud Storage bucket

Cluster blueprints use Terraform modules to provision Google Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in versioned storage. On Google Cloud, you can create a Cloud Storage bucket that has object versioning enabled.

To create this bucket from the CLI, run the following command:

gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD --location=REGION \
    --uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning

Replace the following:

  • BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
  • PROJECT_ID: your project ID.
  • REGION: a region that has a3-ultragpu-8g machine types.
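
To confirm that object versioning is turned on, you can optionally describe the bucket and check that the output reports versioning as enabled:

gcloud storage buckets describe gs://BUCKET_NAME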

Switch to the Cluster Toolkit directory

After you have installed the Cluster Toolkit, ensure that you are in the main Cluster Toolkit directory. To go to that directory, run the following command:

cd cluster-toolkit

This cluster deployment requires Cluster Toolkit v1.44.1 or later. To check your version, you can run the ./gcluster --version command.
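
If your clone reports an older version, you can usually update it and rebuild the gcluster binary. The following is a sketch that assumes the repository tags releases in the vX.Y.Z format and that the Cluster Toolkit build prerequisites, such as Go and make, are installed:

git fetch --tags
git checkout v1.44.1
make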

Create a deployment file

Create a deployment file that you can use to specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as the project ID, region, and zone.

To create your deployment file, use a text editor to create a YAML file named a3ultra-slurm-deployment.yaml and add the following content.

---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a3u_reservation_name: RESERVATION_NAME
  a3u_cluster_size: NUMBER_OF_VMS

Replace the following:

  • DEPLOYMENT_NAME: a name for your deployment. For example, a3ultra-slurm.
  • BUCKET_NAME: the name of your Cloud Storage bucket, created in the previous section.
  • PROJECT_ID: your project ID.
  • REGION: a region that has a3-ultragpu-8g machine types.
  • ZONE: a zone that has a3-ultragpu-8g machine types.
  • RESERVATION_NAME: the name of your reservation provided by your TAM when you requested capacity.
  • NUMBER_OF_VMS: the number of VMs needed for the cluster.
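
For reference, a completed deployment file might look similar to the following example. All values shown are illustrative placeholders; substitute your own bucket, project, location, reservation, and cluster size.

---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: my-tf-state-bucket

vars:
  deployment_name: a3ultra-slurm
  project_id: my-project-id
  region: europe-west1
  zone: europe-west1-b
  a3u_reservation_name: my-a3u-reservation
  a3u_cluster_size: 2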

Provision the cluster

To provision the cluster, run the following command from the Cluster Toolkit directory. This step takes approximately 5-10 minutes and uses the a3ultra-slurm-blueprint.yaml blueprint.

./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml

Connect to the Slurm cluster

To access your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.

Console

  1. Go to the Compute Engine > VM instances page.

  2. Locate the login node. It should have a name similar to a3ultra-login-001.

  3. From the Connect column of the login node, click SSH.

gcloud

To connect to the login node, use the gcloud compute ssh command. Replace ZONE with the zone where your cluster's login node is located.

gcloud compute ssh $(gcloud compute instances list --filter "name ~ login" --format "value(name)") \
  --tunnel-through-iap \
  --zone ZONE
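
After you connect, you can optionally confirm that the cluster is healthy by running standard Slurm commands from the login node. For example, sinfo lists the partitions and node states, and the a3-ultragpu-8g compute nodes appear in its output once they have registered with the controller:

sinfo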

Redeploy the Slurm cluster

If you need to increase the number of compute nodes or add new partitions, you can update your Slurm cluster's configuration by redeploying it. You can speed up redeployment by reusing an existing image from a previous deployment. To avoid creating new images during a redeploy, specify the --only flag.

To redeploy the cluster using an existing image, run:

./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml --only cluster-env,cluster

This command is only for redeployments where an image already exists; it redeploys only the cluster and its infrastructure.

Test GPUDirect-RDMA performance on the cluster

To test NCCL communication using GPUDirect-RDMA, complete the following steps, which are described in the next sections:

  1. Create an enroot container and build the NCCL tests. To run ML training on Slurm clusters, we recommend using enroot and pyxis, which together let you create and run containers with Slurm. See Create an enroot container and build NCCL tests.
  2. Set the GPUDirect-RDMA environment variables and run the NCCL test. See Run NCCL test.

Create an enroot container and build NCCL tests

From the shared directory of the login node, complete the following steps. The shared directory is usually located at ${HOME}.

  1. Download the script needed to build the NCCL test.

    wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
  2. After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:

    sbatch build-nccl-tests.sh

    This script runs on one of your a3-ultragpu-8g nodes, which has more CPU and memory than the login node, so enroot can import the container more quickly. When the import completes, you should have a file named nvidia+pytorch+24.09-py3.sqsh in the directory where you ran the command, along with a directory named nccl-tests. The script uses --container-mounts to mount your current working directory $PWD into the /nccl directory inside the container.

  3. When the sbatch command finishes, check that the nccl-tests/build folder contains several binaries, including all_gather_perf.
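
    For example, you can monitor the build job with squeue while it runs, and then list the build directory after the job finishes. Both are standard commands and are shown here only as an optional check:

    squeue
    ls nccl-tests/build/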

Run NCCL test

  1. Download the following script to the shared directory of the login node:

    wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh

    To run any job on an A3 Ultra cluster, you must set several environment variables to enable high-performance networking with GPUDirect-RDMA. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment rather than in the host environment. You can inspect these variables in the run-nccl-tests.sh script that you just downloaded.

  2. From the same directory, run the following command:

    sbatch run-nccl-tests.sh

    This produces a slurm-XX.out file that contains the results of the NCCL all_gather_perf benchmark.

    The output is similar to the following:

    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
       268435456       4194304     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
       536870912       8388608     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
      1073741824      16777216     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
      2147483648      33554432     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
      4294967296      67108864     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
      8589934592     134217728     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : XXX.XX
    #
    
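    To extract only the summary lines from one or more result files, you can use a simple grep, for example:

    grep "Avg bus bandwidth" slurm-*.out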

What's next