Deploy an A3 Mega Slurm cluster for ML training

This document outlines the deployment steps for provisioning an A3 Mega (a3-megagpu-8g) Slurm cluster that is ideal for running large-scale artificial intelligence (AI) and machine learning (ML) training workloads. For an overview of the deployment, see A3 Mega Slurm cluster overview.

Before you begin

  1. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  2. To identify the regions and zones where the a3-megagpu-8g machine type is available, run the following command:

    gcloud compute machine-types list --filter="name=a3-megagpu-8g"
    
  3. Ensure that you have enough GPU quota. Each a3-megagpu-8g machine has 8 NVIDIA H100 80GB GPUs attached, so you need at least 8 NVIDIA H100 80GB GPUs of quota for each VM in your cluster, in your selected region.

    1. To view quotas, see View the quotas for your project. In the Filter field, select Dimensions (e.g. location) and specify gpu_family:NVIDIA_H100_MEGA. You can also check the quota from the CLI, as shown in the example after this list.
    2. If you don't have enough quota, request a higher quota.
  4. Ensure that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity. If you don't have enough quota, request a quota increase.
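
You can check the GPU quota from the CLI by describing the region and searching for the H100 Mega metric. This is a minimal sketch; the metric name shown (NVIDIA_H100_MEGA_GPUS) is an assumption based on the gpu_family filter above, so verify it against what the console displays.

gcloud compute regions describe REGION --format=json | grep -B 2 -A 2 "NVIDIA_H100_MEGA"

Replace REGION with a region that has a3-megagpu-8g machine types. Each quota entry in the output includes the metric name, its limit, and its current usage.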

Overview

To deploy the cluster, you must complete the following:

  1. Install Cluster Toolkit.
  2. Switch to the Cluster Toolkit directory.
  3. Set up Cloud Storage bucket.
  4. Create a reservation or get a reservation name from your Technical Account Manager (TAM).
  5. Update the base deployment file with the deployment variables needed for Virtual Private Cloud and Filestore setup.
  6. Set up Virtual Private Cloud and Filestore.
  7. Update the cluster deployment file with the additional variables needed for the cluster deployment.
  8. Build the custom OS image.
  9. Deploy the cluster.

Install Cluster Toolkit

From the CLI, complete the following steps:

  1. Install dependencies.

  2. Set up Cluster Toolkit, as sketched below.
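
A minimal sketch of both steps follows, assuming that Git, Go, Terraform, and Packer are already available in your environment (Cloud Shell provides most of these). See the linked pages above for the complete, authoritative instructions.

# Clone the Cluster Toolkit repository and build the gcluster binary.
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
make

# Confirm that the build succeeded.
./gcluster --version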

Switch to the Cluster Toolkit directory

After you have installed the Cluster Toolkit, ensure that you are in the Cluster Toolkit directory.

To go to the main Cluster Toolkit working directory, run the following command.

cd cluster-toolkit

Set up Cloud Storage bucket

Cluster blueprints use Terraform modules to provision Google Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in versioned storage. On Google Cloud, you can do this by creating a Cloud Storage bucket that has Object Versioning enabled.

To create this bucket from the CLI, run the following command:

gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD --location=REGION \
    --uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning

Replace the following:

  • BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
  • PROJECT_ID: your project ID.
  • REGION: a region that has a3-megagpu-8g machine types.

Create a reservation

If you don't have a reservation provided by a Technical Account Manager (TAM), we recommend creating a reservation. For more information, see Reservations.

Reservations incur ongoing costs even after the Slurm cluster is destroyed. To manage your costs, delete the reservation when you no longer need it.

To create a reservation, run the gcloud compute reservations create command and ensure that you specify the --require-specific-reservation flag.

gcloud compute reservations create RESERVATION_NAME \
    --require-specific-reservation \
    --project=PROJECT_ID \
    --machine-type=a3-megagpu-8g \
    --vm-count=NUMBER_OF_VMS \
    --zone=ZONE

Replace the following:

  • RESERVATION_NAME: a name for your reservation.
  • PROJECT_ID: your project ID.
  • NUMBER_OF_VMS: the number of VMs needed for the cluster.
  • ZONE: a zone that has a3-megagpu-8g machine types.
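
To confirm that the reservation was created with the expected machine type and VM count, you can optionally describe it:

gcloud compute reservations describe RESERVATION_NAME \
    --project=PROJECT_ID \
    --zone=ZONE

The output includes the reservation's status and the number of reserved VMs.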

Update the base deployment file

Update the deployment-base.yaml deployment file to specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as the project ID, region, and zone.

  1. Use an editor to add the following content to the deployment-base.yaml deployment file. This file is located in the A3 Mega directory: examples/machine-learning/a3-megagpu-8g/.

    ---
    terraform_backend_defaults:
      type: gcs
      configuration:
        bucket: BUCKET_NAME
    
    vars:
      deployment_name: a3mega-base
      project_id: PROJECT_ID
      region: REGION
      zone: ZONE
      network_name_system: NETWORK_NAME
      subnetwork_name_system: SUBNETWORK_NAME
    

    Replace the following:

    • BUCKET_NAME: the name of your Cloud Storage bucket.
    • PROJECT_ID: your project ID.
    • REGION: a region that has a3-megagpu-8g machine types.
    • ZONE: a zone that has a3-megagpu-8g machine types.
    • NETWORK_NAME: a name for your network. For example, sys-net.
    • SUBNETWORK_NAME: a name for your subnetwork. For example, sys-subnet.

Set up Virtual Private Cloud and Filestore

Create a Virtual Private Cloud network and a Filestore /home file system by using the slurm-a3mega-base.yaml cluster blueprint.

To set up the Virtual Private Cloud network and Filestore file system from the CLI, run the following command from the Cluster Toolkit directory. This step takes approximately 5 minutes.

./gcluster deploy -d examples/machine-learning/a3-megagpu-8g/deployment-base.yaml \
    examples/machine-learning/a3-megagpu-8g/slurm-a3mega-base.yaml --auto-approve

The output is similar to the following. Make note of the server_ip value as you need to add it to the deployment file in the next step.

network_name_sysnet = "sys-net"
network_storage_homefs = {
  "client_install_runner" = {
    "destination" = "install-nfs_home.sh"
    "source" = "modules/embedded/modules/file-system/filestore/scripts/install-nfs-client.sh"
    "type" = "shell"
  }
  "fs_type" = "nfs"
  "local_mount" = "/home"
  "mount_options" = "defaults,_netdev"
  "mount_runner" = {
    "args" = "\"10.224.153.226\" \"/nfsshare\" \"/home\" \"nfs\" \"defaults,_netdev\""
    "destination" = "mount_home.sh"
    "source" = "modules/embedded/modules/file-system/filestore/scripts/mount.sh"
    "type" = "shell"
  }
  "remote_mount" = "/nfsshare"
  "server_ip" = "192.168.0.2"
}
subnetwork_name_sysnet = "sys-subnet"
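
If you need to look up the server_ip again later, one option is to read it back from the Terraform state of the base deployment. This is a sketch that assumes the default folder layout that gcluster deploy creates (a deployment folder named a3mega-base containing a group subfolder, shown here as primary); check the actual subfolder name in your Cluster Toolkit directory.

# The group subfolder name (primary) is an assumption; adjust it to match
# the folder that gcluster deploy created under a3mega-base/.
terraform -chdir=a3mega-base/primary output network_storage_homefs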

Update the cluster deployment file

Update the deployment-image-cluster.yaml deployment file to specify your reservation name, maintenance interval, cluster size, and Filestore IP address.

Use an editor to add the following content to the deployment-image-cluster.yaml deployment file. This file is located in the A3 Mega directory: examples/machine-learning/a3-megagpu-8g/.

---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  network_name_system: NETWORK_NAME
  subnetwork_name_system: SUBNETWORK_NAME
  enable_ops_agent: true
  enable_nvidia_dcgm: true
  disk_size_gb: 200
  final_image_family: slurm-a3mega
  slurm_cluster_name: a3mega
  a3mega_reservation_name: RESERVATION_NAME
  a3mega_maintenance_interval: MAINTENANCE_INTERVAL
  a3mega_cluster_size: NUMBER_OF_VMS
  server_ip_homefs: HOMEFS_IP

Replace the following:

  • RESERVATION_NAME: either the name of your reservation provided by your TAM or the reservation you created in the previous step.
  • MAINTENANCE_INTERVAL: specify one of the following:
    • If you are using a Google-provided reservation, set a3mega_maintenance_interval: PERIODIC
    • If you created your own reservation, set a3mega_maintenance_interval: "". This sets the maintenance interval to an empty string, which is the default value.
  • NUMBER_OF_VMS: the number of VMs needed for the cluster.
  • HOMEFS_IP: the IP address of the Filestore instance for /home. This must match the server_ip value in the output from deploying the slurm-a3mega-base.yaml blueprint. For example, in the preceding sample output, that value is 192.168.0.2.

Build the custom OS image

Build the custom Debian 12 OS image by using the slurm-a3mega-image.yaml cluster blueprint. To build the custom OS image from the CLI, run the following command from the Cluster Toolkit directory.

./gcluster deploy -d examples/machine-learning/a3-megagpu-8g/deployment-image-cluster.yaml \
    examples/machine-learning/a3-megagpu-8g/slurm-a3mega-image.yaml --auto-approve
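
When the image build finishes, you can optionally confirm that the custom image exists by describing the image family set in the deployment file (final_image_family: slurm-a3mega):

gcloud compute images describe-from-family slurm-a3mega \
    --project=PROJECT_ID

If the image family isn't found, list the images in your project with gcloud compute images list to check the name that was actually created.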

Make additional updates

If you have multiple reservations, you can update the deployment files to specify the additional reservations. To do this, see Scale A3 Mega clusters across multiple reservations.

Provision the Slurm cluster

You can now run the slurm-a3mega-cluster.yaml blueprint to provision the cluster. This step takes approximately 5-10 minutes.

To provision the cluster, run the following command from the Cluster Toolkit directory.

./gcluster deploy -d examples/machine-learning/a3-megagpu-8g/deployment-image-cluster.yaml \
    examples/machine-learning/a3-megagpu-8g/slurm-a3mega-cluster.yaml --auto-approve
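
When the deployment completes, you can optionally list the cluster's VMs, including the controller and login nodes, by filtering on the Slurm cluster name (a3mega in this deployment):

gcloud compute instances list \
    --project=PROJECT_ID \
    --filter="name ~ a3mega"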

Connect to the A3 Mega Slurm cluster

To enable optimized NCCL communication tuning on your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.

Console

  1. Go to the Compute Engine > VM instances page.

    Go to VM instances

  2. Locate the login node. It should have a name similar to a3mega-login-001.

  3. From the Connect column of the login node, click SSH.

gcloud

To connect to the login node, use the gcloud compute ssh command.

gcloud compute ssh $(gcloud compute instances list --filter "name ~ login" --format "value(name)") \
  --tunnel-through-iap \
  --zone ZONE

Replace ZONE with the zone where your login node is located.

After you connect to the login node, you can then Enable GPUDirect-TCPXO optimized NCCL communication.
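
As a quick sanity check, you can also verify from the login node that the Slurm controller is reachable and that the compute nodes are registered. The exact partition and node names depend on the blueprint, so treat the commands below as a sketch.

# List Slurm partitions and node states.
sinfo

# Run a trivial job on one node to confirm that scheduling works. If nodes
# are still being provisioned, this can take a few minutes to start.
srun -N 1 hostname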