Create an RDMA-enabled HPC Slurm cluster with H4D instances
The H4D machine series is specifically designed to meet the needs of demanding HPC workloads. H4D offers instances with improved workload scalability through RDMA-enabled 200 Gbps networking. For more information on H4D compute-optimized machine types on Google Cloud, see H4D machine series.
Tutorial overview
This tutorial describes the steps to set up an HPC-optimized Slurm cluster using H4D machine types with RDMA. Specifically, you set up a cluster with Compute Engine virtual machines, create a Cloud Storage bucket to store the necessary Terraform modules, and set up a Filestore instance to provision your Slurm cluster. To complete the steps in this tutorial, you follow this process:
- Set up your Google Cloud project with the required permissions and environmental variables.
- Set up a Cloud Storage bucket.
- Set up Cluster Toolkit.
- Switch to the Cluster Toolkit directory.
- Create a Slurm deployment YAML file.
- Provision a Slurm cluster using a blueprint.
- Connect to the Slurm cluster.
Before you begin
- Request a reserved capacity block for two H4D instances with 192 vCPUs each.
- Verify that you have enough Filestore quota to provision the Slurm cluster. You need a minimum of 5,120 GiB of zonal capacity with a Filestore service tier of Basic SSD.
  - To check your Filestore quota, view Quotas & System limits in the Google Cloud console and filter the table to only show Filestore resources. For detailed instructions on checking Filestore quotas, see View API-specific quota.
  - If you don't have enough quota, request a quota increase.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Compute Engine, Filestore, Cloud Storage, Service Usage, and Cloud Resource Manager APIs, as shown in the command after this list.
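You can enable all five APIs from Cloud Shell with a single command. This is a minimal sketch that uses the standard service endpoints for these products:
gcloud services enable \
    compute.googleapis.com \
    file.googleapis.com \
    storage.googleapis.com \
    serviceusage.googleapis.com \
    cloudresourcemanager.googleapis.com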
Costs
The cost of running this tutorial varies by each section you complete, such as setting up the tutorial or running jobs. You can calculate the cost by using the pricing calculator.
To estimate the cost for setting up this tutorial, use the following specifications:
- Filestore (Basic SSD) capacity per region: 5,120 GiB.
- Standard persistent disk: 50 GB pd-standard for the Slurm login node.
- Performance (SSD) persistent disk: 50 GB pd-ssd for the Slurm controller.
- VM instances: two H4D instances created with the h4d-highmem-192, h4d-standard-192, or h4d-highmem-192-lssd machine type.
- Hyperdisk Balanced volumes: 50 GiB for each H4D instance.
Launch Cloud Shell
In this tutorial, you use Cloud Shell, which is a shell environment for managing resources hosted on Google Cloud.
Cloud Shell comes preinstalled with the Google Cloud CLI. The gcloud CLI provides the primary command-line interface for Google Cloud. To launch Cloud Shell:
Go to the Google Cloud console.
From the upper-right corner of the console, click the Activate Cloud Shell button:
A Cloud Shell session starts and displays a command-line prompt.
You use this shell to run gcloud and Cluster Toolkit commands.
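To confirm that the gcloud CLI is available in your session, you can print its version, for example:
gcloud --version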
Set environment variables
In Cloud Shell, set the following environment variables to use for the remainder of the tutorial. These environment variables set placeholder values for the following tasks:
- Configuring your project with the relevant values to access your reserved H4D instances.
- Setting up a Cloud Storage bucket to store Cluster Toolkit modules.
Reservation capacity variables
export H4D_RESERVATION_PROJECT_ID=H4D_RESERVATION_PROJECT_ID
export H4D_RESERVATION_NAME=H4D_RESERVATION_NAME
export H4D_DEPLOYMENT_NAME=H4D_DEPLOYMENT_NAME
export H4D_REGION=H4D_REGION
export H4D_ZONE=H4D_ZONE
export H4D_DEPLOYMENT_FILE_NAME=H4D_DEPLOYMENT_FILE_NAME
Replace the following:
- H4D_RESERVATION_PROJECT_ID: the Google Cloud project ID that was granted the H4D machine type reservation block.
- H4D_RESERVATION_NAME: the name of your VM reservation block, found in your project. For example, h4d-highmem-exr.
- H4D_DEPLOYMENT_NAME: a unique name for your Slurm cluster deployment. For example, h4d-hpc-slurm-cluster-deployment.
- H4D_REGION: the region that contains the reserved H4D machine reservation block. For example, us-central1.
- H4D_ZONE: the zone that contains the reserved machines. This string must contain both the region and zone. For example, us-central1-a.
- H4D_DEPLOYMENT_FILE_NAME: a unique name for your Slurm blueprint YAML file. If you run through this tutorial more than once, choose a unique deployment name each time.
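For example, a completed set of exports might look like the following. The project ID and file name here are illustrative; use your own values:
export H4D_RESERVATION_PROJECT_ID=my-hpc-project                    # illustrative project ID
export H4D_RESERVATION_NAME=h4d-highmem-exr
export H4D_DEPLOYMENT_NAME=h4d-hpc-slurm-cluster-deployment
export H4D_REGION=us-central1
export H4D_ZONE=us-central1-a
export H4D_DEPLOYMENT_FILE_NAME=h4d-slurm-deployment                # illustrative file name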
Storage capacity variables
Create the environment variables for your Cloud Storage bucket.
Cluster Toolkit uses blueprints to define and deploy clusters of VMs. A blueprint defines one or more Terraform modules to provision Cloud infrastructure. This bucket is used to store these blueprints.
export GOOGLE_CLOUD_BUCKET_NAME=GOOGLE_CLOUD_BUCKET_NAME
export GOOGLE_CLOUD_BUCKET_LOCATION=GOOGLE_CLOUD_BUCKET_LOCATION
Replace the following:
- GOOGLE_CLOUD_BUCKET_NAME: the name that you want to use for your Cloud Storage bucket. The name must meet the bucket naming requirements.
- GOOGLE_CLOUD_BUCKET_LOCATION: any Google Cloud region of your choice, where the bucket will be hosted. For example, us-central1.
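For example, with illustrative values (bucket names must be globally unique):
export GOOGLE_CLOUD_BUCKET_NAME=my-hpc-toolkit-bucket    # illustrative; choose a globally unique name
export GOOGLE_CLOUD_BUCKET_LOCATION=us-central1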
Switch to the project with the reserved H4D capacity block
Run the following command to ensure that you are in the Google Cloud project that has the approved reservation block for the H4D instances.
gcloud config set project ${H4D_RESERVATION_PROJECT_ID}
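To confirm that the active project changed, you can print the current setting, for example:
gcloud config get-value project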
Create a Cloud Storage bucket
A best practice when working with Terraform is to store the state remotely in a version-enabled file. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.
To create the bucket that stores your Terraform modules, run the following commands from Cloud Shell, using your environment variables:
gcloud storage buckets create gs://${GOOGLE_CLOUD_BUCKET_NAME} \
    --project=${H4D_RESERVATION_PROJECT_ID} \
    --default-storage-class=STANDARD \
    --location=${GOOGLE_CLOUD_BUCKET_LOCATION} \
    --uniform-bucket-level-access

gcloud storage buckets update gs://${GOOGLE_CLOUD_BUCKET_NAME} --versioning
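To verify that the bucket exists and that versioning is turned on, you can describe it and check the versioning field in the output, for example:
gcloud storage buckets describe gs://${GOOGLE_CLOUD_BUCKET_NAME}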
Set up the Cluster Toolkit
To create a Slurm cluster in a Google Cloud project, you can use Cluster Toolkit to handle deploying and provisioning the cluster. Cluster Toolkit is open-source software offered by Google Cloud to simplify the process of deploying workloads on Google Cloud.
Use the following steps to set up Cluster Toolkit.
Clone the Cluster Toolkit GitHub repository
In Cloud Shell, clone the GitHub repository:
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
Go to the main working directory:
cd cluster-toolkit/
Build the Cluster Toolkit binary
In Cloud Shell, build the Cluster Toolkit binary from source by running the following command:
make
To verify the build, run the following command:
./gcluster --version
To deploy an HPC-optimized Slurm cluster that uses H4D instances, you must use version v1.47.0 or later of the Cluster Toolkit.
After building the binary, you are now ready to deploy clusters to run your jobs or workloads.
Create a deployment file
In the Cluster Toolkit directory, create your Slurm deployment YAML file.
nano ${H4D_DEPLOYMENT_FILE_NAME}.yaml
Paste the following content into the YAML file, replacing each placeholder with the corresponding value that you set in the environment variables earlier:
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: GOOGLE_CLOUD_BUCKET_NAME

vars:
  deployment_name: H4D_DEPLOYMENT_NAME
  project_id: H4D_RESERVATION_PROJECT_ID
  region: H4D_REGION
  zone: H4D_ZONE
To save and exit the file, press Ctrl+O, then Enter, then Ctrl+X.
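For example, using the illustrative values from earlier, the finished file might look like this:
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: my-hpc-toolkit-bucket          # illustrative bucket name

vars:
  deployment_name: h4d-hpc-slurm-cluster-deployment
  project_id: my-hpc-project               # illustrative project ID
  region: us-central1
  zone: us-central1-a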
Provision the Slurm cluster
To provision the Slurm cluster, run the following deployment command. This command provisions the Slurm cluster with the H4D Cluster Toolkit blueprint.
In Cloud Shell, start the cluster creation.
./gcluster deploy -d ${H4D_DEPLOYMENT_FILE_NAME}.yaml examples/hpc-slurm-h4d.yaml --auto-approve
Connect to the cluster
After the deployment completes, view your cluster in the Google Cloud console.
Go to the Compute Engine > VM instances page in the Google Cloud console.
Locate the login node, which has a name similar to example-deployment-login-123.
Click SSH to connect.
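Alternatively, you can connect from Cloud Shell with the gcloud CLI. The login node name below is illustrative; substitute the name shown on the VM instances page:
gcloud compute ssh example-deployment-login-123 \
    --zone=${H4D_ZONE} \
    --project=${H4D_RESERVATION_PROJECT_ID}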
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
Delete the Slurm cluster
We recommend that you clean up your resources when they are no longer needed.
Disable deletion protection
To disable deletion protection on a Filestore instance, update the instance with a command similar to the following:
gcloud filestore instances update INSTANCE_NAME \
    --no-deletion-protection
Replace the following:
- INSTANCE_NAME: the name of the instance that you want to edit. For example, my-genomics-instance.
To find the INSTANCE_NAME, you can run the gcloud filestore instances list command. This command lists all the Filestore instances in your current Google Cloud project, including their names, locations (zones), tiers, capacity, and status. After running the command, find the Filestore instance that was created for your Slurm cluster deployment in this tutorial.
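For example, assuming the list output shows an instance named h4dslurm-fs (the name here is illustrative), you could disable its deletion protection as follows. If the command can't resolve the instance on its own, you might also need to pass its location with --location:
gcloud filestore instances list
gcloud filestore instances update h4dslurm-fs \
    --location=${H4D_ZONE} \
    --no-deletion-protection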
Destroy the Slurm cluster
Before running the destroy command, go to the root of the Cluster Toolkit directory. By default, the deployment folder, named after your deployment, is created at the root of the Cluster Toolkit directory.
To destroy the cluster, run:
./gcluster destroy ${H4D_DEPLOYMENT_NAME} --auto-approve
When the cluster destruction is complete, you see a message similar to the following:
Destroy complete! Resources: xx destroyed.
Delete the storage bucket
Delete the Cloud Storage bucket after you make sure that the previous command ended without errors:
gcloud storage buckets delete gs://${GOOGLE_CLOUD_BUCKET_NAME}
Troubleshooting
Error: Cloud Shell can't provision the cluster because there is no storage left.
You might see this error if you are a frequent user of Cloud Shell and you have run out of storage space.
To resolve this issue, see Disable or reset Cloud Shell.
Error: Cluster or blueprint name already exists.
You might see this error if your project has already used the exact names used in this tutorial, for example, if someone else in your organization ran through this tutorial end-to-end.
To resolve this issue, choose a unique name for the deployment file, then rerun the command in the Provision the Slurm cluster section with the new deployment file.
What's next
- Advanced Slurm tasks:
- Learn how to Redeploy the Slurm cluster
- Learn how to Test network performance on the Slurm cluster
- Learn how to manage host events:
- View VM topology
- Monitor VMs in your Slurm cluster
- Report a faulty host