This tutorial shows how to deploy a Slurm cluster on Compute Engine. Slurm is a popular workload manager used in many high performance computing centers. For a discussion of high performance computing terminology and use cases, see Using clusters for large-scale technical computing in the cloud.
The following diagram illustrates the configuration that you create in this tutorial.
As a cluster workload manager, Slurm does the following:
- Allocates resources (compute nodes) to users.
- Provides a structure for starting, executing, and monitoring work on the nodes.
- Manages a queue of pending work.
The deployment developed for this tutorial is a simplified cluster architecture that supports a few users running a single-app workload. You can use other configurations for multi-workload deployments, large-scale production deployments, and hybrid deployments that work with on-premises or multi-cloud clusters, but these configurations are outside the scope of this tutorial.
In this tutorial, you interact with the system by using the login (head) node. After the cluster is deployed, you connect to the login node by using SSH, install the apps, and use Slurm command-line tools to submit jobs for computation. The Slurm scheduler, running on the controller node, schedules the queued jobs by matching available resources with the job requirements and manages the execution of the jobs on the compute nodes. The NFS server provides a common shared space for files.
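A job submission from the login node typically looks like the following minimal sketch; the script name hello.sh and its contents are illustrative and are not part of this tutorial's deployment:
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
srun hostname
You would save this as hello.sh and submit it with sbatch hello.sh. The verification steps later in this tutorial use the equivalent one-line form, sbatch --wrap.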
Objectives
- Use Terraform to deploy a Slurm cluster.
- Run a job using Slurm.
- Query cluster information and monitor running jobs in Slurm.
- Autoscale nodes to accommodate job parameters and requirements.
Costs
This tutorial uses the following billable components of Google Cloud:
- Compute Engine
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
- Enable the Compute Engine API.
- In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
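To confirm that Cloud Shell is set to the project you intend to use for this tutorial, you can check the active configuration (an optional sanity check, not a required step):
gcloud config get-value project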
Deploying the Slurm cluster
In this section, you prepare the cluster configuration, use Terraform to deploy the Slurm cluster to Google Cloud, and then verify that the cluster is operational.
Prepare the cluster configuration
In Cloud Shell, clone the slurm-gcp GitHub repository:
git clone https://github.com/SchedMD/slurm-gcp.git
Define the following environment variables:
export CLUSTER_NAME="cluster-name"
export CLUSTER_ZONE="cluster-zone"
Replace the following (an example follows this list):
- cluster-name: a unique name for your cluster.
- cluster-zone: the zone where you deploy the cluster.
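For example, for a hypothetical cluster named demo-cluster deployed in the us-central1-a zone, you might set:
export CLUSTER_NAME="demo-cluster"
export CLUSTER_ZONE="us-central1-a"
Choose a name and zone that fit your own project; these values are only illustrative.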
In the slurm-gcp/tf/examples/basic folder, copy the basic.tfvars.example file to the basic.tfvars file:
cd slurm-gcp/tf/examples/basic
cp basic.tfvars.example basic.tfvars
Use the sed command to modify the basic.tfvars file for your environment:
sed -i "s/\(cluster_name.*= \)\"\(.*\)\"/\1\"${CLUSTER_NAME}\"/" basic.tfvars
sed -i "s/<project>/$(gcloud config get-value core/project)/" basic.tfvars
sed -i "s/\(zone.*= \)\"\(.*\)\"/\1\"${CLUSTER_ZONE}\"/" basic.tfvars
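To confirm that the substitutions took effect, you can print the affected lines before deploying (an optional check; the exact formatting of basic.tfvars can differ between versions of the repository):
grep -E "cluster_name|project|zone" basic.tfvars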
(Optional) In a text editor, customize the basic.tfvars file for your environment by changing preset values (a scripted alternative is sketched after this list):
- machine_type: To use a different machine type for the cluster's compute nodes, change this value. For example, change this value if you need more CPU cores or memory than are available with the default n1-standard-2 machine type. For more information, see Machine types.
- compute_disk_type: To use SSDs on compute nodes, change this value to pd-ssd.
- compute_disk_size_gb: To use larger or smaller disks attached to your compute nodes, change this value.
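If you prefer to script these edits rather than use a text editor, you can reuse the same kind of sed substitutions shown earlier. The following lines are a sketch only and assume that each setting appears in basic.tfvars as a plain key = value line; adjust them to match the actual layout of your file:
sed -i 's/\(machine_type.*= \)"\(.*\)"/\1"n1-standard-4"/' basic.tfvars
sed -i 's/\(compute_disk_type.*= \)"\(.*\)"/\1"pd-ssd"/' basic.tfvars
sed -i 's/\(compute_disk_size_gb.*= \)[0-9]*/\1100/' basic.tfvars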
(Optional) In a text editor, further customize the basic.tfvars file for your environment by uncommenting additional Terraform variable declarations.
Deploy the Slurm cluster using Terraform
In Cloud Shell, initialize the working directory for Terraform:
terraform init
Use Terraform to deploy your cluster to Google Cloud:
terraform apply -var-file=basic.tfvars -auto-approve
The output is similar to the following:
module.slurm_cluster_controller.data.google_compute_default_service_account.default: Refreshing state...
module.slurm_cluster_login.data.google_compute_default_service_account.default: Refreshing state...
module.slurm_cluster_network.google_compute_network.cluster_network[0]: Creating...
module.slurm_cluster_network.google_compute_network.cluster_network[0]: Still creating... [10s elapsed]
module.slurm_cluster_network.google_compute_network.cluster_network[0]: Still creating... [20s elapsed]
module.slurm_cluster_network.google_compute_network.cluster_network[0]: Creation complete after 22s [id=projects/myproject/global/networks/mycluster-network]
module.slurm_cluster_network.google_compute_router.cluster_router[0]: Creating...
module.slurm_cluster_network.google_compute_subnetwork.cluster_subnet[0]: Creating...
module.slurm_cluster_network.google_compute_firewall.cluster_iap_ssh_firewall[0]: Creating...
module.slurm_cluster_network.google_compute_firewall.cluster_internal_firewall[0]: Creating...
module.slurm_cluster_network.google_compute_router.cluster_router[0]: Still creating... [10s elapsed]
module.slurm_cluster_network.google_compute_subnetwork.cluster_subnet[0]: Still creating... [10s elapsed]
module.slurm_cluster_network.google_compute_firewall.cluster_iap_ssh_firewall[0]: Still creating... [10s elapsed]
module.slurm_cluster_network.google_compute_firewall.cluster_internal_firewall[0]: Still creating... [10s elapsed]
module.slurm_cluster_network.google_compute_firewall.cluster_internal_firewall[0]: Creation complete after 11s [id=projects/myproject/global/firewalls/mycluster-allow-internal]
module.slurm_cluster_network.google_compute_firewall.cluster_iap_ssh_firewall[0]: Creation complete after 11s [id=projects/myproject/global/firewalls/mycluster-allow-iap]
module.slurm_cluster_network.google_compute_router.cluster_router[0]: Creation complete after 11s [id=projects/myproject/regions/us-central1/routers/mycluster-router]
module.slurm_cluster_network.google_compute_subnetwork.cluster_subnet[0]: Still creating... [20s elapsed]
module.slurm_cluster_network.google_compute_subnetwork.cluster_subnet[0]: Creation complete after 21s [id=projects/myproject/regions/us-central1/subnetworks/mycluster-us-central1]
module.slurm_cluster_network.google_compute_router_nat.cluster_nat[0]: Creating...
module.slurm_cluster_login.google_compute_instance.login_node[0]: Creating...
module.slurm_cluster_controller.google_compute_instance.controller_node[0]: Creating...
module.slurm_cluster_network.google_compute_router_nat.cluster_nat[0]: Still creating... [10s elapsed]
module.slurm_cluster_login.google_compute_instance.login_node[0]: Still creating... [10s elapsed]
module.slurm_cluster_controller.google_compute_instance.controller_node[0]: Still creating... [10s elapsed]
module.slurm_cluster_network.google_compute_router_nat.cluster_nat[0]: Still creating... [20s elapsed]
module.slurm_cluster_login.google_compute_instance.login_node[0]: Still creating... [20s elapsed]
module.slurm_cluster_controller.google_compute_instance.controller_node[0]: Still creating... [20s elapsed]
module.slurm_cluster_network.google_compute_router_nat.cluster_nat[0]: Creation complete after 22s [id=myproject/us-central1/mycluster-router/mycluster-nat]
module.slurm_cluster_login.google_compute_instance.login_node[0]: Still creating... [30s elapsed]
module.slurm_cluster_controller.google_compute_instance.controller_node[0]: Still creating... [30s elapsed]
module.slurm_cluster_controller.google_compute_instance.controller_node[0]: Creation complete after 34s [id=projects/myproject/zones/us-central1-a/instances/mycluster-controller]
module.slurm_cluster_login.google_compute_instance.login_node[0]: Creation complete after 35s [id=projects/myproject/zones/us-central1-a/instances/mycluster-login0]

Apply complete! Resources: 8 added, 0 changed, 0 destroyed.

Outputs:

controller_network_ips = [
  [
    "10.0.0.2",
  ],
]
login_network_ips = [
  [
    "10.0.0.3",
  ],
]
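If you need to see these output values again later, for example to note the controller and login IP addresses, you can re-display them from the same directory; terraform output reads the values recorded in the Terraform state:
terraform output
terraform output login_network_ips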
The cluster configuration takes 5-10 minutes to complete. Track the progress of the configuration:
gcloud compute ssh ${CLUSTER_NAME}-controller \
    --command "sudo journalctl -fu google-startup-scripts.service" \
    --zone $CLUSTER_ZONE
When the script finishes, your Slurm cluster is ready to use. The output ends with the following line:
Started Google Compute Engine Startup Scripts.
To stop watching the configuration, press Control+C.
Verify that the cluster is operational
In Cloud Shell, check that the cluster is ready by logging in to the login node:
export CLUSTER_LOGIN_NODE=$(gcloud compute instances list \
    --zones ${CLUSTER_ZONE} \
    --filter="name ~ .*login." \
    --format="value(name)" | head -n1)
gcloud compute ssh ${CLUSTER_LOGIN_NODE} \
    --zone $CLUSTER_ZONE
The cluster is ready when the login node displays the Slurm ASCII-art welcome banner.
To exit the cluster, press Control+D.
When the cluster is ready, schedule a job to verify that it is working correctly. This job runs the hostname command on several nodes in the cluster:
gcloud compute ssh ${CLUSTER_LOGIN_NODE} \
    --command 'sbatch -N2 --wrap="srun hostname"' --zone $CLUSTER_ZONE
gcloud compute ssh ${CLUSTER_LOGIN_NODE} \
    --command 'cat slurm-*.out' --zone $CLUSTER_ZONE
The output is similar to the following:
helloworld-compute1
helloworld-compute2
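While a job is queued or running, you can inspect it from the login node with standard Slurm commands; replace JOB_ID with the job ID that sbatch reported:
squeue
scontrol show job JOB_ID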
You now have a working cluster.
Autoscaling the Slurm cluster
When you deploy a Slurm cluster, you specify values for the max_node_count and static_node_count properties. The max_node_count value determines the maximum number of compute nodes that your cluster runs at any given time. The static_node_count value specifies the number of compute nodes that are always running. By default, the basic.tfvars file sets max_node_count to 10 and static_node_count to 2.
The difference between these two numbers, max_node_count - static_node_count, is the number of ephemeral nodes in the cluster. Ephemeral nodes are created on demand in response to job scheduling requests. After the job that they were created to execute completes, the nodes are destroyed unless the cluster can use them to execute other jobs. On Google Cloud, the Slurm power-saving mechanism autoscales clusters by instantiating ephemeral nodes only when necessary.
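For example, with the default values you can confirm the split directly from the basic.tfvars file and a little shell arithmetic (a sketch that assumes the two settings appear as plain key = value lines):
grep -E "max_node_count|static_node_count" basic.tfvars
echo $(( 10 - 2 ))   # 8 ephemeral nodes with the default settings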
In Cloud Shell, log in to the cluster's login node:
gcloud compute ssh ${CLUSTER_LOGIN_NODE} --zone $CLUSTER_ZONE
View the node count:
sinfo
The output is similar to the following:
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*    up     infinite       8  idle~  demo-compute[3-10]
debug*    up     infinite       2  idle   demo-compute[1-2]
The ~ suffix on the STATE descriptor idle in the first line of output indicates that the 8 ephemeral nodes are in power-saving mode.
To autoscale the cluster, create 3 ephemeral nodes:
sbatch -N5 --wrap="srun hostname"
The output is similar to the following:
Submitted batch job JOB_ID
View the node count again:
sinfo
The output is similar to the following:
PARTITION AVAIL  TIMELIMIT  NODES  STATE   NODELIST
debug*    up     infinite       3  alloc#  demo-compute[3-5]
debug*    up     infinite       5  idle~   demo-compute[6-10]
debug*    up     infinite       2  idle    demo-compute[1-2]
The # suffix on the STATE descriptor alloc in the first line of output indicates that 3 ephemeral nodes are being created. After the additional nodes are created and configured, they run the job, and then after a short delay, they are destroyed.
To exit the cluster, press Control+D.
Clean up
The easiest way to eliminate billing is to delete the Cloud project you created for the tutorial. Alternatively, you can delete the individual resources.
Delete the Slurm cluster
Delete the cluster:
terraform destroy -var-file=basic.tfvars -auto-approve
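After the destroy completes, you can optionally confirm that no cluster instances remain; this gcloud filter is illustrative and matches instance names that start with your cluster name:
gcloud compute instances list \
    --filter="name ~ ^${CLUSTER_NAME}" \
    --zones ${CLUSTER_ZONE}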
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.