This tutorial shows how to deploy a Slurm cluster on Compute Engine. Slurm is a popular workload manager that is used in many high performance computing centers. For a discussion of high performance computing terminology and use cases, see Using clusters for large-scale technical computing in the cloud.
The following diagram illustrates the configuration that you create in this tutorial.
As a cluster workload manager, Slurm does the following:
- Allocates resources (compute nodes) to users.
- Provides a structure for starting, executing, and monitoring work on the nodes.
- Manages a queue of pending work.
The deployment developed for this tutorial is a simplified cluster architecture that supports a few users running a single-app workload. You can use other configurations for multi-workload deployments, large-scale production deployments, and hybrid deployments that work with on-premises or multi-cloud clusters, but these configurations are outside the scope of this tutorial.
In this tutorial, you interact with the system by using the login (head) node. After the cluster is deployed, you connect to the login node by using SSH, install the apps, and use Slurm command-line tools to submit jobs for computation. The Slurm scheduler, running on the controller node, schedules the queued jobs by matching available resources with the job requirements and manages the execution of the jobs on the compute nodes. The NFS server provides a common shared space for files.
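For example, a common way to submit work through Slurm is a small batch script. The following sketch is purely illustrative and is not part of this tutorial's steps (the file name hello.sh, the job name, and the output pattern are placeholders); the tutorial itself submits jobs with a one-line sbatch --wrap command later on:

#!/bin/bash
#SBATCH --job-name=hello          # name shown by squeue
#SBATCH --nodes=2                 # number of compute nodes to allocate
#SBATCH --output=hello_%j.out     # %j expands to the job ID

# Run the hostname command once on each allocated node.
srun hostname

You would submit such a script from the login node with sbatch hello.sh.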
Objectives
- Use Cloud Deployment Manager to deploy a Slurm cluster.
- Run a job using Slurm.
- Query cluster information and monitor running jobs in Slurm.
- Autoscale nodes to accommodate job parameters and requirements.
Costs
This tutorial uses the following billable components of Google Cloud:

- Compute Engine
To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Compute Engine and Deployment Manager APIs.
- In the Cloud Console, activate Cloud Shell.
  At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.
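Optionally, before you continue, you can confirm which project Cloud Shell is configured to use; this check isn't a required tutorial step:

gcloud config get-value core/project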
Deploying the Slurm cluster
In this section, you prepare the cluster configuration, use Deployment Manager to deploy the Slurm cluster to Google Cloud, and then verify that the cluster is operational.
Prepare the cluster configuration
In Cloud Shell, clone the slurm-gcp GitHub repository:

git clone https://github.com/SchedMD/slurm-gcp.git
Define the following environment variables:
export CLUSTER_DEPLOY_NAME="cluster-deployment-name"
export CLUSTER_NAME="cluster-name"
export CLUSTER_REGION="cluster-region"
export CLUSTER_ZONE="cluster-zone"
Replace the following:
- cluster-deployment-name: a unique name for your cluster deployment.
- cluster-name: a unique name for your cluster.
- cluster-region: the region where you deploy the cluster. Pick a region close to your location. For more information, see Regions and zones.
- cluster-zone: the zone where you deploy the cluster.
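For example, the following placeholder values would deploy a cluster named demo in the us-central1-b zone; substitute names and a location of your own:

export CLUSTER_DEPLOY_NAME="slurm-demo"
export CLUSTER_NAME="demo"
export CLUSTER_REGION="us-central1"
export CLUSTER_ZONE="us-central1-b"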
In the slurm-gcp folder, copy the slurm-cluster.yaml file to the ${CLUSTER_DEPLOY_NAME}.yaml file:

cd slurm-gcp
cp slurm-cluster.yaml ${CLUSTER_DEPLOY_NAME}.yaml
In a text editor, modify the ${CLUSTER_DEPLOY_NAME}.yaml file for your environment. Use the types as defined in the slurm.jinja.schema file, which specifies default and permissible values for all configuration properties except the default_users value. Make the following required changes:

- cluster_name: change the name of your cluster to be cluster-name.
- region and zone: replace with cluster-region and cluster-zone.
- compute_machine_type: (Optional) to use a different machine type, change the compute_machine_type value. For example, if you need more CPU cores or memory than are available with the default choice of n1-standard-2, choose n1-standard-4. For more information, see Machine types.
- vpc_net and vpc_subnet: (Optional) use an existing Virtual Private Cloud (VPC) network and VPC subnet. The network and subnet requirements are described in the slurm.jinja.schema file. If you don't specify values, a new network or subnet is created for your cluster. For more information, see VPC networks.
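As a rough sketch only, the edited properties in ${CLUSTER_DEPLOY_NAME}.yaml might look similar to the following. The property names come from the list above, but the surrounding layout and the example values are illustrative assumptions; keep the rest of the file as it appears in your copy of slurm-cluster.yaml.

# Excerpt; only the properties discussed above are shown.
resources:
- name: slurm-cluster
  type: slurm.jinja
  properties:
    cluster_name         : demo            # cluster-name
    region               : us-central1     # cluster-region
    zone                 : us-central1-b   # cluster-zone
    compute_machine_type : n1-standard-2   # optional: for example, n1-standard-4
    # vpc_net            : my-network      # optional: existing VPC network
    # vpc_subnet         : my-subnet       # optional: existing VPC subnet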
Save the file.
Deploy the Slurm cluster using Deployment Manager
In Cloud Shell, use Deployment Manager to deploy your cluster to Google Cloud:
gcloud deployment-manager deployments \
    --project="$(gcloud config get-value core/project)" \
    create $CLUSTER_DEPLOY_NAME \
    --config ${CLUSTER_DEPLOY_NAME}.yaml
The output is similar to the following:
The fingerprint of the deployment is VWVaIYX1mFpjv9UDhzalYQ==
Waiting for create [operation-1582142432278-59ef33da8f6fa-f5785d68-21feb545]...done.
Create operation operation-1582142432278-59ef33da8f6fa-f5785d68-21feb545 completed successfully.
NAME                                   TYPE                   STATE       ERRORS  INTENT
helloworld-all-internal-firewall-rule  compute.v1.firewall    IN_PREVIEW  []      CREATE_OR_ACQUIRE
helloworld-compute-image               compute.v1.instance    IN_PREVIEW  []      CREATE_OR_ACQUIRE
helloworld-compute1                    compute.v1.instance    IN_PREVIEW  []      CREATE_OR_ACQUIRE
helloworld-compute2                    compute.v1.instance    IN_PREVIEW  []      CREATE_OR_ACQUIRE
helloworld-controller                  compute.v1.instance    IN_PREVIEW  []      CREATE_OR_ACQUIRE
helloworld-login1                      compute.v1.instance    IN_PREVIEW  []      CREATE_OR_ACQUIRE
helloworld-router                      compute.v1.router      IN_PREVIEW  []      CREATE_OR_ACQUIRE
helloworld-slurm-network               compute.v1.network     IN_PREVIEW  []      CREATE_OR_ACQUIRE
helloworld-slurm-subnet                compute.v1.subnetwork  IN_PREVIEW  []      CREATE_OR_ACQUIRE
helloworld-ssh-firewall-rule           compute.v1.firewall    IN_PREVIEW  []      CREATE_OR_ACQUIRE
The cluster configuration takes 5-10 minutes to complete. Track the progress of the configuration:
gcloud compute ssh ${CLUSTER_NAME}-controller \
    --command "sudo journalctl -fu google-startup-scripts.service" \
    --zone $CLUSTER_ZONE
When the script finishes, your Slurm cluster is ready to use. The output ends with the following line:
Started Google Compute Engine Startup Scripts.
To stop watching the configuration, press Control+C.
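You can also check the overall deployment status from Cloud Shell at any point; this is an optional extra check, not a tutorial step:

gcloud deployment-manager deployments describe $CLUSTER_DEPLOY_NAME \
    --project="$(gcloud config get-value core/project)"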
Verify that the cluster is operational
In Cloud Shell, check that the cluster is ready by logging in to the login node:
export CLUSTER_LOGIN_NODE=$(gcloud compute instances list \
    --zones ${CLUSTER_ZONE} \
    --filter="name ~ .*login." \
    --format="value(name)" | head -n1)

gcloud compute ssh ${CLUSTER_LOGIN_NODE} \
    --zone $CLUSTER_ZONE
The cluster is ready when the output is similar to the following:
[ASCII-art Slurm logo welcome banner]
To exit the cluster, press Control+D.

When the cluster is ready, schedule a job to verify that it is working correctly. This job runs the hostname command on two of the nodes in the cluster:

gcloud compute ssh ${CLUSTER_NAME}-login1 \
    --command 'sbatch -N2 --wrap="srun hostname"' \
    --zone $CLUSTER_ZONE

gcloud compute ssh ${CLUSTER_NAME}-login1 \
    --command 'cat slurm-*.out' \
    --zone $CLUSTER_ZONE
The output is similar to the following:
helloworld-compute1
helloworld-compute2
You now have a working cluster.
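Optionally, you can log in to the login node again and inspect the test job with the standard Slurm client tools; JOB_ID here is a placeholder for the ID that sbatch printed:

squeue                     # list pending and running jobs
scontrol show job JOB_ID   # detailed information about one job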
Autoscaling the Slurm cluster
When you deploy a Slurm cluster, you specify values for the max_node_count and static_node_count properties. The max_node_count value determines the maximum number of compute nodes that your cluster runs at any given time. The static_node_count value specifies the number of compute nodes that are always running. By default, the slurm-cluster.yaml file sets max_node_count to 10 and static_node_count to 2.
The difference between these two numbers, max_node_count - static_node_count, is the number of ephemeral nodes in the cluster; with the default values, that is 10 - 2 = 8 ephemeral nodes. Ephemeral nodes are created on demand in response to job scheduling requests. After the job that they were created to execute completes, the nodes are destroyed unless the cluster can use them to execute other jobs. On Google Cloud, the Slurm power-saving mechanism autoscales clusters by instantiating ephemeral nodes only when necessary.
Log back in to the cluster's login node, and then view the node count:
sinfo
The output is similar to the following:
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*       up   infinite      8  idle~  demo-compute[3-10]
debug*       up   infinite      2  idle   demo-compute[1-2]
The ~ suffix on the STATE descriptor idle in the first line of output indicates that the 8 ephemeral nodes are in power-saving mode.

To autoscale the cluster, submit a job that requests more nodes than the static node count. The following command requests 5 nodes, so Slurm creates 3 ephemeral nodes in addition to the 2 static nodes:
sbatch -N5 --wrap="srun hostname"
The output is similar to the following:
Submitted batch job JOB_ID
View the node count again:
sinfo
The output is similar to the following:
PARTITION AVAIL  TIMELIMIT  NODES  STATE   NODELIST
debug*       up   infinite      3  alloc#  demo-compute[3-5]
debug*       up   infinite      5  idle~   demo-compute[6-10]
debug*       up   infinite      2  idle    demo-compute[1-2]
The # suffix on the STATE descriptor alloc in the first line of output indicates that 3 ephemeral nodes are created. After the additional nodes are created and configured, they run the job, and then after a short delay, they are destroyed.
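If you want to watch the ephemeral nodes appear and then disappear, one option (assuming the watch utility is available on the login node image) is to refresh the node and queue views periodically:

# Refresh every 10 seconds; press Control+C to stop.
watch -n 10 'sinfo; squeue'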
Cleaning up
The easiest way to eliminate billing is to delete the Cloud project you created for the tutorial. Alternatively, you can delete the individual resources.

Delete the project
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the Slurm cluster
Delete the cluster:
gcloud deployment-manager deployments delete $CLUSTER_DEPLOY_NAME
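To confirm that the deployment was removed, you can list the remaining deployments in your project:

gcloud deployment-manager deployments list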
What's next
- Try out other Google Cloud features for yourself. Have a look at our tutorials.