Deploying a Slurm cluster on Compute Engine

This tutorial shows how to deploy a Slurm cluster on Compute Engine. Slurm is a popular workload manager used in many high performance computing centers. For a discussion of high performance computing terminology and use cases, see Using clusters for large-scale technical computing in the cloud.

The following diagram illustrates the configuration that you create in this tutorial.

Architectural diagram showing a Slurm cluster installed on Compute Engine.

As a cluster workload manager, Slurm does the following:

  • Allocates resources (compute nodes) to users.
  • Provides a structure for starting, executing, and monitoring work on the nodes.
  • Manages a queue of pending work.

The deployment developed for this tutorial is a simplified cluster architecture that supports a few users running a single-app workload. You can use other configurations for multi-workload deployments, large-scale production deployments, and hybrid deployments that work with on-premises or multi-cloud clusters, but these configurations are outside the scope of this tutorial.

In this tutorial, you interact with the system by using the login (head) node. After the cluster is deployed, you connect to the login node by using SSH, install the apps, and use Slurm command-line tools to submit jobs for computation. The Slurm scheduler, running on the controller node, schedules the queued jobs by matching available resources with the job requirements and manages the execution of the jobs on the compute nodes. The NFS server provides a common shared space for files.
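
Later sections use several Slurm command-line tools, so a brief orientation may help. The commands below show a typical interactive session on the login node; job.sh and JOB_ID are placeholders for illustration, not files or values created by this tutorial.

    sinfo            # list partitions and node states
    sbatch job.sh    # submit a batch script to the scheduler queue
    squeue           # show pending and running jobs
    scancel JOB_ID   # cancel a queued or running job by its ID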

Objectives

  • Use Terraform to deploy a Slurm cluster.
  • Run a job using Slurm.
  • Query cluster information and monitor running jobs in Slurm.
  • Autoscale nodes to accommodate job parameters and requirements.

Costs

This tutorial uses the following billable component of Google Cloud:

  • Compute Engine
To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Compute Engine API. (If you prefer, you can enable it from Cloud Shell instead, as shown in the example after this list.)

    Enable the API

  5. In the Cloud Console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.
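
If you prefer to work from the command line, you can also enable the Compute Engine API from Cloud Shell. This step is optional and assumes that your Cloud Shell session is already set to the project that you selected earlier.

    # Confirm which project Cloud Shell is using.
    gcloud config get-value core/project

    # Enable the Compute Engine API for that project.
    gcloud services enable compute.googleapis.com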

Deploying the Slurm cluster

In this section, you prepare the cluster configuration, use Terraform to deploy the Slurm cluster to Google Cloud, and then verify that the cluster is operational.

Prepare the cluster configuration

  1. In Cloud Shell, clone the slurm-gcp GitHub repository:

    git clone https://github.com/SchedMD/slurm-gcp.git
    
  2. Define the following environment variables:

    export CLUSTER_NAME="cluster-name"
    export CLUSTER_ZONE="cluster-zone"
    

    Replace the following:

    • cluster-name: a unique name for your cluster.
    • cluster-zone: the zone where you deploy the cluster.
  3. In the slurm-gcp/tf/examples/basic folder, copy the basic.tfvars.example file to the basic.tfvars file:

    cd slurm-gcp/tf/examples/basic
    cp basic.tfvars.example basic.tfvars
    
  4. Use the sed command to modify the basic.tfvars file for your environment:

    sed -i "s/\(cluster_name.*= \)\"\(.*\)\"/\1\"${CLUSTER_NAME}\"/" basic.tfvars
    sed -i "s/<project>/$(gcloud config get-value core/project)/" basic.tfvars
    sed -i "s/\(zone.*= \)\"\(.*\)\"/\1\"${CLUSTER_ZONE}\"/" basic.tfvars
    
  5. (Optional) In a text editor, customize the basic.tfvars file for your environment by changing preset values (for equivalent sed commands, see the example after this list):

    • machine_type: To use a different machine type for the cluster's compute nodes, change this value. For example, change this value if you need more CPU cores or memory than are available with the default n1-standard-2 machine type. For more information, see Machine types.
    • compute_disk_type: To use SSDs on compute nodes, change this value to pd-ssd.
    • compute_disk_size_gb: To use larger or smaller disks attached to your compute nodes, change this value.
  6. (Optional) In a text editor, further customize the basic.tfvars file for your environment by uncommenting additional Terraform variable declarations.
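
For example, the following commands make the same kinds of changes as the optional edits in step 5, but from the command line. This is a sketch only: the n1-standard-4 machine type, pd-ssd disk type, and 64 GB disk size are illustrative values, and the commands assume that basic.tfvars keeps the same name = value layout as basic.tfvars.example.

    # Illustrative values only; pick sizes that match your workload.
    # The leading ^\s* anchor keeps the first command from also matching
    # controller_machine_type or login_machine_type.
    sed -i 's/^\(\s*machine_type\s*= \)"\(.*\)"/\1"n1-standard-4"/' basic.tfvars
    sed -i 's/\(compute_disk_type.*= \)"\(.*\)"/\1"pd-ssd"/' basic.tfvars
    sed -i 's/\(compute_disk_size_gb.*=\).*/\1 64/' basic.tfvars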

Deploy the Slurm cluster using Terraform

  1. In Cloud Shell, initialize the working directory for Terraform:

    terraform init
    
  2. Use Terraform to deploy your cluster to Google Cloud:

    terraform apply -var-file=basic.tfvars
    

    The output is similar to the following:

    terraform apply -var-file=basic.tfvars -auto-approve
    module.slurm_cluster_controller.data.google_compute_default_service_account.default: Refreshing state...
    module.slurm_cluster_login.data.google_compute_default_service_account.default: Refreshing state...
    module.slurm_cluster_network.google_compute_network.cluster_network[0]: Creating...
    module.slurm_cluster_network.google_compute_network.cluster_network[0]: Still creating... [10s elapsed]
    module.slurm_cluster_network.google_compute_network.cluster_network[0]: Still creating... [20s elapsed]
    module.slurm_cluster_network.google_compute_network.cluster_network[0]: Creation complete after 22s [id=projects/myproject/global/networks/mycluster-network]
    module.slurm_cluster_network.google_compute_router.cluster_router[0]: Creating...
    module.slurm_cluster_network.google_compute_subnetwork.cluster_subnet[0]: Creating...
    module.slurm_cluster_network.google_compute_firewall.cluster_iap_ssh_firewall[0]: Creating...
    module.slurm_cluster_network.google_compute_firewall.cluster_internal_firewall[0]: Creating...
    module.slurm_cluster_network.google_compute_router.cluster_router[0]: Still creating... [10s elapsed]
    module.slurm_cluster_network.google_compute_subnetwork.cluster_subnet[0]: Still creating... [10s elapsed]
    module.slurm_cluster_network.google_compute_firewall.cluster_iap_ssh_firewall[0]: Still creating... [10s elapsed]
    module.slurm_cluster_network.google_compute_firewall.cluster_internal_firewall[0]: Still creating... [10s elapsed]
    module.slurm_cluster_network.google_compute_firewall.cluster_internal_firewall[0]: Creation complete after 11s [id=projects/myproject/global/firewalls/mycluster-allow-internal]
    module.slurm_cluster_network.google_compute_firewall.cluster_iap_ssh_firewall[0]: Creation complete after 11s [id=projects/myproject/global/firewalls/mycluster-allow-iap]
    module.slurm_cluster_network.google_compute_router.cluster_router[0]: Creation complete after 11s [id=projects/myproject/regions/us-central1/routers/mycluster-router]
    module.slurm_cluster_network.google_compute_subnetwork.cluster_subnet[0]: Still creating... [20s elapsed]
    module.slurm_cluster_network.google_compute_subnetwork.cluster_subnet[0]: Creation complete after 21s [id=projects/myproject/regions/us-central1/subnetworks/mycluster-us-central1]
    module.slurm_cluster_network.google_compute_router_nat.cluster_nat[0]: Creating...
    module.slurm_cluster_login.google_compute_instance.login_node[0]: Creating...
    module.slurm_cluster_controller.google_compute_instance.controller_node[0]: Creating...
    module.slurm_cluster_network.google_compute_router_nat.cluster_nat[0]: Still creating... [10s elapsed]
    module.slurm_cluster_login.google_compute_instance.login_node[0]: Still creating... [10s elapsed]
    module.slurm_cluster_controller.google_compute_instance.controller_node[0]: Still creating... [10s elapsed]
    module.slurm_cluster_network.google_compute_router_nat.cluster_nat[0]: Still creating... [20s elapsed]
    module.slurm_cluster_login.google_compute_instance.login_node[0]: Still creating... [20s elapsed]
    module.slurm_cluster_controller.google_compute_instance.controller_node[0]: Still creating... [20s elapsed]
    module.slurm_cluster_network.google_compute_router_nat.cluster_nat[0]: Creation complete after 22s [id=myproject/us-central1/mycluster-router/mycluster-nat]
    module.slurm_cluster_login.google_compute_instance.login_node[0]: Still creating... [30s elapsed]
    module.slurm_cluster_controller.google_compute_instance.controller_node[0]: Still creating... [30s elapsed]
    module.slurm_cluster_controller.google_compute_instance.controller_node[0]: Creation complete after 34s [id=projects/myproject/zones/us-central1-a/instances/mycluster-controller]
    module.slurm_cluster_login.google_compute_instance.login_node[0]: Creation complete after 35s [id=projects/myproject/zones/us-central1-a/instances/mycluster-login0]
    
    Apply complete! Resources: 8 added, 0 changed, 0 destroyed.
    
    Outputs:
    
    controller_network_ips = [
      [
        "10.0.0.2",
      ],
    ]
    login_network_ips = [
      [
        "10.0.0.3",
      ],
    ]
    
  3. The cluster configuration takes 5 to 10 minutes to complete. To track the progress of the configuration, run the following command:

    gcloud compute ssh ${CLUSTER_NAME}-controller \
        --command "sudo journalctl -fu google-startup-scripts.service" \
        --zone $CLUSTER_ZONE
    

    When the script finishes, your Slurm cluster is ready to use. The output ends with the following line:

    Started Google Compute Engine Startup Scripts.
    
  4. To stop watching the configuration, press Control+C.
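
The controller_network_ips and login_network_ips values shown earlier are stored in the local Terraform state, so you can re-display them at any time. The following commands are optional and assume that you run them from the slurm-gcp/tf/examples/basic directory.

    # Re-display the Terraform outputs for the deployed cluster.
    terraform output

    # Preview what would change before re-applying a modified basic.tfvars.
    terraform plan -var-file=basic.tfvars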

Verify that the cluster is operational

  1. In Cloud Shell, check that the cluster is ready by logging in to the login node:

    export CLUSTER_LOGIN_NODE=$(gcloud compute instances list \
        --zones ${CLUSTER_ZONE} \
        --filter="name ~ .*login." \
        --format="value(name)" | head -n1)
    gcloud compute ssh ${CLUSTER_LOGIN_NODE} \
        --zone $CLUSTER_ZONE
    

    The cluster is ready when the output is similar to the following:

                                     SSSSSSS
                                    SSSSSSSSS
                                    SSSSSSSSS
                                    SSSSSSSSS
                            SSSS     SSSSSSS     SSSS
                           SSSSSS               SSSSSS
                           SSSSSS    SSSSSSS    SSSSSS
                            SSSS    SSSSSSSSS    SSSS
                    SSS             SSSSSSSSS             SSS
                   SSSSS    SSSS    SSSSSSSSS    SSSS    SSSSS
                    SSS    SSSSSS   SSSSSSSSS   SSSSSS    SSS
                           SSSSSS    SSSSSSS    SSSSSS
                    SSS    SSSSSS               SSSSSS    SSS
                   SSSSS    SSSS     SSSSSSS     SSSS    SSSSS
              S     SSS             SSSSSSSSS             SSS     S
             SSS            SSSS    SSSSSSSSS    SSSS            SSS
              S     SSS    SSSSSS   SSSSSSSSS   SSSSSS    SSS     S
                   SSSSS   SSSSSS   SSSSSSSSS   SSSSSS   SSSSS
              S    SSSSS    SSSS     SSSSSSS     SSSS    SSSSS    S
        S    SSS    SSS                                   SSS    SSS    S
        S     S                                                   S     S
                    SSS
                    SSS
                    SSS
                    SSS
     SSSSSSSSSSSS   SSS   SSSS       SSSS    SSSSSSSSS   SSSSSSSSSSSSSSSSSSSS
    SSSSSSSSSSSSS   SSS   SSSS       SSSS   SSSSSSSSSS  SSSSSSSSSSSSSSSSSSSSSS
    SSSS            SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
    SSSS            SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
    SSSSSSSSSSSS    SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
     SSSSSSSSSSSS   SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
             SSSS   SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
             SSSS   SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
    SSSSSSSSSSSSS   SSS   SSSSSSSSSSSSSSS   SSSS        SSSS     SSSS     SSSS
    SSSSSSSSSSSS    SSS    SSSSSSSSSSSSS    SSSS        SSSS     SSSS     SSSS
    
  2. To exit the cluster, press Control+D.

  3. When the cluster is ready, schedule a job to verify that it is working correctly. This job runs the hostname command on two of the cluster's compute nodes.

    gcloud compute ssh ${CLUSTER_NAME}-login0 \
        --command 'sbatch -N2 --wrap="srun hostname"' --zone $CLUSTER_ZONE

    gcloud compute ssh ${CLUSTER_NAME}-login0 \
        --command 'cat slurm-*.out' --zone $CLUSTER_ZONE
    

    The output is similar to the following:

    helloworld-compute1
    helloworld-compute2
    

    You now have a working cluster.
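
The --wrap flag is convenient for one-line jobs. For longer jobs, a common pattern is to write a batch script on the login node and submit it with sbatch. The following sketch is illustrative; the hello.sh file name and the job options are placeholders, not part of the deployment. A minimal script might look like this:

    #!/bin/bash
    #SBATCH --job-name=hello
    #SBATCH --nodes=2
    #SBATCH --output=hello-%j.out

    # Run the hostname command on each allocated node.
    srun hostname

You would then submit the script and check the queue from the login node:

    sbatch hello.sh
    squeue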

Autoscaling the Slurm cluster

When you deploy a Slurm cluster, you specify values for the max_node_count and static_node_count variables. The max_node_count value determines the maximum number of compute nodes that your cluster runs at any given time. The static_node_count value specifies the number of compute nodes that are always running. By default, the basic.tfvars file sets max_node_count to 10 and static_node_count to 2.

The difference between these two numbers, max_node_count - static_node_count, is the number of ephemeral nodes in the cluster. Ephemeral nodes are created on-demand in response to job scheduling requests. After the job that they were created to execute completes, the nodes are destroyed unless the cluster can use them to execute other jobs. On Google Cloud, the Slurm power-saving mechanism autoscales clusters by instantiating ephemeral nodes only when necessary.
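
To check which values your deployment uses, you can inspect the partition definition in your basic.tfvars file. This quick check is a sketch and assumes that the variable names appear verbatim in the file, as they do in basic.tfvars.example.

    # Run from the slurm-gcp/tf/examples/basic directory in Cloud Shell.
    grep -E "max_node_count|static_node_count" basic.tfvars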

  1. In Cloud Shell, log in to the login node by using SSH (as you did in the previous section), and then view the node count:

    sinfo
    

    The output is similar to the following:

    PARTITION AVAIL TIMELIMIT  NODES  STATE   NODELIST
    debug*    up    infinite   8      idle~   demo-compute[3-10]
    debug*    up    infinite   2      idle    demo-compute[1-2]
    

    The ~ suffix on the STATE descriptor idle in the first line of output indicates that the 8 ephemeral nodes are in power-saving mode.

  2. To autoscale the cluster, submit a job that requires more nodes than the number of static nodes. The following command requests 5 nodes, which causes Slurm to create 3 ephemeral nodes in addition to the 2 static nodes:

    sbatch -N5 --wrap="srun hostname"
    

    The output is similar to the following:

    Submitted batch job JOB_ID
    
  3. View the node count again:

    sinfo
    

    The output is similar to the following:

    PARTITION AVAIL  TIMELIMIT  NODES  STATE   NODELIST
    debug*    up     infinite   3      alloc#  demo-compute[3-5]
    debug*    up     infinite   5      idle~   demo-compute[6-10]
    debug*    up     infinite   2      idle    demo-compute[1-2]
    

    The # suffix on the STATE descriptor alloc in the first line of output indicates that the 3 ephemeral nodes are being created. After the additional nodes are created and configured, they run the job, and then, after a short delay, they are destroyed.
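
If you want to watch the scale-up and scale-down as they happen, you can poll Slurm from the login node and, separately, list the compute VMs from Cloud Shell. Both commands below are optional sketches: the 10-second interval is arbitrary, the first command assumes that the watch utility is available on the login node image, and the second assumes that the CLUSTER_NAME variable is still set in your Cloud Shell session.

    # On the login node: refresh the queue and node states every 10 seconds
    # (press Control+C to stop).
    watch -n 10 "squeue; sinfo"

    # In Cloud Shell: list the compute VMs that currently exist for the cluster.
    gcloud compute instances list --filter="name ~ ${CLUSTER_NAME}-compute"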

Cleaning up

The easiest way to eliminate billing is to delete the Cloud project you created for the tutorial. Alternatively, you can delete the individual resources.

Delete the project

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the Slurm cluster

  • Delete the cluster:

    terraform destroy -var-file=basic.tfvars
    
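To confirm that the cluster VMs are gone, you can list any remaining instances from Cloud Shell. This check is optional and assumes that the CLUSTER_NAME variable is still set in your session.

    # An empty result means that no cluster VMs remain.
    gcloud compute instances list --filter="name ~ ${CLUSTER_NAME}"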

What's next

  • Try out other Google Cloud features for yourself. Have a look at our tutorials.