Deploying a Slurm cluster on Compute Engine

This tutorial shows how to deploy a Slurm cluster on Compute Engine. Slurm is a popular workload manager used in many high performance computing centers. For a discussion of high performance computing terminology and use cases, see Using clusters for large-scale technical computing in the cloud.

The following diagram illustrates the configuration that you create in this tutorial.

Architectural diagram showing a Slurm cluster installed on Compute Engine.

As a cluster workload manager, Slurm does the following:

  • Allocates resources (compute nodes) to users.
  • Provides a structure for starting, executing, and monitoring work on the nodes.
  • Manages a queue of pending work.

The deployment in this tutorial uses a simplified cluster architecture that supports a few users running a single-app workload. You can use other configurations for multi-workload deployments, large-scale production deployments, and hybrid deployments that work with on-premises or multi-cloud clusters, but these configurations are outside the scope of this tutorial.

In this tutorial, you interact with the system by using the login (head) node. After the cluster is deployed, you connect to the login node by using SSH, install the apps, and use Slurm command-line tools to submit jobs for computation. The Slurm scheduler, running on the controller node, schedules the queued jobs by matching available resources with the job requirements and manages the execution of the jobs on the compute nodes. The NFS server provides a common shared space for files.
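
For example, after you connect to the login node, a typical Slurm session might look like the following sketch. The batch script name is a placeholder for your own workload; the commands themselves are standard Slurm client tools.

    # Submit a batch script to the queue (my-job.sh is a placeholder for your own script).
    sbatch my-job.sh

    # List pending and running jobs.
    squeue

    # Show the state of the cluster's partitions and nodes.
    sinfo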

Objectives

  • Use Cloud Deployment Manager to deploy a Slurm cluster.
  • Run a job using Slurm.
  • Query cluster information and monitor running jobs in Slurm.
  • Autoscale nodes to accommodate job parameters and requirements.

Costs

This tutorial uses the following billable components of Google Cloud:

  • Compute Engine

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the Compute Engine and Deployment Manager APIs.

    Enable the APIs

  5. In the Cloud Console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.
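
    If you prefer to work from the command line, you can confirm the active project and enable the required APIs directly in Cloud Shell. This step is optional; the service names shown are the standard identifiers for the Compute Engine and Deployment Manager APIs.

    # Confirm the project that Cloud Shell is configured to use.
    gcloud config get-value core/project

    # Enable the Compute Engine and Deployment Manager APIs.
    gcloud services enable compute.googleapis.com deploymentmanager.googleapis.com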

Deploying the Slurm cluster

In this section, you prepare the cluster configuration, use Deployment Manager to deploy the Slurm cluster to Google Cloud, and then verify that the cluster is operational.

Prepare the cluster configuration

  1. In Cloud Shell, clone the slurm-gcp GitHub repository:

    git clone https://github.com/SchedMD/slurm-gcp.git
    
  2. Define the following environment variables:

    export CLUSTER_DEPLOY_NAME="cluster-deployment-name"
    export CLUSTER_NAME="cluster-name"
    export CLUSTER_REGION="cluster-region"
    export CLUSTER_ZONE="cluster-zone"
    

    Replace the following:

    • cluster-deployment-name: a unique name for your cluster deployment.
    • cluster-name: a unique name for your cluster.
    • cluster-region: the region where you deploy the cluster. Pick a region close to your location. For more information, see Regions and zones.
    • cluster-zone: the zone where you deploy the cluster.
  3. In the slurm-gcp folder, copy the slurm-cluster.yaml file to the ${CLUSTER_DEPLOY_NAME}.yaml file:

    cd slurm-gcp
    cp slurm-cluster.yaml ${CLUSTER_DEPLOY_NAME}.yaml
    
  4. In a text editor, modify the ${CLUSTER_DEPLOY_NAME}.yaml file for your environment. Use the types defined in the slurm.jinja.schema file, which specifies default and permissible values for all configuration properties except the default_users value. Make the following required changes (an example excerpt follows this list):

    • cluster_name: set this value to cluster-name.
    • region and zone: set these values to cluster-region and cluster-zone.
    • compute_machine_type: (Optional) To use a different machine type, change the compute_machine_type value. For example, if you need more CPU cores or memory than the default n1-standard-2 provides, choose n1-standard-4. For more information, see Machine types.
    • vpc_net and vpc_subnet: (Optional) To use an existing Virtual Private Cloud (VPC) network and VPC subnet, set these values. The network and subnet requirements are described in the slurm.jinja.schema file. If you don't specify values, a new network and subnet are created for your cluster. For more information, see VPC networks.
  5. Save the file.
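
For reference, after editing, the properties section of the ${CLUSTER_DEPLOY_NAME}.yaml file might look similar to the following sketch. The surrounding file structure and the full set of properties come from the repository's slurm-cluster.yaml and slurm.jinja.schema files and can vary between releases; the values shown here are examples only.

    # Illustrative excerpt only; property names and defaults are defined in slurm.jinja.schema.
    properties:
      cluster_name          : mycluster
      region                : us-central1
      zone                  : us-central1-b
      compute_machine_type  : n1-standard-2
      static_node_count     : 2
      max_node_count        : 10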

Deploy the Slurm cluster using Deployment Manager

  1. In Cloud Shell, use Deployment Manager to deploy your cluster to Google Cloud:

    gcloud deployment-manager deployments create $CLUSTER_DEPLOY_NAME \
        --config ${CLUSTER_DEPLOY_NAME}.yaml \
        --project="$(gcloud config get-value core/project)"
    

    The output is similar to the following:

    The fingerprint of the deployment is VWVaIYX1mFpjv9UDhzalYQ==
    Waiting for create [operation-1582142432278-59ef33da8f6fa-f5785d68-21feb545]...done.
    Create operation operation-1582142432278-59ef33da8f6fa-f5785d68-21feb545 completed successfully.
    NAME                                   TYPE                   STATE       ERRORS  INTENT
    helloworld-all-internal-firewall-rule  compute.v1.firewall    IN_PREVIEW  []      CREATE_OR_ACQUIRE
    helloworld-compute-image               compute.v1.instance    IN_PREVIEW  []      CREATE_OR_ACQUIRE
    helloworld-compute1                    compute.v1.instance    IN_PREVIEW  []      CREATE_OR_ACQUIRE
    helloworld-compute2                    compute.v1.instance    IN_PREVIEW  []      CREATE_OR_ACQUIRE
    helloworld-controller                  compute.v1.instance    IN_PREVIEW  []      CREATE_OR_ACQUIRE
    helloworld-login1                      compute.v1.instance    IN_PREVIEW  []      CREATE_OR_ACQUIRE
    helloworld-router                      compute.v1.router      IN_PREVIEW  []      CREATE_OR_ACQUIRE
    helloworld-slurm-network               compute.v1.network     IN_PREVIEW  []      CREATE_OR_ACQUIRE
    helloworld-slurm-subnet                compute.v1.subnetwork  IN_PREVIEW  []      CREATE_OR_ACQUIRE
    helloworld-ssh-firewall-rule           compute.v1.firewall    IN_PREVIEW  []      CREATE_OR_ACQUIRE
    
  2. The cluster configuration takes 5-10 minutes to complete. Track the progress of the configuration:

    gcloud compute ssh ${CLUSTER_NAME}-controller \
        --command "sudo journalctl -fu google-startup-scripts.service" \
        --zone $CLUSTER_ZONE
    

    When the script finishes, your Slurm cluster is ready to use. The output ends with the following line:

    Started Google Compute Engine Startup Scripts.
    
  3. To stop watching the configuration, press Control+C.
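
You can also inspect the deployment itself at any point. The following optional command shows the current status of the deployment that you created:

    gcloud deployment-manager deployments describe $CLUSTER_DEPLOY_NAME \
        --project="$(gcloud config get-value core/project)"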

Verify that the cluster is operational

  1. In Cloud Shell, check that the cluster is ready by logging in to the login node:

    export CLUSTER_LOGIN_NODE=$(gcloud compute instances list \
        --zones ${CLUSTER_ZONE} \
        --filter="name ~ .*login." \
        --format="value(name)" | head -n1)
    gcloud compute ssh ${CLUSTER_LOGIN_NODE} \
        --zone $CLUSTER_ZONE
    

    The cluster is ready when the output is similar to the following:

                                     SSSSSSS
                                    SSSSSSSSS
                                    SSSSSSSSS
                                    SSSSSSSSS
                            SSSS     SSSSSSS     SSSS
                           SSSSSS               SSSSSS
                           SSSSSS    SSSSSSS    SSSSSS
                            SSSS    SSSSSSSSS    SSSS
                    SSS             SSSSSSSSS             SSS
                   SSSSS    SSSS    SSSSSSSSS    SSSS    SSSSS
                    SSS    SSSSSS   SSSSSSSSS   SSSSSS    SSS
                           SSSSSS    SSSSSSS    SSSSSS
                    SSS    SSSSSS               SSSSSS    SSS
                   SSSSS    SSSS     SSSSSSS     SSSS    SSSSS
              S     SSS             SSSSSSSSS             SSS     S
             SSS            SSSS    SSSSSSSSS    SSSS            SSS
              S     SSS    SSSSSS   SSSSSSSSS   SSSSSS    SSS     S
                   SSSSS   SSSSSS   SSSSSSSSS   SSSSSS   SSSSS
              S    SSSSS    SSSS     SSSSSSS     SSSS    SSSSS    S
        S    SSS    SSS                                   SSS    SSS    S
        S     S                                                   S     S
                    SSS
                    SSS
                    SSS
                    SSS
     SSSSSSSSSSSS   SSS   SSSS       SSSS    SSSSSSSSS   SSSSSSSSSSSSSSSSSSSS
    SSSSSSSSSSSSS   SSS   SSSS       SSSS   SSSSSSSSSS  SSSSSSSSSSSSSSSSSSSSSS
    SSSS            SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
    SSSS            SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
    SSSSSSSSSSSS    SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
     SSSSSSSSSSSS   SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
             SSSS   SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
             SSSS   SSS   SSSS       SSSS   SSSS        SSSS     SSSS     SSSS
    SSSSSSSSSSSSS   SSS   SSSSSSSSSSSSSSS   SSSS        SSSS     SSSS     SSSS
    SSSSSSSSSSSS    SSS    SSSSSSSSSSSSS    SSSS        SSSS     SSSS     SSSS
    
  2. To exit the cluster, press Control+D.

  3. When the cluster is ready, submit a job to verify that it is working correctly. This job runs the hostname command on two of the cluster's compute nodes.

    gcloud compute ssh ${CLUSTER_NAME}-login1 \
        --command 'sbatch -N2 --wrap="srun hostname"' --zone $CLUSTER_ZONE
    
    gcloud compute ssh ${CLUSTER_NAME}-login1 \
        --command 'cat slurm-*.out'  --zone $CLUSTER_ZONE
    

    The output is similar to the following:

    helloworld-compute1
    helloworld-compute2
    

    You now have a working cluster.
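
    To query cluster information or monitor jobs while they are queued or running, you can run other Slurm commands, such as squeue, through the same SSH pattern. For example:

    gcloud compute ssh ${CLUSTER_NAME}-login1 \
        --command 'squeue' --zone $CLUSTER_ZONE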

Autoscaling the Slurm cluster

When you deploy a Slurm cluster, you specify values for the max_node_count and the static_node_count properties. The max_node_count value determines the maximum number of compute nodes that your cluster runs at any given time. The static_node_count value specifies the number of compute nodes that are always running. By default, the slurm-cluster.yaml file sets max_node_count to 10 and static_node_count to 2.

The difference between these two numbers, max_node_count - static_node_count, is the number of ephemeral nodes in the cluster. Ephemeral nodes are created on-demand in response to job scheduling requests. After the job that they were created to execute completes, the nodes are destroyed unless the cluster can use them to execute other jobs. On Google Cloud, the Slurm power-saving mechanism autoscales clusters by instantiating ephemeral nodes only when necessary.

  1. Log in to the cluster's login node as described in the previous section, and then view the node count:

    sinfo
    

    The output is similar to the following:

    PARTITION AVAIL TIMELIMIT  NODES  STATE   NODELIST
    debug*    up    infinite   8      idle~   demo-compute[3-10]
    debug*    up    infinite   2      idle    demo-compute[1-2]
    

    The ~ suffix on the STATE descriptor idle in the first line of output indicates that the 8 ephemeral nodes are in power-saving mode.

  2. To autoscale the cluster, submit a job that requires five nodes. Because only two static nodes are running, Slurm creates three additional ephemeral nodes to satisfy the request:

    sbatch -N5 --wrap="srun hostname"
    

    The output is similar to the following:

    Submitted batch job JOB_ID
    
  3. View the node count again:

    sinfo
    

    The output is similar to the following:

    PARTITION AVAIL  TIMELIMIT  NODES  STATE   NODELIST
    debug*    up     infinite   3      alloc#  demo-compute[3-5]
    debug*    up     infinite   5      idle~   demo-compute[6-10]
    debug*    up     infinite   2      idle    demo-compute[1-2]
    

    The # suffix on the STATE descriptor alloc in the first line of output indicates that the three ephemeral nodes are being created. After the additional nodes are created and configured, they run the job, and then, after a short delay, they are destroyed.
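
    After the job finishes and a short idle period elapses, you can run sinfo again; the ephemeral nodes should return to the idle~ (power-saving) state shown at the start of this section:

    sinfo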

Cleaning up

The easiest way to eliminate billing is to delete the Cloud project you created for the tutorial. Alternatively, you can delete the individual resources.

Delete the project

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

Delete the Slurm cluster

  • Delete the cluster:

    gcloud deployment-manager deployments delete $CLUSTER_DEPLOY_NAME
    
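
    To confirm that the deployment and its resources were removed, you can list the remaining deployments:

    gcloud deployment-manager deployments list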

What's next

  • Try out other Google Cloud features for yourself. Have a look at our tutorials.