
Deploy an HPC cluster with Slurm

This document describes how to deploy an HPC cluster with Slurm in the Google Cloud console.



Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

  4. Enable the Compute Engine API.

    Enable the Compute Engine API

  5. Enable the Filestore API.

    Enable the Filestore API

  6. Enable the Cloud Storage API.

    Enable the Cloud Storage API

  7. Enable the Service Usage API.

    Enable the Service Usage API
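
As an alternative to the console buttons, the same four APIs can likely be enabled from a terminal that has the Google Cloud CLI installed. The following is a minimal sketch and not part of the original tutorial: the enable_apis helper and the GCLOUD override are hypothetical names introduced here, and the four service names are assumptions that you should confirm with gcloud services list --available. As written, the sketch only prints the arguments that it would pass to gcloud.

```shell
# enable_apis is a hypothetical helper, not part of the Cloud HPC Toolkit.
# $1 is the project ID; the remaining arguments are service names.
# Setting GCLOUD=echo prints the arguments instead of calling gcloud.
enable_apis() {
  project="$1"
  shift
  ${GCLOUD:-gcloud} services enable "$@" --project="$project"
}

# Dry run: the service names are assumed to match the four APIs listed above.
GCLOUD=echo enable_apis PROJECT_ID \
  compute.googleapis.com \
  file.googleapis.com \
  storage.googleapis.com \
  serviceusage.googleapis.com
```

To run the command for real, replace PROJECT_ID with your project ID and drop the GCLOUD=echo prefix.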

Costs

The cost of running this tutorial depends on which parts you complete: setting up the cluster, running jobs on it, or both. To estimate the cost, use the pricing calculator.

Tutorial-only costs

  • To estimate the cost for setting up this tutorial, use the following specifications:

    • Filestore Basic HDD (standard) capacity per region: 1024 GB
    • Standard persistent disk: 50 GB pd-standard for the Slurm login node.
    • Performance (SSD) persistent disks: 50 GB pd-ssd for the Slurm controller.
    • 1 N2 VM instance: n2-standard-4
    • 1 C2 VM instance: c2-standard-4
  • To estimate the cost for running a job on the cluster, use the following specifications:

    • 3 N2 VM instances: n2-standard-2. These are created when the srun -N 3 hostname command is run and the cluster autoscales. Each of these VMs will have 50 GB of pd-standard disk attached. These VMs are deleted automatically after one minute of inactivity.

Costs for submitting additional jobs

This tutorial does not use the following resources. However, because Slurm can autoscale compute nodes, they might be created if you submit additional jobs to the compute or debug partitions:

  • Jobs submitted to the default debug partition:

    • 4 N2 VM instances: n2-standard-2. Each of these VMs will have 50 GB of pd-standard disk attached.
  • Jobs submitted to the compute partition:

    • 20 C2 VM instances: c2-standard-60. Each of these VMs will have 50 GB of pd-standard disk attached.
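
Before entering these figures in the pricing calculator, it can help to total them. The following sketch adds up the vCPUs and persistent disk capacity from the specifications above. It assumes that the number at the end of each machine type name is its vCPU count, it combines pd-standard and pd-ssd capacity into one figure, and it leaves out the 1024 GB Filestore instance. The variable names are illustrative only.

```shell
# vCPUs per machine type, assumed from the machine type names.
N2_STANDARD_2=2
N2_STANDARD_4=4
C2_STANDARD_4=4
C2_STANDARD_60=60
DISK_GB=50   # every VM listed above has one 50 GB persistent disk

# Tutorial: 1 n2-standard-4 and 1 c2-standard-4 for setup,
# plus 3 n2-standard-2 nodes while the job runs.
tutorial_vcpus=$(( N2_STANDARD_4 + C2_STANDARD_4 + 3 * N2_STANDARD_2 ))
tutorial_disk_gb=$(( (2 + 3) * DISK_GB ))

# Ceiling for autoscaled nodes only: 4 debug nodes and 20 compute nodes.
max_autoscale_vcpus=$(( 4 * N2_STANDARD_2 + 20 * C2_STANDARD_60 ))
max_autoscale_disk_gb=$(( (4 + 20) * DISK_GB ))

echo "tutorial: ${tutorial_vcpus} vCPUs, ${tutorial_disk_gb} GB persistent disk"
echo "autoscale ceiling: ${max_autoscale_vcpus} vCPUs, ${max_autoscale_disk_gb} GB persistent disk"
```

The gap between the two totals shows why additional jobs, especially on the compute partition, can cost far more than the tutorial itself.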

Launch Cloud Shell

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Ensure that the default Compute Engine service account is enabled

Cloud HPC Toolkit requires that the default Compute Engine service account is enabled in your project and that the roles/editor IAM role is granted to the service account. These settings allow the Slurm controller to perform actions such as autoscaling.

From Cloud Shell, run the following commands to ensure these settings are enabled:

  1. Enable the default Compute Engine service account.

    gcloud iam service-accounts enable \
        --project=PROJECT_ID \
        PROJECT_NUMBER-compute@developer.gserviceaccount.com
    
  2. Grant the roles/editor IAM role to the service account.

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member=serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com \
        --role=roles/editor
    
Replace the following:

  • PROJECT_ID: your project ID
  • PROJECT_NUMBER: the automatically generated unique identifier for your project

    For more information, see Identifying projects.
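
If you know the project ID but not the project number, the number can be looked up with the Google Cloud CLI. The following sketch is an illustration rather than part of the tutorial: default_compute_sa is a hypothetical helper that builds the service account address used in the commands above, and the lookup runs only when gcloud is installed and a PROJECT_ID environment variable is set.

```shell
# default_compute_sa is a hypothetical helper.
# It builds the default Compute Engine service account address
# from the project number passed as $1.
default_compute_sa() {
  printf '%s-compute@developer.gserviceaccount.com\n' "$1"
}

# Look up the project number only when gcloud is available and
# PROJECT_ID is set in the environment (an assumption of this sketch).
if command -v gcloud >/dev/null 2>&1 && [ -n "${PROJECT_ID:-}" ]; then
  PROJECT_NUMBER="$(gcloud projects describe "$PROJECT_ID" \
    --format='value(projectNumber)')"
  default_compute_sa "$PROJECT_NUMBER"
fi
```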

Clone the Cloud HPC Toolkit GitHub repository

  1. Clone the GitHub repository:

    git clone https://github.com/GoogleCloudPlatform/hpc-toolkit.git
  2. Go to the main working directory:

    cd hpc-toolkit/

Build the Cloud HPC Toolkit binary

  1. To build the Cloud HPC Toolkit binary from source, from Cloud Shell run the following command:

    make
  2. To verify the build, from Cloud Shell run the following command:

    ./ghpc --version

    The output shows you the version of the Cloud HPC Toolkit that you are using.

Create the HPC deployment folder

An HPC blueprint is a YAML file that defines the HPC environment. The ghpc binary, which you built in the previous step, uses the HPC blueprint to create a deployment folder. You then use the deployment folder to deploy the environment.

This tutorial uses the slurm-gcp-v5-hpc-centos7.yaml example found in the Cloud HPC Toolkit GitHub repository.

To create a deployment folder from the HPC blueprint, run the following command from Cloud Shell:

./ghpc create community/examples/slurm-gcp-v5-hpc-centos7.yaml \
    -l ERROR --vars project_id=PROJECT_ID

Replace PROJECT_ID with your project ID.

This command creates the slurm-gcp-v5/ deployment folder, which contains the Terraform configuration needed to deploy your cluster. The -l ERROR flag prevents the deployment folder from being created if any of the validations fail.
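
Because a failed validation means that no folder is created, it can be worth confirming that the folder exists before you move on to Terraform. The following sketch checks for the slurm-gcp-v5/primary directory that the later terraform commands point at. The deployment_ready helper is a hypothetical name introduced for this example.

```shell
# deployment_ready is a hypothetical helper.
# It succeeds only if the deployment folder passed as $1
# contains the primary/ directory that terraform -chdir expects.
deployment_ready() {
  [ -d "$1/primary" ]
}

if deployment_ready slurm-gcp-v5; then
  echo "slurm-gcp-v5/primary found; ready for terraform init"
else
  echo "slurm-gcp-v5/primary not found; rerun ./ghpc create and check its output"
fi
```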

Deploy the HPC cluster using Terraform

To deploy the HPC cluster, complete the following steps:

  1. Set up the Terraform deployment by running the terraform init command:

    terraform -chdir=slurm-gcp-v5/primary init
  2. Generate a plan that describes the Google Cloud resources that you want to deploy by running the terraform apply command:

    terraform -chdir=slurm-gcp-v5/primary apply
  3. Review the plan and then start the deployment by typing yes and pressing Enter.

    This deployment takes about 5 minutes. You will see regular status updates in the terminal.

If the terraform apply command succeeds, a message similar to the following is displayed:

Apply complete! Resources: xx added, 0 changed, 0 destroyed.

Run a job on the HPC cluster

After the cluster deploys, complete the following steps to run a job:

  1. Go to the Compute Engine > VM instances page.

    Go to VM instances

  2. Connect to the slurmgcpv5-login-* VM using SSH-in-browser.

    From the Connect column of the VM, click SSH.

    After connecting to the VM, you might see the following message on the terminal:

      Slurm is currently being configured in the background


    If you see this message, wait a few minutes, disconnect, and then reconnect to the VM.

  3. From the command line of the VM, run the hostname command using Slurm.

    srun -N 3 hostname

    This command creates three compute nodes for your HPC cluster. It might take a minute while Slurm autoscales to create the three nodes.

    When the job finishes, you should see output similar to the following:

    $ srun -N 3 hostname
    slurmgcpv5-debug-ghpc-0
    slurmgcpv5-debug-ghpc-1
    slurmgcpv5-debug-ghpc-2
    

The Slurm controller automatically deletes autoscaled nodes that stay idle for more than 60 seconds.
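
One way to confirm that the job really ran on three separate nodes is to count the distinct hostnames in its output. The following sketch does this with standard tools. The count_nodes helper is a hypothetical name, and the sample input is the output shown in the step above rather than the result of a live run.

```shell
# count_nodes is a hypothetical helper.
# It counts the distinct, non-empty hostnames read from standard input.
count_nodes() {
  sort -u | grep -c .
}

# Sample output copied from the step above. On the login node, you could
# pipe the output of srun -N 3 hostname into count_nodes instead.
printf '%s\n' \
  slurmgcpv5-debug-ghpc-0 \
  slurmgcpv5-debug-ghpc-1 \
  slurmgcpv5-debug-ghpc-2 | count_nodes
```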

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, either destroy the HPC cluster or delete the Cloud project that contains the resources.

Destroy the HPC cluster

  1. Go to the VM instances page and check that the compute nodes are deleted.

    Compute nodes use the following naming convention: slurmgcpv5-debug-ghpc-*

    If you see any of these nodes, wait several minutes for them to be automatically deleted. This might take up to four minutes.

  2. After the compute nodes are removed, from the Cloud Shell terminal, run the following command:

    terraform -chdir=slurm-gcp-v5/primary destroy -auto-approve

    When the command completes, you should see output similar to the following:

    Destroy complete! Resources: xx destroyed.
    

  3. Go to the VM instances page and check that the VMs are deleted.

    Note: If you run the destroy command before Slurm shuts down the autoscaled nodes, the destroy command might fail. In this case, delete the VMs manually and rerun the destroy command.
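
To reduce the chance of that failure, the check in step 1 can be scripted so that the destroy command runs only after the compute nodes are gone. The following is a sketch under stated assumptions, not part of the tutorial: list_compute_nodes and wait_until_gone are hypothetical helpers, the name filter assumes the slurmgcpv5-debug-ghpc-* naming convention shown above, and PROJECT_ID is assumed to be set in the environment.

```shell
# list_compute_nodes is a hypothetical helper.
# It prints the names of VMs that match the compute node naming convention.
list_compute_nodes() {
  gcloud compute instances list \
    --project="${PROJECT_ID:?set PROJECT_ID first}" \
    --filter='name~^slurmgcpv5-debug-ghpc-' \
    --format='value(name)'
}

# wait_until_gone is a hypothetical helper.
# $1 is the maximum number of checks; $2 is the delay in seconds between checks.
# It returns 0 once no compute nodes are listed, 1 if nodes remain after
# the last check, and 2 if the listing command itself fails.
wait_until_gone() {
  attempts="$1"
  delay="$2"
  while [ "$attempts" -gt 0 ]; do
    nodes="$(list_compute_nodes)" || return 2
    if [ -z "$nodes" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_until_gone 8 30 checks every 30 seconds for up to four minutes. If it returns 0, you can run the terraform destroy command from step 2.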

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Learn more about HPC blueprints