HPC blueprints

An HPC blueprint is a YAML file that defines a reusable configuration and describes the specific HPC environment that you want to deploy using Cloud HPC Toolkit.

To configure your environment, you can either start with one of the Example HPC blueprints and modify it, or create your own blueprint. To create your own blueprint, review the Design an HPC blueprint section for an overview of the configurations that you need to specify.

Before you deploy a cluster, be sure to review the quota requirements.

Design an HPC blueprint

An HPC blueprint consists of the following main components, which come together as shown in the skeleton blueprint at the end of this section:

  • HPC blueprint name. The name of the blueprint. When naming your HPC blueprint, use the following conventions:

    • If you are updating or modifying an existing configuration, don't change the blueprint name.
    • If you are creating a new configuration, specify a new unique blueprint name.

    The blueprint name is added as a label to your cloud resources and is used for tracking usage and monitoring costs.

    The HPC blueprint name is set using the blueprint_name field.

  • Deployment variables. A set of parameters that are used by all modules in the blueprint. Use these variables to set values that are specific to a deployment.

    Deployment variables are set using the vars field in the blueprint. You can also set or override them at deployment time by specifying the --vars flag with the ghpc command.

    The most common deployment variables are as follows:

    • deployment_name: the name of the deployment. The deployment_name is a required variable for a deployment.

      This variable must be set to a unique value any time you deploy a new copy of an existing blueprint. The deployment name is added as a label to cloud resources and is used for tracking usage and monitoring costs.

      Because a single HPC blueprint can be used for multiple deployments, you can use the blueprint_name to identify the type of HPC environment (for example, slurm-high-performance-cluster), while the deployment_name identifies the targeted use of that cluster (for example, research-dept-prod).

    • project_id: the ID for the project where you want to deploy the cluster. The project_id is a required variable for a deployment.

    • zone: the zone where you want to deploy the cluster.

    • region: the region where you want to deploy the cluster.

    Other variables that you might want to specify here include a custom image family, a Shared VPC network, or a subnetwork that you want all modules to use.

  • Deployment groups. A distinct set of modules that are deployed together. A deployment group can contain only modules of a single type; for example, a deployment group can't mix Packer and Terraform modules.

    Deployment groups are set using the deployment_groups field. Each deployment group requires the following parameters:

    • group: the name of the group.
    • modules: the descriptors for each module, which include the following:

      • id: a unique identifier for the module.
      • source: the directory path or URL where the module is located. For more information, see Module fields.
      • kind: the type of module. Valid values are packer or terraform. This is an optional parameter that defaults to terraform if omitted.
      • use: a list of module IDs whose outputs can be linked to the module's settings. This is an optional parameter.
      • outputs: If you are using Terraform modules, use this parameter to specify a list of Terraform output values that you want to make available at the deployment group level.

        During deployment, these output values are printed to the screen after you run the terraform apply command.

        After deployment, you can access these outputs by running the terraform output command.

        This is an optional parameter.

      • settings: any module variable that you want to add. This is an optional parameter.

      For a list of supported modules, see Supported modules.

  • Terraform remote state configuration (optional). Most blueprints use Terraform modules to provision Google Cloud infrastructure. We recommend using Terraform remote state backed by a Cloud Storage bucket that has object versioning enabled. All configuration settings of the Cloud Storage backend are supported.

    The prefix setting determines the path within the bucket where state is stored. If prefix is unset, Cloud HPC Toolkit automatically generates a unique value based on the blueprint_name, deployment_name, and deployment group name.

    The following configuration enables remote state for all deployment groups in a blueprint:

    terraform_backend_defaults:
      type: gcs
      configuration:
        bucket: BUCKET_NAME
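
    For example, the following sketch also sets prefix explicitly. The bucket name and prefix values are placeholders:

    terraform_backend_defaults:
      type: gcs
      configuration:
        bucket: BUCKET_NAME
        prefix: BLUEPRINT_NAME/DEPLOYMENT_NAME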

For more information about advanced Terraform remote state configuration, see the Cloud HPC Toolkit GitHub repository.
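
The following skeleton shows how these components fit together in a single blueprint file. It is a minimal sketch that reuses module sources from Example 1; the blueprint name, deployment name, output name, and bucket name are illustrative placeholders, not a production-ready configuration.

# A minimal skeleton blueprint with illustrative values.
blueprint_name: BLUEPRINT_NAME

vars:
  project_id:  ## Set GCP Project ID Here ##
  deployment_name: DEPLOYMENT_NAME
  region: us-central1
  zone: us-central1-a

terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/vpc
    kind: terraform  # optional; defaults to terraform when omitted

  - id: homefs
    source: modules/file-system/filestore
    use: [network1]  # links outputs from network1 to this module's settings
    settings:
      local_mount: /home
    outputs: [network_storage]  # example output name; check the module's documentation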

Example HPC blueprints

To get started, you can use one of the following example HPC blueprints.

  • Example 1: Deploys a basic HPC cluster with Slurm
  • Example 2: Deploys an HPC cluster with Slurm and a tiered filesystem

For a full list of example HPC blueprints, see the Cloud HPC Toolkit GitHub repository.

Example 1

Deploys a basic autoscaling cluster with Slurm that uses default settings. The blueprint also creates a new VPC network and a Filestore instance mounted at /home.

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

blueprint_name: hpc-slurm

vars:
  project_id:  ## Set GCP Project ID Here ##
  deployment_name: hpc-small
  region: us-central1
  zone: us-central1-a

# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md

deployment_groups:
- group: primary
  modules:
  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
  # as a prefix. To refer to a local module, prefix with ./, ../ or /
  # Example - ./modules/network/vpc
  - id: network1
    source: modules/network/vpc

  - id: homefs
    source: modules/file-system/filestore
    use: [network1]
    settings:
      local_mount: /home

  - id: debug_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 4
      machine_type: n2-standard-2

  - id: debug_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - homefs
    - debug_node_group
    settings:
      partition_name: debug
      exclusive: false # allows nodes to stay up after jobs are done
      enable_placement: false # the default is: true
      is_default: true

  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 20
      bandwidth_tier: gvnic_enabled

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - homefs
    - compute_node_group
    settings:
      partition_name: compute

  - id: h3_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 20
      machine_type: h3-standard-88
      # H3 does not support pd-ssd and pd-standard
      # https://cloud.google.com/compute/docs/compute-optimized-machines#h3_disks
      disk_type: pd-balanced
      bandwidth_tier: gvnic_enabled

  - id: h3_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - homefs
    - h3_node_group
    settings:
      partition_name: h3

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    use:
    - network1
    - debug_partition
    - compute_partition
    - h3_partition
    - homefs
    settings:
      disable_controller_public_ips: false

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    use:
    - network1
    - slurm_controller
    settings:
      machine_type: n2-standard-4
      disable_login_public_ips: false

Example 2

Deploys a cluster with Slurm that has tiered file systems for higher performance. It connects to the project's default Virtual Private Cloud network and creates seven partitions and a login node.

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

blueprint_name: hpc-enterprise-slurm

vars:
  project_id:  ## Set GCP Project ID Here ##
  deployment_name: hpc01
  region: us-central1
  zone: us-central1-a
  gpu_zones: [us-central1-a, us-central1-b, us-central1-c, us-central1-f]
  slurm_image:
    # Visit https://github.com/GoogleCloudPlatform/slurm-gcp/blob/master/docs/images.md#published-image-family
    # for a list of valid family options with Slurm
    family: slurm-gcp-5-10-hpc-centos-7
    project: schedmd-slurm-public
  # If image above is changed to use custom image, then setting below must be set to true
  instance_image_custom: false
  # Set to true for active cluster reconfiguration.
  # Note that setting this option requires additional dependencies to be installed locally.
  # https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/community/modules/scheduler/schedmd-slurm-gcp-v5-controller#description
  enable_reconfigure: true
  # When set, active compute nodes will be cleaned up on destroy.
  # Note that setting this option requires additional dependencies to be installed locally.
  enable_cleanup_compute: true

# Recommended to use GCS backend for Terraform state
# See https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/examples#optional-setting-up-a-remote-terraform-state
#
# terraform_backend_defaults:
#  type: gcs
#  configuration:
#    bucket: <<BUCKET_NAME>>

# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md

deployment_groups:
- group: primary
  modules:
  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
  # as a prefix. To refer to a local or community module, prefix with ./, ../ or /
  # Example - ./modules/network/vpc
  - id: network1
    source: modules/network/pre-existing-vpc

  - id: controller_sa
    source: community/modules/project/service-account
    settings:
      name: controller
      project_roles:
      - compute.instanceAdmin.v1
      - iam.serviceAccountUser
      - logging.logWriter
      - monitoring.metricWriter
      - pubsub.admin
      - storage.objectViewer

  - id: login_sa
    source: community/modules/project/service-account
    settings:
      name: login
      project_roles:
      - logging.logWriter
      - monitoring.metricWriter
      - storage.objectViewer

  - id: compute_sa
    source: community/modules/project/service-account
    settings:
      name: compute
      project_roles:
      - logging.logWriter
      - monitoring.metricWriter
      - storage.objectCreator

  - id: homefs
    source: modules/file-system/filestore
    use: [network1]
    settings:
      local_mount: /home

  - id: projectsfs
    source: modules/file-system/filestore
    use: [network1]
    settings:
      local_mount: /projects

  # This file system has an associated license cost.
  # https://console.developers.google.com/marketplace/product/ddnstorage/exascaler-cloud
  - id: scratchfs
    source: community/modules/file-system/DDN-EXAScaler
    use: [network1]
    settings:
      local_mount: /scratch

  - id: n2_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 4
      machine_type: n2-standard-2
      instance_image: $(vars.slurm_image)
      service_account:
        email: $(compute_sa.service_account_email)
        scopes:
        - https://www.googleapis.com/auth/cloud-platform

  - id: n2_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [n2_node_group, network1, homefs, projectsfs, scratchfs]
    settings:
      partition_name: n2
      exclusive: false  # allows nodes to stay up after jobs are done
      enable_placement: false  # the default is: true
      is_default: true
      partition_conf:
        SuspendTime: 300 # time (in secs) the nodes in this partition stay active after their tasks have completed

  - id: c2_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 20
      machine_type: c2-standard-60  # this is the default
      instance_image: $(vars.slurm_image)
      bandwidth_tier: tier_1_enabled
      disk_type: pd-ssd
      disk_size_gb: 100
      service_account:
        email: $(compute_sa.service_account_email)
        scopes:
        - https://www.googleapis.com/auth/cloud-platform

  # use `-p c2` to submit jobs to this partition:
  # ex: `srun -p c2 -N 1 hostname`
  - id: c2_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [c2_node_group, network1, homefs, projectsfs, scratchfs]
    settings:
      partition_name: c2
      # the following two are true by default
      exclusive: true  # this must be true if enable_placement is true
      enable_placement: true

  - id: c2d_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 20
      machine_type: c2d-standard-112
      instance_image: $(vars.slurm_image)
      bandwidth_tier: tier_1_enabled
      disk_type: pd-ssd
      disk_size_gb: 100
      service_account:
        email: $(compute_sa.service_account_email)
        scopes:
        - https://www.googleapis.com/auth/cloud-platform

  - id: c2d_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [c2d_node_group, network1, homefs, projectsfs, scratchfs]
    settings:
      partition_name: c2d

  - id: c3_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 20
      machine_type: c3-highcpu-176
      instance_image: $(vars.slurm_image)
      bandwidth_tier: tier_1_enabled
      disk_type: pd-ssd
      disk_size_gb: 100
      service_account:
        email: $(compute_sa.service_account_email)
        scopes:
        - https://www.googleapis.com/auth/cloud-platform

  - id: c3_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [c3_node_group, network1, homefs, projectsfs, scratchfs]
    settings:
      partition_name: c3

  - id: a2_8_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 16
      machine_type: a2-ultragpu-8g
      bandwidth_tier: gvnic_enabled
      instance_image: $(vars.slurm_image)
      disk_type: pd-ssd
      disk_size_gb: 100
      node_conf:
        Sockets: 2
        CoresPerSocket: 24
      service_account:
        email: $(compute_sa.service_account_email)
        scopes:
        - https://www.googleapis.com/auth/cloud-platform

  # use `-p a208` to submit jobs to this partition:
  # ex: `srun -p a208 --gpus-per-node=8 -N 1 nvidia-smi`
  - id: a2_8_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [a2_8_node_group, network1, homefs, projectsfs, scratchfs]
    settings:
      partition_name: a208
      # This makes this partition look for machines in any of the following zones
      # https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/develop/community/modules/compute/schedmd-slurm-gcp-v5-partition#compute-vm-zone-policies
      zones: $(vars.gpu_zones)
      # The following allows users to use more host memory without specifying cpus on a job
      partition_conf:
        DefMemPerGPU: 160000
        DefMemPerCPU: null

  - id: a2_16_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 16
      machine_type: a2-megagpu-16g
      bandwidth_tier: gvnic_enabled
      instance_image: $(vars.slurm_image)
      disk_type: pd-ssd
      disk_size_gb: 100
      node_conf:
        Sockets: 2
        CoresPerSocket: 24
      service_account:
        email: $(compute_sa.service_account_email)
        scopes:
        - https://www.googleapis.com/auth/cloud-platform

  # use `-p a216` to submit jobs to this partition:
  # ex: `srun -p a216 --gpus-per-node=16 -N 1 nvidia-smi`
  - id: a2_16_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [a2_16_node_group, network1, homefs, projectsfs, scratchfs]
    settings:
      partition_name: a216
      # This makes this partition look for machines in any of the following zones
      # https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/develop/community/modules/compute/schedmd-slurm-gcp-v5-partition#compute-vm-zone-policies
      zones: $(vars.gpu_zones)
      # The following allows users to use more host memory without specifying cpus on a job
      partition_conf:
        DefMemPerGPU: 160000
        DefMemPerCPU: null

  - id: h3_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 16
      machine_type: h3-standard-88
      bandwidth_tier: gvnic_enabled  # https://cloud.google.com/compute/docs/compute-optimized-machines#h3_network
      instance_image: $(vars.slurm_image)
      service_account:
        email: $(compute_sa.service_account_email)
        scopes:
        - https://www.googleapis.com/auth/cloud-platform
      # H3 does not support pd-ssd and pd-standard
      # https://cloud.google.com/compute/docs/compute-optimized-machines#h3_disks
      disk_type: pd-balanced
      disk_size_gb: 100

  # use `-p h3` to submit jobs to this partition:
  # ex: `srun -p h3  -N 1 hostname`
  - id: h3_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [h3_node_group, network1, homefs, projectsfs, scratchfs]
    settings:
      partition_name: h3

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    use: [network1, homefs, projectsfs, scratchfs, n2_partition,
          c2_partition, c2d_partition, c3_partition, a2_8_partition, a2_16_partition,
          h3_partition]
    settings:
      instance_image: $(vars.slurm_image)
      # the following allow for longer boot time
      # which is useful for large GPU nodes
      cloud_parameters:
        no_comma_params: false
        resume_rate: 0
        resume_timeout: 600
        suspend_rate: 0
        suspend_timeout: 600
      # we recommend disabling public IPs if possible
      # but that requires your network to have a NAT or
      # private access configured
      disable_controller_public_ips: false
      service_account:
        email: $(controller_sa.service_account_email)
        scopes:
        - https://www.googleapis.com/auth/cloud-platform

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    use:
    - network1
    - slurm_controller
    settings:
      instance_image: $(vars.slurm_image)
      machine_type: n2-standard-4
      disable_login_public_ips: false
      service_account:
        email: $(login_sa.service_account_email)
        scopes:
        - https://www.googleapis.com/auth/cloud-platform

  - id: hpc_dashboard
    source: modules/monitoring/dashboard
    outputs: [instructions]

Request additional quotas

You might need to request additional quota to be able to deploy and use your HPC cluster.

For example, by default the schedmd-slurm-gcp-v5-node-group module uses c2-standard-60 VMs, which have 60 vCPUs each, for compute nodes. The default quota for C2 vCPUs might be as low as 8, which isn't enough to start even a single node.
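
If you want to try a blueprint before a quota increase is approved, one option is to reduce what a node group requests. For example, the following sketch modifies the compute node group from Example 1 to use a smaller C2 machine type and fewer dynamic nodes; the values are illustrative:

  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 2    # fewer nodes that can be created on demand
      machine_type: c2-standard-4  # 4 vCPUs per node instead of the default c2-standard-60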

The required quotas depend on your HPC configuration. The minimum quotas for the provided example blueprints are documented on GitHub.

To view and increase quotas, see Managing your quota using the Google Cloud CLI.

What's next