Best practices for static compute nodes

This document provides best practices for configuring static compute nodes for optimal usability and performance of high performance computing (HPC) and artificial intelligence (AI) workloads.

Use a reservation

To make sure that Compute Engine resources are available when you need them, you can use reservations. Reservations provide a very high level of assurance in obtaining capacity for Compute Engine zonal resources.

For static clusters, we recommend that you use reservations because they provide guarantees about resource availability. The tradeoff is that reservations must be created manually, before you deploy a cluster. Also, after a reservation is created, you are billed for the reserved compute resources as if the VMs were in use, whether or not any VMs have actually been created using the reservation.
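Because you are billed for reserved capacity whether or not it is in use, it can be useful to check how much of a reservation your VMs are actually consuming. The following is a sketch using the gcloud compute reservations describe command; RESERVATION_NAME and ZONE are placeholders, and the field paths assume the current API response shape:

```shell
# Inspect a reservation's reserved capacity and current utilization.
# RESERVATION_NAME and ZONE are placeholders for your own values.
gcloud compute reservations describe RESERVATION_NAME \
    --zone=ZONE \
    --format="value(specificReservation.count, specificReservation.inUseCount)"
```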

Use compact placement policy

A compact placement policy specifies that your static compute nodes should be physically placed closer to each other, which reduces network latency between nodes.

Unlike autoscaling nodes, for which a new compact placement policy might be created for each job, the placement of static compute nodes is tied to the lifecycle of the nodeset, not to that of individual jobs.

However, it is important to note that the topology of static compute nodes might be subject to placement changes when nodes are restarted, recreated, or migrated. This can happen because of explicit actions like updating images, or because of maintenance events.
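To check whether a restart, recreation, or migration changed a node's physical placement, you can inspect the VM's physical host identifier. The following is a minimal sketch, assuming the instance exposes the resourceStatus.physicalHost field (reported for VMs in a compact placement policy); VM_NAME and ZONE are placeholders:

```shell
# Show the physical host identifier for a VM. VMs whose identifiers
# share a longer common prefix are physically closer to each other.
# VM_NAME and ZONE are placeholders for your own values.
gcloud compute instances describe VM_NAME \
    --zone=ZONE \
    --format="value(resourceStatus.physicalHost)"
```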

How to configure static compute nodes to use a reservation and compact placement policy

When using a reservation with a placement policy, the placement policy must be created prior to the reservation and specified during the creation of the reservation. When a reservation with placement is provided to Slurm, it automatically uses the placement policy attached to the reservation.

To configure static compute nodes to use a compact placement policy and reservation, complete the following steps by using the Google Cloud CLI:

  1. To create a compact placement policy, use the gcloud compute resource-policies create group-placement command with the --collocation=COLLOCATED flag.

    gcloud compute resource-policies create group-placement PLACEMENT_POLICY_NAME \
        --collocation=COLLOCATED \
        --project=PROJECT_ID \
        --region=REGION
    
  2. Use the gcloud compute reservations create command to create an example reservation for six VMs, and specify the compact placement policy that you created in the previous step.

    gcloud compute reservations create RESERVATION_NAME \
      --vm-count=6 \
      --machine-type=MACHINE_TYPE \
      --require-specific-reservation \
      --project=PROJECT_ID \
      --zone=ZONE \
      --resource-policies=compact-placement=PLACEMENT_POLICY_NAME
    
  3. Use the reservation for your static compute nodes. The enable_placement flag must be set to false, which indicates that Slurm doesn't create a placement policy; instead, placement comes from the reservation.

     - id: static_nodeset
       source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
       use: [network]
       settings:
         node_count_static: 6
         node_count_dynamic_max: 0
         enable_placement: false
         reservation_name: RESERVATION_NAME
         machine_type: MACHINE_TYPE

     - id: static_partition
       source: community/modules/compute/schedmd-slurm-gcp-v6-partition
       use: [static_nodeset]
       settings:
         partition_name: static
         exclusive: false

Replace the following:

  • RESERVATION_NAME: the name of your reservation.
  • MACHINE_TYPE: a machine type from the compute-optimized machine family. For example, c2-standard-60.
  • PLACEMENT_POLICY_NAME: the name of your placement policy.
  • PROJECT_ID: your project ID.
  • REGION: the region where your VMs are located.
  • ZONE: the zone where your VMs are located.
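After the cluster deploys, you can confirm from a Slurm login or controller node that the static nodes registered and are available. The following is a hedged example, assuming the partition name static from the blueprint above:

```shell
# List the static partition and its node states. Once healthy, all six
# static nodes should report an "idle" (or "alloc") state.
sinfo -p static
```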

Summary of best practices

The following is a summary of the recommended best practices for clusters that use static compute nodes.


Requirement              Recommendation
VM availability          Use a reservation
Reduce network latency   Use a compact placement policy

What's next