Overview of HPC cluster creation

To create the infrastructure for tightly-coupled applications that scale across multiple nodes, you can create a cluster of virtual machine (VM) instances. This guide provides a high-level overview of the key considerations and steps to configure a cluster of virtual machine (VM) instances for high performance computing (HPC) workloads using dense resource allocation.

Compute Engine provides a variety of ways to create a multitude of VM instances that are connected to the same network, but don't require dense resource allocation. For example, you can use any of the following methods, which are covered elsewhere in the Compute Engine documentation:

You can also create a script that includes the commands to create individual instances and associate compact placement policies with the instances to reduce latency.

With H4D (Preview), Compute Engine adds support for running massive HPC workloads by treating an entire cluster of VM instances as a single computer. Using topology-aware placement of VMs lets you access many instances within a single networking superblock and minimizes network latency. You can also configure Cloud RDMA on these instances to maximize inter-node communication performance, which is crucial for tightly-coupled HPC workloads.

You create these HPC VM clusters with H4D by reserving blocks of capacity instead of individual resources. Using blocks of capacity for your cluster provides the following additional features for deploying and managing these large-scale environments:

Topology-aware placement of VM instances
Advanced reservation systems to secure and manage capacity
Enhanced Managed Instance Groups (MIGs) designed for large, co-dependent sets of VM instances.
Advanced maintenance scheduling and controls which provide more control over when and how maintenance occurs on the VM instances, which is crucial for long-running, disruption-sensitive workloads. This includes features like customer-triggered maintenance and grouped maintenance for blocks of resources.

Cluster terminology

When working with blocks of capacity, the following terms are used:

Blocks

Multiple sub-blocks interconnect with a non-blocking fabric, providing a high-bandwidth interconnect. Any CPU within the block is reachable in a maximum of two network hops. The system exposes block and sub-block metadata to orchestrators to enable optimal job placement.

Clusters

Multiple blocks interconnect to form a cluster that scales to thousands of CPUs for running large-scale HPC workloads. Each cluster is globally unique. Communication across different blocks adds only one additional hop, maintaining high performance and predictability, even at a massive scale. Cluster-level metadata is also available to orchestrators for intelligent, large-scale job placement.

Dense deployment

A resource request that allocates your compute instance resources physically close to each other to minimize network hops and optimize for the lowest latency.

Network fabric

A network fabric provides high-bandwidth, low-latency connectivity across all blocks and Google Cloud services in a cluster. Jupiter is Google's data center network architecture that leverages software-defined networking and optical circuit switches to evolve the network and optimize its performance.

Node or host

A single physical server machine in the data center. Each host has its associated compute resources CPUs, memory, and network interfaces. The number and configuration of these compute resources depend on the machine family. VM instances are provisioned on top of a physical host.

Orchestrator

An orchestrator automates the management of your clusters. With an orchestrator, you don't have to manage each VM instance in the cluster. An orchestrator, such as Slurm or Google Kubernetes Engine (GKE), handles tasks like job queueing, resource allocation, auto scaling (with GKE), and other day-to-day cluster management tasks.

Sub-blocks

These are foundational units where a group of hosts physically co-locates on a single rack. A Top-of-Rack (ToR) switch connects these hosts, enabling extremely efficient, single-hop communication between any two CPUs within the sub-block. Cloud RDMA facilitates this direct communication.

Overview of cluster creation process with H4D VMs

To create HPC clusters on reserved blocks of capacity, you must complete the following steps:

Choose a consumption option and obtain capacity
Choose a deployment option and orchestrator
Choose the operating system or cluster image
Create your cluster

Choose a consumption option and obtain capacity

Consumption options determine how resources are obtained for your cluster. To create a cluster with enhanced management features, you must request blocks of capacity for a dense deployment.

The following table summarizes the key differences between the consumption options for blocks of capacity:

Consumption option	Future reservations for capacity blocks	Flex-start
Workload characteristics	Long-running, large-scale distributed workloads that require densely allocated resources	Short-duration workloads that require densely allocated resources
Lifespan	Any time	Up to 7 days
Preemptible	No	No
Quota	Check that you have enough quota before creating instances.	Preemptible quota is charged.
Pricing	See pricing for VMs. If you reserve resources for a year or longer, then you must purchase and attach a resource-based commitment to your reserved resources. You're charged for the reservation period. See reservations billing.	See Dynamic Workload Scheduler pricing. You pay as you go (PAYG).
Resource allocation	Dense	Dense
Provisioning model	Reservation-bound	Flex-start
Creation method	To create HPC clusters and VMs, you must do the following: Reserve capacity by contacting your account team. At your chosen date and time, you can use the reserved capacity to create HPC clusters. See Choose a deployment option.	To create VMs, select one of the following options: Create a standalone Flex-start VMs. Create Flex-start VMs all at once by using MIG resize requests. Use GKE to run high performance computing (HPC) workloads with H4D. When your requested capacity becomes available, Compute Engine provisions it.

Choose a deployment option

Depending on the level of control that you need over your cluster deployment, choose between a highly managed or a less managed deployment that gives you more control over your infrastructure. Some of the available deployment options include the installation and configuration of an orchestrator for enhanced management of the HPC cluster.

Highly managed

If you want Google to deploy and set up your infrastructure, then use Cluster Toolkit or Google Kubernetes Engine (GKE).

Cluster Toolkit: an open source tool offered by Google that simplifies the configuration and deployment for clusters that use either Slurm or GKE. You use predefined blueprints to build a deployment folder that is based on the blueprint. You can modify blueprints or the deployment folder to customize deployments and your software stack. You then use Terraform or Packer to run the commands generated by Cluster Toolkit to deploy the cluster.

For a tutorial on this deployment method, see Create an RDMA-enabled HPC Slurm cluster.
GKE: a managed Kubernetes service and open source container orchestration platform. GKE offers features like auto scaling and high availability. These features make GKE a good fit for deploying and managing HPC workloads, including its ability to orchestrate containerized applications, support of specialized hardware, and compatibility with the Google Cloud ecosystem. You can deploy GKE clusters by using GKE directly or by using Cluster Toolkit. You can choose between Google Kubernetes Engine (GKE) Standard edition or Autopilot.

For a more information about this deployment method, see Run HPC workloads with H4D.

Less managed, more control

For more granular control over your clusters and the software installed on them, create a Compute Engine cluster by either of the following methods:

You can use these methods if you want to use an orchestrator other than Slurm or GKE. After the VMs are created, you manually install any key software that your HPC workload needs on the VMs. You can also use a startup script to automatically check the software installation and install the software if needed when the VM starts.

Choose the operating system image

The operating system (OS) image you choose depends on the service you use to deploy your cluster.

For clusters on GKE: Use a GKE node image, such as Container-Optimized OS. If you use Cluster Toolkit to deploy your GKE cluster, a Container-Optimized OS image is used by default. For more information about node images, see Node images in the GKE documentation.
For clusters on Compute Engine: You can use one of the following images:
- HPC VM image: A Rocky Linux 8 image that is optimized for tightly-coupled HPC workloads.
- OS image provided by Google Cloud: OS images that support H4D. You will need to configure these for your HPC workloads.
- Custom images: You can create and use your own custom images. To include HPC-specific optimizations, we recommend that you create a custom image using the HPC VM image.
For Slurm Clusters: Cluster Toolkit deploys the Slurm Cluster with a HPC VM image based on Rocky Linux 8 that is optimized for tightly-coupled HPC workloads.

Create your HPC cluster

After you review the cluster creation process and make preliminary decisions for your workload, create your cluster by using one of the following options:

Create an RDMA-enabled HPC Slurm cluster
Create a GKE cluster: Run high performance computing (HPC) workloads with H4D
Create a cluster with Compute Engine:

What's next

Learn more about Cluster Toolkit.
Learn more about creating an instance that uses RDMA.