To create the infrastructure for tightly-coupled applications that scale across multiple nodes, you can create a cluster of virtual machine (VM) instances. This guide provides a high-level overview of the key considerations and steps for configuring such a cluster for high performance computing (HPC) workloads using dense resource allocation.
Compute Engine provides several ways to create large numbers of VM instances that are connected to the same network but don't require dense resource allocation. For example, you can use any of the following methods, which are covered elsewhere in the Compute Engine documentation:
You can also create a script that includes the commands to create individual instances and associate compact placement policies with the instances to reduce latency.
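As a rough illustration of such a script, the following Python sketch builds the `gcloud` commands for one compact placement policy and a set of instances that use it. The instance prefix, policy naming scheme, zone, and machine type are placeholder assumptions, and the sketch only prints the commands for review rather than running them. Your machine series might need additional flags, and it must support compact placement policies.

```python
import shlex


def region_of(zone: str) -> str:
    """Derive the region from a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    return zone.rsplit("-", 1)[0]


def build_cluster_commands(prefix, count, zone, machine_type):
    """Return gcloud argument lists: one policy command, then one per instance."""
    policy = f"{prefix}-placement"  # hypothetical naming scheme
    commands = [[
        "gcloud", "compute", "resource-policies", "create", "group-placement",
        policy,
        "--collocation=collocated",
        f"--region={region_of(zone)}",
    ]]
    for index in range(count):
        commands.append([
            "gcloud", "compute", "instances", "create", f"{prefix}-{index}",
            f"--zone={zone}",
            f"--machine-type={machine_type}",
            f"--resource-policies={policy}",
        ])
    return commands


if __name__ == "__main__":
    # Placeholder values; print the commands for review instead of running them.
    for command in build_cluster_commands(
        "hpc-node", 4, "us-central1-a", "c2-standard-60"
    ):
        print(shlex.join(command))
```

The placement policy is created first because each instance command refers to it by name, and the policy's region is derived from the zone so that the two stay consistent.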
With H4D (Preview), Compute Engine adds support for running massive HPC workloads by treating an entire cluster of VM instances as a single computer. Using topology-aware placement of VMs lets you access many instances within a single networking superblock and minimizes network latency. You can also configure Cloud RDMA on these instances to maximize inter-node communication performance, which is crucial for tightly-coupled HPC workloads.
You create these HPC VM clusters with H4D by reserving blocks of capacity instead of individual resources. Using blocks of capacity for your cluster provides the following additional features for deploying and managing these large-scale environments:
- Topology-aware placement of VM instances
- Advanced reservation systems to secure and manage capacity
- Enhanced Managed Instance Groups (MIGs) designed for large, co-dependent sets of VM instances
- Advanced maintenance scheduling and controls, such as customer-triggered maintenance and grouped maintenance for blocks of resources. These features give you more control over when and how maintenance occurs on the VM instances, which is crucial for long-running, disruption-sensitive workloads.
Cluster terminology
When working with blocks of capacity, the following terms are used:
Overview of cluster creation process with H4D VMs
To create HPC clusters on reserved blocks of capacity, you must complete the following steps:
- Choose a consumption option and obtain capacity
- Choose a deployment option and orchestrator
- Choose the operating system or cluster image
- Create your cluster
Choose a consumption option and obtain capacity
Consumption options determine how resources are obtained for your cluster. To create a cluster with enhanced management features, you must request blocks of capacity for a dense deployment.
The following table summarizes the key differences between the consumption options for blocks of capacity:
Consumption option | Future reservations for capacity blocks | Flex-start (Preview) |
---|---|---|
Workload characteristics | Long-running, large-scale distributed workloads that require densely allocated resources | Short-duration workloads that require densely allocated resources |
Lifespan | Any time | Up to 7 days |
Preemptible | No | No |
Quota | Check that you have enough quota before creating instances. | Preemptible quota is charged. |
Pricing | | |
Resource allocation | Dense | Dense |
Provisioning model | Reservation-bound | Flex-start (Preview) |
Creation method | To create HPC clusters and VMs, you must do the following: | To create VMs, select one of the following options: When your requested capacity becomes available, Compute Engine provisions it. |
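The lifespan row of the table above can be turned into a simple first-pass check. The following Python sketch is only a rule of thumb that looks at expected runtime and nothing else. The function name and the returned labels are illustrative assumptions, and a real decision would also weigh quota, pricing, and capacity availability.

```python
# "Up to 7 days" is the Flex-start lifespan listed in the table above.
FLEX_START_MAX_DAYS = 7


def suggest_consumption_option(expected_runtime_days: float) -> str:
    """Suggest a consumption option based only on the expected runtime."""
    if expected_runtime_days <= 0:
        raise ValueError("expected_runtime_days must be positive")
    if expected_runtime_days <= FLEX_START_MAX_DAYS:
        # Short-duration workloads fit within the Flex-start lifespan.
        return "Flex-start"
    # Longer workloads exceed the Flex-start lifespan.
    return "Future reservations for capacity blocks"


if __name__ == "__main__":
    for days in (2, 7, 30):
        print(days, "days:", suggest_consumption_option(days))
```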
Choose a deployment option
Depending on the level of control that you need over your cluster deployment, choose between a highly managed or a less managed deployment that gives you more control over your infrastructure. Some of the available deployment options include the installation and configuration of an orchestrator for enhanced management of the HPC cluster.
- Highly managed
If you want Google to deploy and set up your infrastructure, then use Cluster Toolkit or Google Kubernetes Engine (GKE).
Cluster Toolkit: an open source tool offered by Google that simplifies the configuration and deployment of clusters that use either Slurm or GKE. You use a predefined blueprint to build a deployment folder. You can modify the blueprint or the deployment folder to customize the deployment and your software stack. You then use Terraform or Packer to run the commands generated by Cluster Toolkit to deploy the cluster.
For a tutorial on this deployment method, see Create an RDMA-enabled HPC Slurm cluster.
GKE: a managed Kubernetes service and open source container orchestration platform. GKE offers features like autoscaling and high availability. These features, together with its ability to orchestrate containerized applications, its support for specialized hardware, and its compatibility with the Google Cloud ecosystem, make GKE a good fit for deploying and managing HPC workloads. You can deploy GKE clusters by using GKE directly or by using Cluster Toolkit. You can choose between Google Kubernetes Engine (GKE) Standard edition and Autopilot.
For more information about this deployment method, see Run HPC workloads with H4D.
- Less managed, more control
For more granular control over your clusters and the software installed on them, create a Compute Engine cluster by either of the following methods:
You can use these methods if you want to use an orchestrator other than Slurm or GKE. After the VMs are created, you manually install any key software that your HPC workload needs on the VMs. You can also use a startup script to automatically check the software installation and install the software if needed when the VM starts.
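The startup check described above could look something like the following Python sketch. It assumes a Rocky Linux 8 based image, where `dnf` is the package manager. The command-to-package mapping is a hypothetical example, not a list of what your workload needs. The sketch only prints the install command, and a real startup script would run it, for example with `subprocess.run`.

```python
import shutil


def missing_commands(required, which=shutil.which):
    """Return the commands from `required` that are not found on PATH."""
    return [cmd for cmd in required if which(cmd) is None]


def install_plan(required, packages, which=shutil.which):
    """Return a dnf command that installs missing software, or None if all present."""
    missing = missing_commands(required, which)
    if not missing:
        return None
    # Deduplicate and sort package names so the command is stable across runs.
    names = sorted({packages[cmd] for cmd in missing})
    return ["dnf", "install", "-y", *names]


if __name__ == "__main__":
    # Hypothetical mapping of required commands to the packages that provide them.
    PACKAGES = {"numactl": "numactl", "hwloc-ls": "hwloc"}
    plan = install_plan(list(PACKAGES), PACKAGES)
    if plan is None:
        print("All required software is already installed.")
    else:
        print("Would run:", " ".join(plan))
```

Passing the lookup function as the `which` parameter keeps the check easy to exercise without touching the VM, because a stand-in function can simulate which commands are present.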
Choose the operating system image
The operating system (OS) image you choose depends on the service you use to deploy your cluster.
For clusters on GKE: Use a GKE node image, such as Container-Optimized OS. If you use Cluster Toolkit to deploy your GKE cluster, a Container-Optimized OS image is used by default. For more information about node images, see Node images in the GKE documentation.
For clusters on Compute Engine: You can use one of the following images:
- HPC VM image: A Rocky Linux 8 image that is optimized for tightly-coupled HPC workloads.
- OS image provided by Google Cloud: OS images that support H4D. You must configure these images for your HPC workloads.
- Custom images: You can create and use your own custom images. To include HPC-specific optimizations, we recommend that you create a custom image using the HPC VM image.
For Slurm clusters: Cluster Toolkit deploys the Slurm cluster with an HPC VM image based on Rocky Linux 8 that is optimized for tightly-coupled HPC workloads.
Create your HPC cluster
After you review the cluster creation process and make preliminary decisions for your workload, create your cluster by using one of the following options:
- Create an RDMA-enabled HPC Slurm cluster
- Create a GKE cluster: Run high performance computing (HPC) workloads with H4D
- Create a cluster with Compute Engine:
What's next
- Learn more about Cluster Toolkit.
- Learn more about creating an instance that uses RDMA.