Using clusters for large-scale technical computing in the cloud

Last reviewed 2022-12-20 UTC

This document provides information about how to perform large-scale technical computing on Google Cloud. Many technical computing applications require large numbers of individual compute nodes that are connected together into a cluster. These applications must then coordinate computation and data access across the nodes. High performance computing (HPC) on Google Cloud offers flexible, scalable resources that are built to handle these demanding workloads.

The concepts and technologies underlying cluster computing have developed over the past decades, and are now mature and mainstream. Migrating a software stack to Google Cloud offers many opportunities to decrease cost and alleviate bottlenecks in today's HPC environments. This guide provides an overview of the technologies, the challenges, and the current solutions available for running HPC clusters on Google Cloud.

Cluster computing aggregates and coordinates a collection of virtual machines (VMs) to work together to perform a task. Clusters typically have a single head node (sometimes called a controller node), a number of compute nodes, and possibly some specialty nodes. The head node manages many tasks, including the following:

  • Registering compute nodes into the system.
  • Allocating jobs to particular nodes.
  • Monitoring the nodes and the jobs running on them.

Architecture

The following diagram shows the architecture of a simple cluster:

A cluster is composed of a head node and a set of compute nodes.

The diagram shows a cluster that is composed of a head node and a set of compute nodes. Users interact with the head node, which then coordinates work out to the compute nodes.

Users submit jobs, which can be composed of many tasks; a task is the basic unit of work. The following are common structures for tasks within a job:

  • Jobs that require tasks to be run concurrently, sharing data across tasks using a parallel algorithm.
  • Jobs that have a complex set of tasks, with inter-task dependencies such that some tasks must be run before others.
  • Tasks that require hardware configurations (such as extra memory, GPUs, or other specific configurations) to meet the needs of the application.

In the most basic terms, tasks are executables that can read input data from storage, process the data to produce a result, and then write the final results back to storage.
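
For illustration, the following is a minimal Python sketch of such a task, using the google-cloud-storage client library. The bucket and object names are hypothetical placeholders, and the "processing" step is a stand-in for a real solver or analysis.

```python
# Minimal sketch of a single task: read input from Cloud Storage, process
# it, and write the result back. The bucket and object names are
# hypothetical placeholders.
from google.cloud import storage


def run_task(bucket_name: str, input_object: str, output_object: str) -> None:
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Stage the input data to local disk.
    bucket.blob(input_object).download_to_filename("/tmp/input.dat")

    # Process the data (here, compute a simple mean of numeric lines as a
    # stand-in for a real solver or analysis step).
    with open("/tmp/input.dat") as f:
        values = [float(line) for line in f if line.strip()]
    mean = sum(values) / len(values)

    # Write the final result back to storage.
    with open("/tmp/result.txt", "w") as f:
        f.write(f"{mean}\n")
    bucket.blob(output_object).upload_from_filename("/tmp/result.txt")


if __name__ == "__main__":
    run_task("my-hpc-bucket", "inputs/sample.dat", "outputs/sample-result.txt")
```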

The following are the main types of workloads that run on HPC clusters:

  • Tightly coupled workloads: These workloads employ many compute nodes that work together, each running the same executable and sharing data across nodes. An HPC cluster that is designed for tightly coupled workloads needs low-latency networking for fast internode communication; otherwise, slow inter-process communication and data sharing cause poor performance. Weather modeling, computational fluid dynamics (CFD), and structural modeling are examples of tightly coupled workloads. The sketch after this list illustrates the internode communication pattern.
  • Loosely coupled workloads: Sometimes referred to as embarrassingly parallel workloads, loosely coupled workloads run many tasks that execute independently of each other, with little or no synchronous communication between compute nodes. In some cases, loosely coupled workloads involve sequential dependencies, where one class of task must wait for input from another class of task. Typical examples include media rendering, financial analysis, genomics, and particle-physics event simulation and processing.
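
To make the tightly coupled pattern concrete, the following is a minimal sketch of internode communication using MPI through the mpi4py library. The loop body and the values being reduced are stand-ins for a real solver; the launch command depends on your MPI distribution.

```python
# Minimal sketch of a tightly coupled workload: every rank computes a
# local quantity, and the ranks combine their results with a collective
# operation at each step. The loop body is a stand-in for a real solver.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

local_value = float(rank + 1)  # placeholder for a locally computed quantity

for step in range(10):
    # Collective communication: every rank contributes its value and
    # receives the global sum.
    global_sum = comm.allreduce(local_value, op=MPI.SUM)
    local_value = global_sum / size  # placeholder update step

if rank == 0:
    print(f"Value after 10 coupled steps: {local_value}")
```

On a cluster, you would launch one copy of this program per core or per node with your MPI launcher, for example `mpirun -np 16 python coupled.py`, and the MPI runtime or scheduler would place the ranks across nodes.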

Opportunities for cluster computing in the cloud

There are many reasons to run compute clusters in the cloud, including the following:

  • Time-to-solution: Launching a production-quality cluster in the cloud takes only a few minutes, whether you're launching a small 10-node cluster with hundreds of available cores or large-scale clusters with 100,000 or more cores. In contrast, building new clusters on-premises can take months to be ready for operation. Even when on-premises clusters are available, they typically have high utilization and long queue wait times—sometimes hours or days—before jobs are scheduled to run. Instead, you can build your own clusters in the cloud, use them for your workloads, and terminate the clusters when your analysis is complete.
  • Lower total cost of ownership: Google Cloud not only reduces time to solution, but can also reduce the total cost per run by offering dynamic, as-needed scaling, Spot VMs, and committed use discounts. You can add the exact resources that you need for a job, and then remove them when they aren't needed.
  • Task-tailored resources: Because the compute cost of a job depends only on the total core hours rather than on the number of instances, running clusters in the cloud allows each team or group to have its own dedicated cluster, and each dedicated cloud cluster can be customized for the target application. In contrast, on-premises clusters tend to consist of a one-size-fits-all resource that is shared across the various groups and applications; in such an environment, policies for sharing across the groups tend to be complex to set up and maintain. Dedicated cloud clusters alleviate this difficulty by removing the need for multi-group sharing policies.
  • Support for collaboration: In many situations, an application is developed in collaboration with different people across multiple organizations. Google Cloud provides project-level Identity and Access Management (IAM) to allow controlled access to data and analytic tools. Authorized users can access the same applications, data, and clusters to ensure that everyone is coordinated without having to copy data, manage versions, or sync cluster configurations.
  • Integration: Before they can run large compute jobs, researchers do significant work to prepare datasets to be processed. After jobs have run, researchers spend considerable time to analyze the output of these jobs. After moving jobs to the cloud, these researchers can leverage the big data tools that are available in Google Cloud, such as BigQuery and Vertex AI Workbench, which provide advantages over on-premises systems.

Cluster computing software stack

A cluster computing software stack consists of the following:

  • Workload managers or schedulers (such as Slurm, PBS, or IBM's LSF) to orchestrate job execution.
  • Cluster configuration tools:
    • Cluster management tools (such as Managed Instance Groups or Kubernetes) to orchestrate compute nodes.
    • DevOps tools (such as Terraform) to provision and build clusters.
  • End-user applications (such as OpenFOAM, GROMACS, WRF, or Jupyter Notebooks) to execute computations and view and analyze output.

The following sections discuss cluster configuration tools and job schedulers.

Cluster configuration tools

There are many ways to approach cluster configuration. Commonly used HPC tools can usually be employed on Google Cloud without much modification. If you use configuration management systems, such as Chef, Puppet, or Terraform, to manage your HPC clusters, you can leverage those investments as you migrate to Google Cloud by using the available tools and plugins.

Cloud HPC Toolkit

Cloud HPC Toolkit is an open source tool that lets you create repeatable, turnkey HPC clusters based on proven best practices, so you can deploy a working HPC environment within minutes.

Cloud HPC Toolkit features a modular design that enables composable HPC environments. This design makes it easy to define and deploy both simple and advanced HPC environments. An HPC blueprint defines the infrastructure and software configuration of an HPC environment in a high-level YAML-formatted file that composes Terraform modules, Packer templates, and Ansible playbooks. You can create a cluster from an existing blueprint or modify a blueprint to fit your needs. With a few lines of text in the blueprint, you can change the configuration to provision the infrastructure and industry-specific tools that are required for the job.

Job schedulers

After the cluster is operational, the software that manages the task execution and node allocation is called a job scheduler (sometimes called a workload manager or queue manager). Often, a job scheduler comes with a built-in cluster manager. Job schedulers provide various capabilities to help manage jobs and tasks, such as the following:

  • Support for job priorities across users and groups, which helps to provide policy-based job scheduling.
  • Consideration of task dependencies and resource needs for task allocation (see the dependency sketch after this list).
  • Scaling the cluster size depending on the number of jobs in the queue.
  • Support for detecting and handling failed tasks, either by rescheduling them automatically or through manual intervention.
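
As an illustration of the dependency handling mentioned in the list above, the following sketch submits two Slurm jobs from Python, where the second job starts only after the first completes successfully. The batch script names are hypothetical placeholders; the sbatch --parsable and --dependency flags are standard Slurm options.

```python
# Sketch: submit a post-processing Slurm job that runs only after the main
# simulation job completes successfully. The batch scripts simulate.sh and
# postprocess.sh are hypothetical placeholders.
import subprocess


def submit(args: list[str]) -> str:
    """Run sbatch with --parsable so that it prints only the job ID."""
    result = subprocess.run(
        ["sbatch", "--parsable"] + args,
        check=True, capture_output=True, text=True,
    )
    # --parsable prints "jobid" or "jobid;cluster"; keep the job ID only.
    return result.stdout.strip().split(";")[0]


sim_job_id = submit(["simulate.sh"])
post_job_id = submit([f"--dependency=afterok:{sim_job_id}", "postprocess.sh"])
print(f"Submitted job {sim_job_id}, then dependent job {post_job_id}")
```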

Google Cloud supports various popular commercial and open source workload managers. Examples include Slurm from SchedMD, Batch from Google Cloud, HTCondor from the University of Wisconsin, PBS Pro, and IBM Spectrum LSF. Each of these workload managers has its strengths, and Cloud HPC Toolkit provides blueprints for all of them.

  • Slurm provides a more traditional HPC cluster environment, supporting both high-throughput and high-performance parallel applications. SchedMD, the commercial backer and maintainer of the Slurm open source project, leads the development. Slurm supports shared file systems across the cluster for simple storage access. An NFS-mounted file system for Slurm is sufficient for clusters of a few hundred nodes, but as the cluster size increases, the load on the file server can negatively affect performance. Parallel and distributed file systems, described in the "Storage" section of this document, can help with this problem when running I/O intensive workloads or using large clusters. Alternatively, where low-latency file access is not required, consider using Cloud Storage. For more information about using Slurm with Google Cloud, see Deploy an HPC cluster with Slurm.
  • Batch from Google Cloud is a fully managed service that lets you schedule, queue, and execute batch processing workloads on Compute Engine VM instances. Batch provisions resources and manages capacity on your behalf, allowing your batch workloads to run at scale. When you use Batch, you don't need to configure and manage third-party job schedulers, provision and deprovision resources, or request resources one zone at a time. To run a job, you specify parameters for the resources that are required for your workload, and then Batch obtains resources and queues the job for execution. Batch provides built-in integration with other Google Cloud services to help schedule, execute, store, and analyze batch jobs, so that you can focus on submitting a job and consuming the results. For more information, see the Batch documentation and learn how to run a Batch job using Workflows. Cloud HPC Toolkit also provides a batch blueprint.

  • HTCondor is used across shared resources to opportunistically schedule jobs on otherwise idle resources. It provides its own data movement, so it doesn't require shared file systems. Therefore, HTCondor can scale to hundreds of thousands of cores and you can distribute it across multiple Google Cloud zones and regions. HTCondor has been used for hybrid workloads, where the work is shared or divided between on-premises and cloud-based systems. However, as its name implies, it's focused on high-throughput jobs, and not on tightly coupled, parallel jobs. To start using HTCondor with Google Cloud, see the Google Cloud HTCondor integrations.

Storage

Most HPC applications require a file storage solution that supports the POSIX API. Google Cloud has built-in and third-party storage solutions to cover the full spectrum of storage needs, spanning object, block, and file storage.

Where low-latency file access isn't required, we recommend Cloud Storage. It provides parallel object access through an API or through gcsfuse if POSIX API compatibility is required.
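
As a sketch of parallel object access through the API, the following example downloads every object under a prefix concurrently with a thread pool. The bucket name and prefix are hypothetical placeholders, and a real workload would tune the worker count to the VM's network bandwidth.

```python
# Sketch: download every object under a prefix in parallel with a thread
# pool. The bucket name and prefix are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-hpc-bucket")
local_dir = Path("/tmp/inputs")
local_dir.mkdir(parents=True, exist_ok=True)

# Skip zero-byte "directory" placeholder objects that end with a slash.
blobs = [b for b in client.list_blobs(bucket, prefix="inputs/")
         if not b.name.endswith("/")]


def fetch(blob: storage.Blob) -> None:
    blob.download_to_filename(str(local_dir / Path(blob.name).name))


with ThreadPoolExecutor(max_workers=16) as pool:
    # Consume the iterator so that any download errors are raised here.
    list(pool.map(fetch, blobs))
```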

For file storage for small to mid-sized applications, we recommend NFS-based or SMB-based solutions such as Filestore, NetApp, and Dell PowerScale. For applications that demand even higher performance, we recommend that you consider a parallel file system such as DAOS, Lustre, IBM Spectrum Scale, or Weka.

  • Filestore provides a fully managed, NFS-based file storage service from Google for low- to medium-performance workloads. Filestore Basic delivers fixed performance regardless of storage capacity. Filestore Zonal increases performance with capacity, scaling storage to hundreds of TBs and throughput to tens of GBps. As a Google Cloud service, Filestore can be deployed by using Terraform, the Google Cloud API, the Google Cloud CLI, or the Google Cloud console.
  • NetApp Cloud Volumes delivers advanced hybrid cloud data services that simplify how you migrate and run enterprise workloads in the cloud. NetApp offers various solutions for primary storage, data management, and in-cloud caching of on-premises data using FlexCache. Cloud Volumes ONTAP delivers NFS, SMB, SnapMirror, and the rest of the full ONTAP feature set in Google Cloud. Cloud Volumes ONTAP can be deployed by using Terraform or the Google Cloud console.
  • Dell PowerScale for Google Cloud is an integrated file service for Google Cloud users that is powered by Dell PowerScale. This turnkey offering, managed by Dell Technologies Services, combines the at-scale performance and capacity of PowerScale OneFS with the flexibility and cost economics of Google Cloud. You can deploy Dell PowerScale from the Google Cloud console.
  • DAOS is an open source, software-defined, scale-out HPC storage system that is optimized for high IOPS, high metadata operation rates, and high bandwidth. DAOS focuses on scratch HPC storage use cases, letting users dynamically adjust the durability level for each dataset. It supports a POSIX interface and several HPC and AI/ML interfaces such as MPI-IO, HDF5, and TensorFlow. For more information, see the solution for DAOS with Terraform and Cloud HPC Toolkit.
  • The Lustre parallel file system is available as open source software or with full enterprise support from our partner DDN Storage. The enterprise version of Lustre, DDN EXAScaler, demonstrated over 10 Tbps on Google Cloud for its IO500 entry in 2021. EXAScaler can be deployed by using Terraform. For more information, see EXAScaler Cloud in the Google Cloud Marketplace.
  • IBM Spectrum Scale (formerly GPFS) is available with full support from our partner Sycomp in Google Cloud Marketplace. The full Spectrum Scale feature set is available, delivering extreme performance for both persistent and scratch use cases. In addition, IBM Spectrum Scale can dynamically tier data to and from existing Cloud Storage buckets, enabling a cost-effective cache tier that accelerates access to your Cloud Storage data. IBM Spectrum Scale can be deployed by using Terraform.
  • WEKA is a high-performance, file-based storage solution. WEKA combines the performance tier of NVMe-enabled compute instances in Google Cloud with the capacity tier of Cloud Storage in a single namespace. WEKA can be deployed by using Terraform.

Architectural considerations

Typically, on-premises clusters are shared across users and groups, and they support many different application needs. In contrast, when you move to Google Cloud, you can customize the cluster properties to match your application's needs, which reduces cost and increases performance. These architectural differences are shown in the following diagram:

An on-premises data center cluster compared to a cloud cluster.

The diagram shows the distinction between on-premises and cloud clusters: for on-premises systems, all infrastructure is shared. In a cloud context, each group of users can create a cluster suited to their needs.

The following list describes technical challenges that might hamper HPC migration projects. For information about best practices, see the Migration to Google Cloud series.

  • Data movement. The datasets that are processed by a cluster's compute nodes typically must be staged to the cloud before the jobs are run. Managing the data movement can be complex, depending on the volume of the data and how it's managed.
  • Data access. Many HPC applications require shared access to a set of files and directories. The way this access is provided can significantly affect application performance. You can leverage shared data stored in Cloud Storage, in NFS servers such as Filestore, or by using parallel file systems, as discussed in the "Storage" section earlier in this document.
  • Security. For data that is sensitive, you must ensure that access is always authorized and that data is appropriately encrypted at rest and in transit. Cloud Storage encrypts data at rest and in transit, and you can apply an additional layer of control by managing keys in Cloud Key Management Service or by supplying your own. The keys must be retrieved or installed on the compute nodes before the jobs are run.
  • Inter-node latency. For tightly coupled HPC applications, performance might be sensitive to the inter-node latency within the cluster. Because Google Cloud provides nodes with sizes up to 416 cores, you can run highly parallel jobs without traversing nodes. In most cases, jobs of around 1,000 cores or smaller perform reasonably well on non-specialized network hardware. A simple latency measurement is sketched after this list.
  • Software license management. Many commercial applications require a license server, sometimes known as a key server. Some applications come with a built-in or recommended license server and others might be compatible with existing license-server offerings. Some job schedulers can help with license management and stop jobs from running until a license is available.
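
To gauge the inter-node latency mentioned in the list above, a simple ping-pong measurement with mpi4py, launched with one rank per node, gives a rough round-trip number. This is an illustrative sketch only, not a substitute for an established benchmark suite.

```python
# Sketch: measure average round-trip latency between rank 0 and rank 1.
# Launch with one rank on each of two nodes, for example:
#   mpirun -np 2 --map-by node python pingpong.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iterations = 1000
payload = b"x"  # small message so that latency, not bandwidth, dominates

comm.Barrier()
start = MPI.Wtime()
for _ in range(iterations):
    if rank == 0:
        comm.send(payload, dest=1, tag=0)
        comm.recv(source=1, tag=1)
    elif rank == 1:
        comm.recv(source=0, tag=0)
        comm.send(payload, dest=0, tag=1)
elapsed = MPI.Wtime() - start

if rank == 0:
    print(f"Average round-trip latency: {elapsed / iterations * 1e6:.1f} us")
```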

Technical computing provides many tools and approaches for different circumstances. With so many options, you might find it difficult to get started. Regardless of the choice of cluster management software and scheduler, there are a number of best practices you can follow when running on Google Cloud.

  • Leverage Spot VMs whenever possible. Spot VMs are just like regular VMs on Compute Engine, but priced up to 80% less than regular VMs. The risk is that Spot VMs can be preempted with little notice. For high-throughput workloads, your job scheduler detects the loss of a reclaimed node, treats it as a node failure, and normally reschedules any tasks that were running on the preempted node onto a different resource. Although any work done on reclaimed nodes might be lost, the probability of node loss is low enough (an expected loss rate of 5% to 15%) that the lower price is usually worth the risk. Unlike Preemptible VMs, Spot VMs aren't limited to a maximum of 24 hours of use before being reclaimed. A sketch of requesting a Spot VM programmatically follows this list.
  • Leverage the cost and bandwidth of Cloud Storage instead of running your own parallel file system. Cloud Storage provides strong consistency and scalable parallel performance with low overall cost. Although time to first byte is typically tens of milliseconds, applications that can leverage Cloud Storage instead of running a parallel file server on Compute Engine are cost effective. The available bandwidth between Cloud Storage and the compute nodes is sufficient for many applications; some customers have reported sustained aggregate bandwidth of over 23 GBps.
  • Build a single-application or single-group cluster. Traditional clusters are shared across multiple users, groups and applications, which can result in resource contention and inefficient resource usage by applications. On Google Cloud, consider creating a cluster for each group or project, and consider using clusters that are optimized for particular applications that run on them. Whether you run one cluster for two hours, or two clusters for one hour each, the overall cost is the same, but the latter pattern can reduce queue-wait times, and improve application performance.
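
As a sketch of the Spot VM practice referenced in the list above, the following example requests a single Spot VM through the Compute Engine client library (google-cloud-compute). The project, zone, machine type, and image are hypothetical placeholders; in practice, an HPC cluster would usually acquire Spot capacity through its scheduler, a managed instance group, or Batch rather than one VM at a time.

```python
# Sketch: create a single Spot VM with the Compute Engine client library.
# The project ID, zone, and resource names are hypothetical placeholders.
from google.cloud import compute_v1

project_id = "my-project"
zone = "us-central1-a"

instance = compute_v1.Instance()
instance.name = "spot-compute-node-1"
instance.machine_type = f"zones/{zone}/machineTypes/c2-standard-16"

# Boot disk based on a public image.
disk = compute_v1.AttachedDisk()
disk.boot = True
disk.auto_delete = True
disk.initialize_params = compute_v1.AttachedDiskInitializeParams(
    source_image="projects/debian-cloud/global/images/family/debian-12",
    disk_size_gb=50,
)
instance.disks = [disk]

# Default network.
nic = compute_v1.NetworkInterface()
nic.network = "global/networks/default"
instance.network_interfaces = [nic]

# Request the Spot provisioning model and stop (rather than delete) the VM
# when it is preempted.
instance.scheduling = compute_v1.Scheduling()
instance.scheduling.provisioning_model = compute_v1.Scheduling.ProvisioningModel.SPOT.name
instance.scheduling.instance_termination_action = "STOP"

client = compute_v1.InstancesClient()
operation = client.insert(project=project_id, zone=zone, instance_resource=instance)
operation.result()  # wait for the create operation to finish
print(f"Created Spot VM {instance.name}")
```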

Although every implementation is unique, the following sections provide some general recommendations for three common cases.

Independent researcher looking to process their data

Individual researchers typically want to run their application on their data and get to completion as fast as possible. They might be experts in their application, but they don't want to be experts in cluster administration or management.

If you're running high-throughput workloads, consider using the Batch API. It consists of the following components:

  • Job: A scheduled program that runs a set of tasks to completion without any user interaction, typically for computational workloads. For example, a job might be a single shell script or a complex, multipart computation. Each Batch job consists of an array of one or more tasks that all run the same executable.
  • Tasks: Programmatic actions that are defined as part of a job and executed when the job runs. Each task is part of a job's task group. A job's tasks can run in parallel or sequentially on the job's resources.
  • Resources: The infrastructure needed to run a job. Each Batch job runs on a regional managed instance group (MIG) of Compute Engine VMs based on the job's specified requirements and location. If additional compute resources are specified, a job might also use resources like GPUs, or additional read or write storage resources, like local SSDs or a Cloud Storage bucket. Some factors that determine the number of VMs that are provisioned for a job include the compute resources that are required for each task and the job's parallelism: whether you want tasks to run sequentially on one VM or simultaneously on multiple VMs.

Google Cloud's Batch service lets you create and run jobs that each automatically provision and use the resources that are required to execute their tasks. There is no cluster to provision or manage; each task runs to completion on a VM that Batch provisions and manages. This approach is cost efficient because Compute Engine bills per second of usage.
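
The following is a minimal sketch of submitting a Batch job with the batch_v1 Python client library. The project ID, region, job name, and inline script are hypothetical placeholders; the request mirrors the structure described above: a runnable wrapped in a task, a task group that sets the task count, and an allocation policy that describes the VMs.

```python
# Sketch: submit a Batch job of four tasks that each run a small inline
# shell script. The project ID, region, and job name are hypothetical.
from google.cloud import batch_v1

project_id = "my-project"
region = "us-central1"
job_name = "example-batch-job"

# A runnable is the program that each task executes.
runnable = batch_v1.Runnable()
runnable.script = batch_v1.Runnable.Script()
runnable.script.text = "echo Hello from task ${BATCH_TASK_INDEX}"

# A task wraps the runnable and declares its per-task resource needs.
task = batch_v1.TaskSpec()
task.runnables = [runnable]
task.compute_resource = batch_v1.ComputeResource(cpu_milli=2000, memory_mib=2048)
task.max_retry_count = 2

# A task group declares how many copies of the task to run.
group = batch_v1.TaskGroup(task_count=4, task_spec=task)

# The allocation policy describes the VMs that Batch provisions for the job.
instance_policy = batch_v1.AllocationPolicy.InstancePolicy(machine_type="e2-standard-4")
instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate(policy=instance_policy)
allocation_policy = batch_v1.AllocationPolicy(instances=[instances])

job = batch_v1.Job(
    task_groups=[group],
    allocation_policy=allocation_policy,
    logs_policy=batch_v1.LogsPolicy(
        destination=batch_v1.LogsPolicy.Destination.CLOUD_LOGGING
    ),
)

client = batch_v1.BatchServiceClient()
created = client.create_job(
    batch_v1.CreateJobRequest(
        parent=f"projects/{project_id}/locations/{region}",
        job_id=job_name,
        job=job,
    )
)
print(f"Submitted {created.name}")
```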

Small- to medium-sized cluster for a single project or team

In a typical team, members have access to a cluster that is managed by a central group for use by the entire company. Alternatively, members might have access to large-scale resources at an off-site HPC center. In both situations, the clusters are professionally managed and accessed using standard tools. For example, users might use SSH to connect to a head node and use Slurm job scripts to submit jobs for execution.

In either of these scenarios, when you implement in the cloud, we recommend using Cloud HPC Toolkit to define a cluster environment that is similar to the on-premises or off-site systems. Teams can customize the cluster by selecting a Compute Engine machine type that is best suited for their application, and customize the startup scripts to install the software dependencies for their application. Input data can be staged to Cloud Storage, and the team can install gcsfuse on the compute nodes to mount the input data.

The following diagram shows a typical configuration flow using Cloud HPC Toolkit:

HPC environment configuration using Cloud HPC Toolkit.

The diagram shows how you can use Cloud HPC Toolkit to go from a YAML blueprint to a deployed cluster. First, you create a blueprint or choose from a large selection of examples. Then Cloud HPC Toolkit generates a deployment folder that has all of the Terraform scripts and image building tools that are needed to build an HPC cluster that meets your requirements.

HPC center adding burst capacity to existing clusters

Central HPC centers have tremendous compute capacity, but because they are used by many groups across the company or organization, they tend to have consistently high utilization and long queue wait times. They are typically purchased with a particular production capacity in mind, and when unforeseen workloads are submitted into the mix, progress can slow significantly.

In these situations, you can burst into the Google Cloud environment by adding compute nodes temporarily to the cluster. The cluster becomes a hybrid, with the head node and some compute nodes running on-premises, and other compute nodes running on Google Cloud. When the job queues have been drained, the additional nodes can be released.

Bursting to the cloud is convenient for a couple of reasons:

  • It maintains a consistent user environment for job submission and management. Users don't need to know or care whether additional nodes are added.
  • It allows IT managers to define policies for when to burst, in order to control costs.

The greatest challenge is providing a consistent data and file namespace for user jobs across the on-premises and Google Cloud nodes. The Google Cloud nodes might not have access to the same internal file systems as the on-premises nodes. In this situation, jobs that reference these files will fail to run.

If the Google Cloud nodes are configured with internal file-access permissions, then jobs will run, but they might not perform in the same manner and might incur additional network bandwidth and egress charges. In addition, parallel jobs that are split across on-premises and cloud nodes might not perform well because of the added latency between the different parts of the application.

What's next

Read more about cluster computing use cases on Google Cloud:

Read more about HPC infrastructure:

Get started with your cluster:

For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.