This solution provides guidance for performing large-scale technical computing on Google Cloud. Many technical computing apps require large numbers of individual compute nodes, connected together into a cluster, and coordinating computation and data access across the nodes.
The concepts and technologies underlying cluster computing have developed over the past few decades, and are now mature and mainstream. Migrating the software stack to Google Cloud can add a few wrinkles, but also offers a number of opportunities to decrease cost and alleviate existing bottlenecks in today's high-performance computing environments. This guide provides an overview of the technologies, the challenges, and the current crop of solutions for running computational clusters on Google Cloud.
Cluster computing aggregates and coordinates a collection of machines to work together to solve a task. Clusters typically have a single head node (sometimes called a master node), some number of compute nodes, and possibly a few other specialty nodes. The head node is the brains of the system and is responsible for:
- Registering compute nodes into the system.
- Monitoring the nodes.
- Allocating jobs to particular nodes.
Users submit jobs, which are composed of many tasks, where a task is the basic unit of work. Some apps require all tasks in a job to run concurrently and let tasks communicate to implement a parallel algorithm; some jobs have a complex set of task dependencies such that particular tasks must be run prior to others; and some tasks might require particular node configurations in terms of memory, CPUs, or other particular hardware on which to run. Tasks are executables that read input data from storage, process the data to produce a result, and then write the final results back to storage.
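The dependency constraint described above can be sketched with Python's standard `graphlib` module. The task names are hypothetical, and a real scheduler tracks much more state (resources, retries, placement). This sketch only shows how an execution order that respects task dependencies might be derived.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each task maps to the set of tasks that must finish first.
job = {
    "fetch_input": set(),
    "align": {"fetch_input"},
    "sort": {"align"},
    "index": {"sort"},
    "report": {"sort"},
}

def execution_waves(dependencies):
    """Group tasks into waves; tasks in the same wave can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave complete to unlock successors
    return waves

if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(job), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Tasks that land in the same wave have no dependencies on each other, so a scheduler is free to place them on different nodes at the same time.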
There are two main types of cluster computing workloads:
High-performance computing (HPC) — A type of computing that uses many worker nodes, tightly coupled, and executing concurrently to accomplish a task. These machines typically need low network latency to communicate effectively. Example apps in this space include weather modeling, computational fluid dynamics (CFD), stress modeling in engineering, and electronics design.
High-throughput computing (HTC) — A type of computing where apps have multiple tasks that are processed independently of each other without a need for the individual compute nodes to communicate. Sometimes these workloads are called embarrassingly parallel or batch workloads. Typical examples include media rendering, transcoding, genomics, and particle-physics-event simulation and processing. If you need to process a lot of individual files, it's probably an HTC workload.
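As a rough illustration of the HTC pattern, the following sketch runs one independent task per input with no communication between tasks. A local thread pool stands in for the compute nodes, and the in-memory inputs and the checksum "work" are placeholders chosen for this example, not part of any Google Cloud API.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def process_file(name, payload):
    """Stand-in task: takes its own input and returns its own result."""
    return name, hashlib.sha256(payload).hexdigest()

def run_independent_tasks(inputs, workers=4):
    """Run one task per input; tasks never communicate with each other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_file, name, data)
                   for name, data in inputs.items()]
        return dict(future.result() for future in futures)

if __name__ == "__main__":
    # Hypothetical inputs; in practice these would be files read from storage.
    sample_inputs = {"a.dat": b"alpha", "b.dat": b"beta", "c.dat": b"gamma"}
    for name, digest in run_independent_tasks(sample_inputs).items():
        print(name, digest[:12])
```

Because no task depends on another, adding workers (or nodes) shortens the wall-clock time without changing the program.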
Cluster computing software stack
A cluster computing software stack consists of:
- System-management software that provisions and builds clusters.
- Schedulers that orchestrate job execution.
- End-user apps.
The following sections discuss system-management software and schedulers.
System management software
You can run clustering software either directly on the bare-metal hardware, as with on-premises clusters, or in virtualized environments, as with cloud environments. Orchestrating multiple nodes in a cluster by hand is time consuming and error prone. You can use specialized cluster-management software to provision and configure multiple nodes and resources, together, in a repeatable and deterministic way.
The open source ElastiCluster software from the University of Zurich provides a cloud-native approach to cluster management, with support for provisioning nodes by using Compute Engine and for configuring nodes by using a set of Ansible playbooks. ElastiCluster provisions the nodes and installs a base software stack, including NFS for file serving, NIS for user account management, and a job scheduler for executing user apps. ElastiCluster supports a variety of schedulers, and you can use it out of the box or customize it to meet the needs of small to mid-sized teams.
If you use other configuration management systems to manage your HPC clusters, such as Chef, Puppet, or Terraform, you can leverage those investments as you migrate to Google Cloud by using the available tools and plugins. See Compute Engine management with Puppet, Chef, Salt, and Ansible for more details.
Google Cloud provides native services for provisioning and deploying multi-node software systems. Cloud Deployment Manager lets you provision a set of cloud resources including Compute Engine, Compute Engine managed instance groups, and Cloud Storage. The HTCondor tutorial shows you how to use Cloud Deployment Manager and managed instance groups to provision and configure a cluster.
Job schedulers
After the cluster is operational, the software that manages the task execution and node allocation is called a job scheduler (sometimes called a workload manager or queue manager). Often, a cluster manager comes with a built-in job scheduler. Job schedulers provide a variety of capabilities to help manage jobs and tasks, such as:
- Support for job priorities across users and groups, which helps provide policy-based job scheduling.
- Support for failed tasks by queueing and rescheduling tasks.
- Consideration of task dependencies and resource needs for task allocation.
- Scaling the cluster size depending on the number of jobs in the queue.
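Two of the capabilities listed above, job priorities and rescheduling of failed tasks, can be sketched as a toy in-memory queue. This is not how HTCondor, Slurm, or any other real scheduler is implemented. The class name, the `max_attempts` parameter, and the convention that a lower number means higher priority are all assumptions made for this example.

```python
import heapq
import itertools

class ToyScheduler:
    """Toy queue: lower priority number runs first; failed tasks are requeued."""

    def __init__(self, max_attempts=3):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority
        self.max_attempts = max_attempts

    def submit(self, name, run, priority=10, attempts=0):
        entry = (priority, next(self._order), name, run, attempts)
        heapq.heappush(self._heap, entry)

    def drain(self):
        """Run queued tasks until the queue is empty; return (completed, abandoned)."""
        completed, abandoned = [], []
        while self._heap:
            priority, _, name, run, attempts = heapq.heappop(self._heap)
            try:
                run()
                completed.append(name)
            except Exception:
                # Treat any error like a node failure: requeue until the limit.
                if attempts + 1 < self.max_attempts:
                    self.submit(name, run, priority, attempts + 1)
                else:
                    abandoned.append(name)
        return completed, abandoned
```

A task that fails is put back in the queue at its original priority, which mirrors how a scheduler can hide transient node loss from the user.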
There are a variety of popular commercial and open source workload managers. Examples include HTCondor from University of Wisconsin, Slurm from SchedMD, Univa Grid Engine, and LSF Symphony from IBM. Each has its strengths.
HTCondor is built with a shared-nothing philosophy and is used across shared resources to opportunistically schedule jobs on otherwise idle resources. It provides its own data movement and therefore requires no shared file system. As a result, HTCondor scales to hundreds of thousands of cores and you can use it across multiple zones and regions. HTCondor has been used for hybrid workloads, where the work is shared or divided between on-premises and cloud-based systems. However, as its name implies, it is focused on high-throughput jobs, not tightly coupled, parallel jobs.
Slurm and Univa Grid Engine provide a more traditional HPC cluster environment, supporting both high-throughput and high-performance parallel apps. They both assume a shared file system across the nodes, which eliminates the need to move the data. Both provide a convenient and familiar user environment for developing apps because they are often the same tools used on-premises. These traditional job schedulers are sufficient for small to mid-sized clusters, but as the cluster size increases, the load on the file server becomes the bottleneck for performance. Parallel and distributed file systems (see next section) can help with this problem at high scale.
Finally, Google Cloud includes a simple service for scheduling a Docker-based task on Compute Engine for high-throughput workloads: the Cloud Life Sciences Pipelines API. This service requires you to decompose the job into tasks, manage dependencies across tasks, and manage the task lifecycle. The dsub open source project provides a command-line tool to launch batch jobs and supports the Cloud Life Sciences Pipelines API.
Storage
Most HPC applications require a file storage solution that supports the POSIX API. For smaller clusters, Filestore provides a Google-managed NFS-based file storage service. For larger clusters, however, application I/O can become a performance bottleneck. Scale-out and parallel file systems, such as Elastifile (acquired by Google), Lustre, or Quobyte, help scale to large clusters (or even I/O-heavy smaller clusters).
Alternatively, where low-latency file access isn't required, you can leverage Cloud Storage, which provides parallel object access by using the API or through gcsfuse, where POSIX compatibility is required.
Opportunities for cluster computing in the cloud
There are many reasons to run compute clusters in the cloud:
Time-to-solution. Launching a production-quality cluster in the cloud takes only a few minutes, from a small 10-node cluster with hundreds of available cores, to large-scale clusters with a hundred thousand or more cores. In contrast, building new clusters on-premises can take months to be ready for operation. Even when on-premises clusters are available, they typically have high utilization and long queue wait times — sometimes hours or days — before jobs are scheduled to run. Instead, you can build your own clusters in the cloud, use them for your workloads, and terminate the clusters when your analysis is complete.
Lower total cost of ownership. Google Cloud not only reduces time to solution, but can also reduce the total cost per run by leveraging preemptible VMs, long-term use discounts, and dynamic scaling. You can add nodes when jobs are queued and remove them when not needed.
Support for collaboration. In many situations, the compute analysis is developed in collaboration with different people across multiple organizations. Google Cloud provides project-level identity and access management tools to allow controlled access to data and analytic tools. Authorized users can access the same apps, data, and clusters to ensure that everyone is on the same page without having to copy data, manage versions, or sync cluster configurations.
Task-tailored resources. Because the cost of a job depends only on the total core hours, rather than the number of instances, running clusters in the cloud allows each team or group to have their own dedicated cluster. This approach can alleviate another major pain point of developing policies around multi-group use. You can then customize each dedicated cloud cluster to tune it for the target app. On-premises clusters tend to consist of a one-size-fits-all resource shared across the various groups and apps. In such an environment, policies for sharing across the groups tend to be complex to set up and maintain.
Integration. Before they can run large compute jobs, researchers do significant work to prepare the datasets. After moving to the cloud, these researchers can leverage the big data tools available in the cloud. The outputs of the compute systems also have to be analyzed. Tools such as BigQuery and Datalab can provide significant advantages over those available in on-premises systems.
While the advantages described so far are compelling, there are nevertheless some technical challenges that often hamper migration projects.
Data movement. The datasets that are processed by a cluster's compute nodes typically must be staged to the cloud prior to running the jobs. Managing the data movement can be complex, depending on the volume of the data and how it is managed. Tools such as Avere can help by providing a cloud-caching layer that automatically moves data when needed, but for many apps the datasets must be staged manually.
Data access. Many HPC apps require shared access to a set of files and directories. How this access is provided can significantly affect app performance. You can leverage shared data stored in Cloud Storage, in NFS servers such as Filestore, or in parallel file systems, as discussed in the Storage section.
Security. For data that is sensitive, you must take care to ensure that access is always authorized and data is appropriately encrypted at rest and in transit. While Cloud Storage encrypts data at rest and in transit, you can apply an additional layer of control and manage keys either in Cloud Key Management Service, or on your own. The keys must be retrieved or installed on the compute nodes prior to running the job.
Inter-node latency. For tightly coupled HPC apps, performance might be sensitive to the latency between nodes in the cluster. Because Google Cloud provides nodes with sizes up to 64 cores, you can run 64-way parallel jobs without crossing node boundaries. In most cases, jobs of around 1000 cores or smaller perform reasonably well on non-specialized network hardware.
Software license management. Many commercial apps require a license server, sometimes known as a key server. Some apps come with a built-in or recommended license server and others might be compatible with existing license-server offerings. Some job schedulers can help with license management and stop jobs from running until a license is available.
Recommended architectures and best practices
Technical computing provides many tools and approaches for different circumstances. With so many options, you might find it difficult to get started. Regardless of the choice of cluster management software and scheduler, there are a number of best practices you can follow when running on Google Cloud.
- Leverage preemptible VMs whenever possible. Preemptible VMs are just like regular VMs on Compute Engine, but priced up to 80% less than regular VMs, with the caveat that they can be reclaimed with little notice. For high-throughput workloads, your job schedulers detect the loss of the node and treat it as a node failure and reschedule any tasks running on that node on a different resource. While any work done on those lost nodes might be lost, the probability of node loss is low enough that the lower price is worth the chance. The expected loss rate is 5% to 15%. Preemptible VMs are limited to a maximum of 24 hours of use before being reclaimed.
- Leverage the cost and bandwidth of Cloud Storage instead of running your own parallel file system. Cloud Storage provides strong consistency and scalable parallel performance with low overall cost. While the first-byte latency is high at roughly 100 ms, apps that can leverage Cloud Storage instead of running a parallel file server on Compute Engine are more cost effective. The available bandwidth between Cloud Storage and the compute nodes is sufficient for many apps; some customers have reported sustained aggregate bandwidth of over 23 GB/s.
- Build a single-app or single-group cluster. Traditional clusters are shared across multiple users, groups and apps, which can result in long queue times for jobs and inefficient resource usage by apps. On Google Cloud, consider creating multiple clusters for each group or project, and using clusters that are optimized for particular apps that run on them. Whether you run one cluster for two hours, or two clusters for one hour each, the overall cost is the same, but the latter pattern can reduce queue-wait times, and improve app performance.
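The cost arithmetic behind these practices can be sketched as follows. The price per core-hour is a made-up placeholder, not a real quote. The preemptible model is a deliberate simplification: it assumes the lost fraction of work is rerun exactly once, in full, and it plugs in the upper ends of the discount and loss-rate ranges quoted above.

```python
def cluster_cost(nodes, cores_per_node, hours, price_per_core_hour):
    """Cost is proportional to total core-hours, however they are arranged."""
    return nodes * cores_per_node * hours * price_per_core_hour

def preemptible_cost_ratio(discount, loss_rate):
    """Preemptible cost relative to regular VMs under a simple rerun model.

    Assumes a fraction `loss_rate` of the work is lost and rerun once in full.
    """
    return (1 - discount) * (1 + loss_rate)

PRICE = 0.04  # hypothetical price in USD per core-hour, for illustration only

# One 10-node, 32-core cluster for two hours...
one_cluster_two_hours = cluster_cost(10, 32, 2, PRICE)
# ...versus two such clusters for one hour each.
two_clusters_one_hour = 2 * cluster_cost(10, 32, 1, PRICE)

if __name__ == "__main__":
    print(f"one cluster, two hours:  {one_cluster_two_hours:.2f}")
    print(f"two clusters, one hour:  {two_clusters_one_hour:.2f}")
    ratio = preemptible_cost_ratio(discount=0.80, loss_rate=0.15)
    print(f"preemptible / regular cost ratio: {ratio:.2f}")
```

Both arrangements consume the same 640 core-hours, so they cost the same. Under the stated assumptions, rerunning lost work offsets only a small part of the preemptible discount.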
While every implementation is unique, the following sections provide some general recommendations for three common cases.
Independent researcher looking to process their data
Individual researchers typically seek to run their app across their data and get to completion as fast as possible. They might be experts in their app, but they don't want to be experts in cluster administration or management.
If you are running high-throughput workloads, then you can consider using the Cloud Life Sciences Pipelines API. The Pipelines API requires you to put your app into a Docker container and place your input files in a Cloud Storage bucket. After that is done, you can use the gcloud command-line tool to launch the app against each of the files in the Cloud Storage bucket. You can place the results in another Cloud Storage bucket.
Here's an example of a command to execute a task that uses SAMtools to generate a BAM index file from an input BAM file:
gcloud alpha genomics pipelines run --pipeline_id [PIPELINE_ID] \
    --logging gs://[YOUR_BUCKET]/[YOUR_DIRECTORY]/logs \
    --inputs inputFile=gs://genomics-public-data/gatk-examples/example1/NA12878_chr22.bam \
    --outputs outputFile=gs://[YOUR_BUCKET]/[YOUR_DIRECTORY]/output/NA12878_chr22.bam.bai
- [PIPELINE_ID] represents your app's ID in the Pipelines API.
- [YOUR_BUCKET] represents your Cloud Storage bucket name.
- [YOUR_DIRECTORY] represents the name of your directory.
There is no cluster to provision or manage. The tasks simply run until completion in a VM that is provisioned and managed by the Pipelines API. This is cost efficient because Compute Engine bills per second of usage.
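Launching the app against each file in a bucket amounts to repeating the command above once per input. The sketch below only builds the argument lists and does not execute anything. The flags are copied verbatim from the example command and are not checked against any particular gcloud release. The pipeline ID, bucket, and directory values are placeholders, and the `.bai` output suffix follows the SAMtools example.

```python
import shlex

def build_pipeline_command(pipeline_id, bucket, directory, input_uri):
    """Return the argv list for one task; nothing is executed here."""
    file_name = input_uri.rsplit("/", 1)[-1]
    return [
        "gcloud", "alpha", "genomics", "pipelines", "run",
        "--pipeline_id", pipeline_id,
        "--logging", f"gs://{bucket}/{directory}/logs",
        "--inputs", f"inputFile={input_uri}",
        "--outputs",
        f"outputFile=gs://{bucket}/{directory}/output/{file_name}.bai",
    ]

if __name__ == "__main__":
    # Extend this list with the other objects in your input bucket.
    input_uris = [
        "gs://genomics-public-data/gatk-examples/example1/NA12878_chr22.bam",
    ]
    for uri in input_uris:
        argv = build_pipeline_command(
            "my-pipeline-id", "my-bucket", "my-directory", uri)
        # To launch for real, pass argv to subprocess.run(argv, check=True).
        print(shlex.join(argv))
```

Building the command as a list rather than a string avoids shell-quoting problems if an object name contains spaces or other special characters.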
Small- to medium-sized cluster for a single project or team
In a project or team, the members might have access to a cluster run by a central team for users throughout their company, or might have access to large-scale resources at an off-site HPC center. In both situations, the clusters are professionally managed and accessed using standard tools. For example, users might use ssh to connect to a head node and use Grid Engine submit scripts to submit jobs for execution.
One approach for such a team is to use ElastiCluster to define a cluster environment that is similar to their on-premises systems. They can customize the cluster by selecting a Compute Engine machine type that is best suited for their app, and customize the startup scripts to install the software dependencies for their app. Input data can still be staged to Cloud Storage, and you can install gcsfuse on the compute nodes to mount the input data.
These details are recorded in the ElastiCluster configuration file, and when computation is needed, a cluster is brought up by using the command-line tool, for example:
% elasticluster start astrocluster1
The cluster, named in the configuration file as astrocluster1, is provisioned and configured as specified. The definitions in a configuration file are flexible and support different node types for head and compute nodes, persistent disks for scratch space, preemptible VMs to lower the cost for high-throughput workloads, and GPUs for accelerated operation. An example of a basic configuration for a Slurm-based cluster with 10 compute nodes and 1 head node using 32-core virtual machines based on CentOS would look as follows:
[cluster/astrocluster1]
cloud=google
login=google
setup=ansible-slurm
security_group=default
image_id=centos-7-v20170327
flavor=n1-standard-32
frontend_nodes=1
compute_nodes=10
ssh_to=frontend
boot_disk_size=50
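Because the configuration uses an INI-style layout, it can be inspected with Python's standard `configparser` before a cluster is started, for example to check how many cores it will request. This is a sketch only. It assumes that the trailing number in the machine type name is the core count, which holds for `n1-standard-32` but is not a general rule, and the helper name `summarize_cluster` is invented for this example.

```python
import configparser

CONFIG_TEXT = """
[cluster/astrocluster1]
cloud=google
login=google
setup=ansible-slurm
security_group=default
image_id=centos-7-v20170327
flavor=n1-standard-32
frontend_nodes=1
compute_nodes=10
ssh_to=frontend
boot_disk_size=50
"""

def summarize_cluster(config_text, section):
    """Return node and core totals for one cluster section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    cluster = parser[section]
    # Assumption: the machine type name ends in its core count.
    cores_per_node = int(cluster["flavor"].rsplit("-", 1)[-1])
    nodes = cluster.getint("frontend_nodes") + cluster.getint("compute_nodes")
    return {
        "nodes": nodes,
        "cores_per_node": cores_per_node,
        "total_cores": nodes * cores_per_node,
    }

if __name__ == "__main__":
    print(summarize_cluster(CONFIG_TEXT, "cluster/astrocluster1"))
```

For the configuration shown, this reports 11 nodes (1 head node plus 10 compute nodes) and 352 cores in total.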
Finally, when no more jobs are running in the system, you can stop the cluster:
% elasticluster stop astrocluster1
For larger workloads, you can:
- Look to customize the cluster machine types to further reduce costs.
- Add an external parallel filer to increase performance at scale.
- Add autoscaling capabilities to add and remove nodes based on the queue depth.
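Scaling on queue depth can be reduced to a small sizing rule. The sketch below is one possible policy, not the behavior of any particular scheduler or of the Compute Engine autoscaler. The function name, the `tasks_per_node` parameter, and the default limits are assumptions made for illustration.

```python
import math

def desired_compute_nodes(queued_tasks, tasks_per_node, min_nodes=0, max_nodes=100):
    """Return how many compute nodes to run for the current queue depth."""
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    needed = math.ceil(queued_tasks / tasks_per_node)
    # Clamp to the configured floor and ceiling to bound cost and idle capacity.
    return max(min_nodes, min(max_nodes, needed))

if __name__ == "__main__":
    for depth in (0, 33, 5000):
        nodes = desired_compute_nodes(depth, tasks_per_node=32)
        print(f"{depth} queued tasks: {nodes} nodes")
```

The `max_nodes` ceiling acts as a simple cost control, and a `min_nodes` value of zero lets the cluster shrink to just the head node when the queue is empty.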
HPC center adding burst capacity to existing clusters
Central HPC centers have tremendous capacity for compute, but because they are used by many groups across the company or organization, HPC centers tend to have consistently high utilization and long queue wait times. They are typically purchased with a particular production capacity in mind, and when unforeseen workloads are submitted into the mix, they can slow progress down significantly.
In these situations, you can burst into the Google Cloud environment by adding compute nodes temporarily to the cluster. The cluster becomes a hybrid, with the head node and some compute nodes running on-premises, and other compute nodes running on Google Cloud. When the job queues have been drained, the additional nodes can be released.
Bursting to the cloud is convenient for a couple of reasons:
- It maintains a consistent user environment for job submission and management. Users don't know or care if additional nodes are added.
- It allows IT managers to define policies for when to burst, in order to control costs.
The greatest challenge is providing a consistent data and file namespace for user jobs across the on-premises and Google Cloud nodes. The Google Cloud nodes might not have access to the same internal file systems as the on-premises nodes. In this situation, jobs that reference these files will fail to run.
If the Google Cloud nodes are configured with internal file-access permissions, then jobs will run, but might not perform in the same manner and might create additional networking bandwidth and egress charges. In addition, parallel jobs that are split across on-premises and cloud nodes might also not perform well with the added latency between the different parts of the app.
For high-throughput jobs, using HTCondor to burst into cloud resources sidesteps many of these challenges. HTCondor supports dynamic provisioning using GlideinWMS. As jobs are submitted into a job queue, they can trigger nodes being provisioned and added to the cluster. When they are added, the condor scheduler transfers the input files to the designated node and uses those additional nodes to execute the tasks and drain the queue.
Read more about cluster computing use cases on Google Cloud:
- Google Cloud, HEPCloud, and probing the nature of nature
- 220,000 cores and counting: MIT math professor breaks record for largest ever Compute Engine job
Get started with your cluster:
- Batch processing with Compute Engine Autoscaler
- Creating an HTCondor cluster by using Cloud Deployment Manager templates