Using Clusters for Large-scale Technical Computing in the Cloud

This solution provides guidance for performing large-scale technical computing on Google Cloud Platform (GCP). Many technical computing applications require large numbers of individual compute nodes, connected into a cluster, with computation and data access coordinated across the nodes.

The concepts and technologies underlying cluster computing have developed over the past few decades, and are now mature and mainstream. Migrating the software stack to the cloud can add a few wrinkles, but also offers a number of opportunities to decrease cost and alleviate existing bottlenecks in today's high-performance computing environments. This guide provides an overview of the technologies, the challenges, and the current crop of solutions for running computational clusters on GCP.

Cluster computing aggregates and coordinates a collection of machines to work together to solve a task. Clusters typically have a single head node (sometimes called a master node), some number of compute nodes, and possibly a few other specialty nodes. The head node is the brains of the system and is responsible for:

  • Registering compute nodes into the system.
  • Monitoring the nodes.
  • Allocating jobs to particular nodes.

Figure 1. A cluster is composed of a head node and a set of compute nodes. Users interact with the head node, which then coordinates work out to the compute nodes.

Users submit jobs, which are composed of many tasks, where a task is the basic unit of work. Some applications require all tasks in a job to be run concurrently and allow tasks to communicate to implement a parallel algorithm; some jobs have a complex set of task dependencies such that particular tasks must be run prior to others; and some tasks might require particular node configurations in terms of memory, CPUs, or other particular hardware on which to run. Tasks are executables that read input data from storage, process the data to produce a result, and then write the final results back to storage.

Cluster computing workloads can be broken down into two main types:

  • High-performance computing (HPC) — A type of computing that uses many tightly coupled worker nodes executing concurrently to accomplish a task. These machines typically need low network latency to communicate effectively. Example applications in this space include weather modeling, computational fluid dynamics (CFD), stress modeling in engineering, and electronics design.

  • High-throughput computing (HTC) — A type of computing where applications have multiple tasks that can be processed independently of each other without a need for the individual compute nodes to communicate. Sometimes these workloads are called embarrassingly parallel or batch workloads. Typical examples include media rendering, transcoding, genomics, and particle-physics-event simulation and processing. If you need to process a lot of individual files, it's probably an HTC workload.

Cluster computing software stack

A cluster computing software stack consists of:

  • System-management software that provisions and builds clusters.
  • Schedulers that orchestrate job execution.
  • End-user applications.

The following sections discuss system-management software and schedulers.

System management software

You can run clustering software either directly on bare-metal hardware, as with on-premises clusters, or in virtualized environments, as with cloud environments. Orchestrating multiple nodes in a cluster by hand is time consuming and error prone. You can use specialized cluster-management software to provision and configure multiple nodes and resources together in a repeatable and deterministic way.

The open source ElastiCluster software from the University of Zurich provides a cloud-native approach to cluster management, with support for provisioning nodes on Google Compute Engine and configuring them by using a set of Ansible playbooks. ElastiCluster provisions the nodes and installs a base software stack, including NFS for file serving, NIS for user account management, and a job scheduler for executing user applications. It supports a variety of schedulers, and can be used out of the box or customized to meet the needs of small to mid-sized teams.

If you use other configuration management systems to manage your HPC clusters, such as Chef, Puppet, or Terraform, you can leverage those investments as you migrate to the cloud by using the available tools and plugins. See Compute Engine Management with Puppet, Chef, Salt, and Ansible for more details.

GCP provides native services for provisioning and deploying multi-node software systems. Google Cloud Deployment Manager allows you to provision a set of cloud resources, including Compute Engine instances, Compute Engine managed instance groups, and Google Cloud Storage buckets. The HTCondor tutorial shows you how to use Cloud Deployment Manager and managed instance groups to provision and configure a cluster.
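
As a rough sketch of this approach, the following hypothetical Deployment Manager configuration provisions a single CentOS compute node; the resource name, zone, machine type, and file name are illustrative placeholders rather than values from the HTCondor tutorial:

# cluster.yaml -- illustrative configuration; all values are placeholders.
resources:
- name: compute-node-1
  type: compute.v1.instance
  properties:
    zone: us-central1-f
    machineType: zones/us-central1-f/machineTypes/n1-standard-8
    disks:
    - deviceName: boot
      type: PERSISTENT
      boot: true
      autoDelete: true
      initializeParams:
        sourceImage: projects/centos-cloud/global/images/family/centos-7
    networkInterfaces:
    - network: global/networks/default

The deployment is then created from the configuration file:

gcloud deployment-manager deployments create mycluster --config cluster.yaml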

Job schedulers

After the cluster is operational, the software that manages the task execution and node allocation is called a job scheduler (sometimes called a workload manager or queue manager). Often, a cluster manager comes with a built-in job scheduler. Job schedulers provide a variety of capabilities to help manage jobs and tasks, such as:

  • Support for job priorities across users and groups, which helps provide policy-based job scheduling.
  • Support for failed tasks by queueing and re-scheduling tasks.
  • Consideration of task dependencies and resource needs for task allocation.
  • Scaling the cluster size depending on the number of jobs in the queue.

There are a variety of popular commercial and open source workload managers. Examples include HTCondor from University of Wisconsin, Slurm from SchedMD, Univa Grid Engine, and LSF Symphony from IBM. Each has its strengths.

HTCondor is built with a shared-nothing philosophy and can be easily used across shared resources to opportunistically schedule jobs on otherwise idle resources. It provides its own data movement and therefore requires no shared file systems. As a result, it scales to hundreds of thousands of cores and can be used across multiple zones and regions. HTCondor has been used for hybrid workloads, where the work is shared or divided between on-premises and cloud-based systems. However, as its name implies, it is focused on high-throughput jobs, not tightly coupled, parallel jobs.
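
For example, a high-throughput job of 100 independent tasks might be described in an HTCondor submit file similar to the following sketch; the executable, file names, and resource requests are placeholders:

# process.sub -- illustrative HTCondor submit description.
universe                = vanilla
executable              = process.sh
arguments               = input_$(Process).dat
transfer_input_files    = input_$(Process).dat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
request_cpus            = 1
output                  = out.$(Process)
error                   = err.$(Process)
log                     = process.log
queue 100

The job is then queued on the cluster with:

condor_submit process.sub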

Slurm and Univa Grid Engine provide a more traditional HPC cluster environment, supporting both high-throughput and high-performance parallel applications. They both assume a shared file system across the nodes, which eliminates the need to move the data. Both provide a convenient and familiar user environment for developing applications because they are often the same tools used on-premises. These traditional job schedulers are sufficient for small to mid-sized clusters, but as the cluster size increases, the load on the file server becomes the bottleneck for performance. Parallel file systems, such as Gluster and Ceph, can help with this problem at high scale. Alternatively, where low-latency file access is not required, you can leverage Cloud Storage, which provides parallel object access through the API, or through gcsfuse where POSIX compatibility is required.
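
For example, a tightly coupled job might be described in a minimal Slurm batch script similar to the following sketch; the solver binary, the /shared mount point, and the resource requests are placeholders:

#!/bin/bash
#SBATCH --job-name=cfd-case1     # illustrative job name
#SBATCH --ntasks=64              # total number of tasks (for example, MPI ranks)
#SBATCH --time=02:00:00          # wall-clock limit

# /shared is assumed to be an NFS or parallel file system mounted on every node.
srun ./cfd_solver --input /shared/meshes/case1 --output /shared/results/case1

The script is submitted with sbatch, and the scheduler allocates nodes and launches the tasks:

sbatch cfd_case1.sh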

Finally, GCP includes a simple service for scheduling a Docker-based task on Compute Engine for high-throughput workloads: the Google Genomics Pipelines API. This service is easy to use, but requires you to decompose the job into tasks, manage dependencies across tasks, and manage the task lifecycle. The dsub open source project provides a command-line tool that makes it even easier to launch batch jobs and supports the Genomics Pipelines API.
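
For example, a hypothetical dsub invocation might look like the following; the project, bucket names, image, and command are placeholders, and the exact flags can vary between dsub releases, so check the project's documentation:

dsub \
    --project [YOUR_PROJECT] \
    --zones "us-central1-*" \
    --logging gs://[YOUR_BUCKET]/logs \
    --image ubuntu:16.04 \
    --input INPUT_FILE=gs://[YOUR_BUCKET]/input/sample.dat \
    --output OUTPUT_FILE=gs://[YOUR_BUCKET]/output/sample.txt \
    --command 'wc -l "${INPUT_FILE}" > "${OUTPUT_FILE}"'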

Opportunities for cluster computing in the cloud

There are many good reasons to run compute clusters in the cloud:

  • Time-to-solution. Launching a production-quality cluster in the cloud takes only a few minutes, from a small 10-node cluster with hundreds of available cores, to large-scale clusters with a hundred thousand or more cores. In contrast, building new clusters on-premises can take months to be ready for operation. Even when on-premises clusters are available, they typically have high utilization and long queue wait times — sometimes hours or days — before jobs are scheduled to run. Instead, you can build your own clusters in the cloud, use them for your workloads, and terminate the clusters when your analysis is complete.

  • Lower total cost of ownership. GCP not only reduces time to solution, but can also reduce the total cost per run by leveraging preemptible VMs, sustained use discounts, and dynamic scaling. You can add nodes when jobs are queued and remove them when not needed.

  • Support for collaboration. In many situations, the compute analysis is developed in collaboration with different people across multiple organizations. GCP provides project-level identity and access management tools to allow controlled access to data and analytic tools. Authorized users can access the same applications, data, and clusters to ensure that everyone is on the same page without having to copy data, manage versions, or sync cluster configurations.

  • Task-tailored resources. Because the cost of a job depends only on the total core hours, rather than the number of instances, running clusters in the cloud allows each team or group to have their own dedicated cluster. This approach can alleviate another major pain point of developing policies around multi-group use. Each dedicated cloud cluster can then be customized and tuned to the target application. On-premises clusters tend to be a one-size-fits-all resource shared across the various groups and applications. In such an environment, policies for sharing across the groups tend to be complex to set up and maintain.

  • Integration. Before they can run large compute jobs, researchers do significant work to prepare the datasets. After moving to the cloud, these researchers can leverage the Big Data tools available in the cloud. The outputs of the compute systems also have to be analyzed. Tools such as Google BigQuery and Google Cloud Datalab can provide significant advantages over those available in on-premises systems.

Figure 2. Typical on-premises clusters are shared across users and groups and support many different application needs. In contrast, when moving to GCP, users can customize the cluster properties to match the application needs, reducing cost and increasing performance.

Architectural considerations

While the advantages described so far are compelling, there are nevertheless some technical challenges that often hamper migration projects.

  • Data movement. The datasets that are processed by a cluster's compute nodes typically must be staged to the cloud before the jobs run. Managing the data movement can be complex, depending on the volume of the data and how it is managed. Tools such as Avere can help by providing a cloud-caching layer that automatically moves data when needed, but for many applications the datasets must be staged manually, for example with gsutil as shown in the sketch after this list.

  • Data access. Many HPC applications require shared access to a set of files and directories. How this access is provided can significantly affect application performance. You can share data stored in Cloud Storage, on NFS servers, or in parallel file systems such as Gluster and Ceph.

  • Security. For data that is sensitive, you must take care to ensure that access is always authorized and data is appropriately encrypted at rest and in transit. While Cloud Storage encrypts data at rest and in transit, you can apply an additional layer of control and manage keys either in Google's Key Management Service or on your own. The keys must be retrieved or installed on the compute nodes before the job runs.

  • Inter-node latency. For tightly coupled HPC applications, performance might be sensitive to the latency between nodes in the cluster. Because GCP provides nodes with sizes up to 64 cores, you can run 64-way parallel jobs on a single node without traversing the network. In most cases, jobs of around 1000 cores or smaller perform reasonably well on non-specialized network hardware.

  • Software license management. Many commercial applications require a license server, sometimes known as a key server. Some applications come with a built-in or recommended license server and others might be compatible with existing license-server offerings. Some job schedulers can help with license management and stop jobs from running until a license is available.
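
Referring back to the data movement consideration above, the following is a minimal sketch of manual staging with gsutil; the bucket and directory names are placeholders:

# Copy the input dataset to Cloud Storage in parallel before launching jobs.
gsutil -m cp -r ./dataset gs://[YOUR_BUCKET]/input/

# After the jobs complete, synchronize the results back to on-premises storage.
gsutil -m rsync -r gs://[YOUR_BUCKET]/output/ ./results/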

Technical computing provides many tools and approaches for different circumstances. With so many options, you might find it difficult to get started. Regardless of the choice of cluster management software and scheduler, there are a number of best practices you can follow when running on GCP.

  • Leverage preemptible VMs whenever possible. Preemptible VMs are just like regular VMs on Compute Engine, but priced up to 80% less, with the caveat that they can be reclaimed with little notice. For high-throughput workloads, your job scheduler detects the loss of the node, treats it as a node failure, and reschedules any tasks running on that node onto a different resource. While any work done on those lost nodes might be lost, the probability of node loss is low enough that the lower price is worth the chance. The expected loss rate is 5% to 15%. Preemptible VMs are limited to a maximum of 24 hours of use before being reclaimed. A minimal example of requesting a preemptible node appears after this list.
  • Leverage the cost and bandwidth of Cloud Storage instead of running your own parallel file system. Cloud Storage provides strong consistency and scalable parallel performance with low overall cost. While the first-byte latency is high at roughly 100 ms, applications that can leverage Cloud Storage instead of running a parallel file server on Compute Engine are more cost effective. The available bandwidth between Cloud Storage and the compute nodes is sufficient for many applications; some customers have reported sustained aggregate bandwidth of over 23 GB/s.
  • Build a single-application or single-group cluster. Traditional clusters are shared across multiple users, groups, and applications, which can result in long queue times for jobs and inefficient resource usage by applications. On GCP, consider creating a separate cluster for each group or project, optimized for the applications that run on it. Whether you run one cluster for two hours or two clusters for one hour each, the overall cost is the same, but the latter pattern can reduce queue-wait times and improve application performance.
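
Referring back to the first recommendation above, the following is a minimal sketch of creating a preemptible compute node with gcloud; the instance name and machine type are placeholders, and cluster-management tools typically expose an equivalent option:

# The --preemptible flag requests the lower-priced, reclaimable VM class.
gcloud compute instances create htc-node-1 \
    --machine-type n1-standard-16 \
    --preemptible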

While every implementation is unique, the following sections provide some general recommendations for three common cases.

Independent researcher looking to process their data

Individual researchers typically seek to run their application across their data and get to completion as fast as possible. They may be experts in their application, but they don't want to be experts in cluster administration or management.

If you are running high-throughput workloads, consider using the Genomics Pipelines API. The Pipelines API requires you to put your application into a Docker container and place your input files in a Cloud Storage bucket. After that is done, you can use the gcloud command-line tool to launch the application against each of the files in the Cloud Storage bucket. The results can be placed in another Cloud Storage bucket.

Here's an example of a command to execute a task that uses samtools to generate a BAM index file from an input BAM file:

gcloud alpha genomics pipelines run --pipeline_id [PIPELINE_ID] \
    --logging gs://[YOUR_BUCKET]/[YOUR_DIRECTORY]/logs \
    --inputs inputFile=gs://genomics-public-data/gatk-examples/example1/NA12878_chr22.bam \
    --outputs outputFile=gs://[YOUR_BUCKET]/[YOUR_DIRECTORY]/output/NA12878_chr22.bam.bai

There is no cluster to provision or manage. The tasks simply run until completion in a VM that is provisioned and managed by the Pipelines API. This is cost efficient because Compute Engine bills per second of usage.

Small- to medium-sized cluster for a single project or team

Within a project or team, the members might have access to a cluster run by a central team for users throughout their company, or might have access to large-scale resources at an off-site HPC center. In both situations, the clusters are professionally managed and accessed using standard tools. For example, users might SSH into a head node and use Grid Engine submit scripts to submit jobs for execution.

One approach for such a team would be to use ElastiCluster to define a cluster environment that is similar to their on-premises systems. They can customize the cluster by selecting a Compute Engine machine type that is best suited for their application, and customize the startup scripts to install the software dependencies for their application. Input data can still be staged to Cloud Storage, and gcsfuse can be installed on the compute nodes to mount the input data.
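
For example, a node startup script might mount the staging bucket with gcsfuse; the bucket name and mount point below are placeholders:

# Mount a Cloud Storage bucket as a POSIX-like file system on each compute node.
mkdir -p /mnt/input
gcsfuse [YOUR_BUCKET] /mnt/input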

These details are recorded in the ElastiCluster configuration file, and when computation is needed, a cluster is brought up using the command-line tool, for example:

% elasticluster start astrocluster1

The cluster, named in the configuration file as astrocluster1, is provisioned and configured as specified. The definitions in a configuration file are flexible and support different node types for head and compute nodes, Compute Engine persistent disks for scratch space, preemptible VMs to lower the cost for high-throughput workloads, and GPUs for accelerated operation. An example of a basic configuration for a Slurm-based cluster with 10 compute nodes and 1 head node, using 32-core virtual machines based on CentOS, would look as follows:

[cluster/astrocluster1]
cloud=google
login=google
setup=ansible-slurm
security_group=default
image_id=centos-7-v20170327
flavor=n1-standard-32
frontend_nodes=1
compute_nodes=10
ssh_to=frontend
boot_disk_size=50

Finally, when no more jobs are running in the system, the cluster can be terminated:

% elasticluster stop astrocluster1

For larger workloads, you can:

  • Look to customize the cluster machine types to further reduce costs.
  • Add an external parallel filer to increase performance at scale.
  • Add autoscaling capabilities so that nodes are added and removed based on the queue depth.

HPC center adding burst capacity to existing clusters

Central HPC centers have tremendous capacity for compute, but because they are used by many groups across the company or organization, HPC centers tend to have consistently high utilization and long queue wait times. They are typically purchased with a particular production capacity in mind, and when unforeseen workloads are submitted into the mix, they can slow progress down significantly.

In these situations, you can burst into the GCP environment by adding compute nodes temporarily to the cluster. The cluster becomes a hybrid, with the head node and some compute nodes running on-premises, and other compute nodes running on GCP. When the job queues have been drained, the additional nodes can be released.

Bursting to the cloud is convenient for a couple of reasons:

  • It maintains a consistent end-user environment for job submission and management. End users don't know or care if additional nodes are added.
  • It allows IT managers to define policies for when to burst, in order to control costs.

The greatest challenge is providing a consistent data and file namespace for user jobs across the on-premises and GCP nodes. The GCP nodes might not have access to the same internal file systems as the on-premises nodes. In this situation, jobs that reference these files will fail to run.

If the GCP nodes are configured with internal file-access permissions, then jobs will run, but might not perform in the same manner and might create additional networking bandwidth and egress charges. In addition, parallel jobs that are split across on-premises and cloud nodes might also not perform well with the added latency between the different parts of the application.

For high-throughput jobs, using HTCondor to burst into cloud resources sidesteps many of these challenges. HTCondor supports dynamic provisioning using GlideInWMS. As jobs are submitted into a job queue, they can trigger nodes being provisioned and added to the cluster. As nodes are added, the HTCondor scheduler transfers the input files to the designated nodes and uses those additional nodes to execute the tasks and drain the queue.

Next steps

Read more about cluster computing use cases on GCP:

Read more about:

Get started with your cluster:

Example projects on GitHub:
