Cluster Toolkit

Cluster Toolkit, formerly known as Cloud HPC Toolkit, is open-source software from Google Cloud that simplifies deploying high performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud. It is designed to be highly customizable and extensible, and aims to address the deployment needs of a broad range of use cases.

Benefits

Cluster Toolkit provides you with the following benefits:

  • Fast creation and deployment of turnkey HPC, AI, and ML clusters that follow Google Cloud best practices
  • An open-source solution that is configurable and extensible
  • Integration with partner technologies such as Intel DAOS, DDN EXAScaler, and Slurm
  • Monitoring and performance visibility through integration with Cloud Monitoring

Components

Cluster Toolkit has the following main components:

  • Cluster blueprint: a YAML file that defines which cluster modules to use and how to customize them.
  • Cluster modules: the building blocks of a deployment folder. Modules are composed of Terraform or Packer configuration files.
  • ghpc engine: a Google Open Source tool that uses the information in the cluster blueprint to combine different cluster modules and produce a deployment folder.
  • Cluster deployment folder: a self-contained folder that can be used to deploy a cluster onto Google Cloud. With Cluster Toolkit, you have the added flexibility to configure a cluster to your specifications by editing the deployment folder before you deploy.
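To make the relationship between a blueprint and its modules concrete, the following is a minimal sketch of a blueprint's structure, written as a Python dictionary rather than YAML so that it can be checked programmatically. The field names (blueprint_name, vars, deployment_groups, modules, id, source, use, settings), the module source paths, and all values are assumptions modeled on typical example blueprints, not a definitive schema. The helper function is hypothetical and is not part of Cluster Toolkit.

```python
# Hypothetical blueprint contents. Field names and module paths are
# assumptions for illustration; all values are placeholders.
blueprint = {
    "blueprint_name": "example-cluster",
    "vars": {
        "project_id": "my-project",
        "deployment_name": "example-deployment",
        "region": "us-central1",
        "zone": "us-central1-a",
    },
    "deployment_groups": [
        {
            "group": "primary",
            "modules": [
                # Each module names a source and, optionally, the IDs of
                # other modules whose outputs it consumes.
                {"id": "network", "source": "modules/network/vpc"},
                {
                    "id": "homefs",
                    "source": "modules/file-system/filestore",
                    "use": ["network"],
                    "settings": {"local_mount": "/home"},
                },
            ],
        }
    ],
}


def undefined_module_references(bp):
    """Return (module_id, missing_id) pairs where `use` names an unknown module."""
    known = {m["id"] for g in bp["deployment_groups"] for m in g["modules"]}
    return [
        (m["id"], ref)
        for g in bp["deployment_groups"]
        for m in g["modules"]
        for ref in m.get("use", [])
        if ref not in known
    ]


if __name__ == "__main__":
    print(undefined_module_references(blueprint))
```

A check like this only illustrates how modules reference one another within a blueprint. The actual blueprint is a YAML file, and the ghpc engine performs its own validation when it builds the deployment folder.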

How it works

Figure 1. Cluster Toolkit architecture overview

You can use Cluster Toolkit to deploy clusters on Google Cloud as follows:

  1. Set up your working environment, which is the command line from which you run your commands. You can use a Linux or macOS command line, or Cloud Shell. If you use a Linux or macOS command line, you need to install a few dependencies.
  2. From the command line, complete the following:

    1. Clone the Cluster Toolkit GitHub repository. This repository contains the ghpc binary, modules, cluster blueprint examples, and other resources needed for the configuration of your cluster.
    2. Build the ghpc binary.

    For detailed instructions, see Configure your environment.

  3. Use an editor to create your cluster blueprint file. Example blueprints are also available in the Cluster Toolkit GitHub repository. You can use these blueprints either directly or as a starting point for your custom cluster blueprint.

  4. From the command line, complete the following:

    1. Run the ghpc create command and specify your cluster blueprint. When you run this command, the ghpc engine completes the following steps:
      1. Builds a deployment folder that is based on the specified cluster blueprint. This deployment folder contains all the specifications and resources needed to deploy the cluster.
      2. Prints instructions to the command line on how to deploy the cluster. These instructions list the commands that you must run to deploy the cluster, which are either Terraform or Packer commands.
    2. Run the commands provided by the ghpc engine. When you run these commands, Terraform or Packer then deploys the cluster on Google Cloud.

    For detailed instructions, see Deploy a cluster.

  5. After your cluster is deployed, you can submit jobs to your HPC cluster. You can also use Cloud Monitoring to analyze and monitor Google Cloud resources that are used by your cluster.
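
The command-line portion of the preceding steps (2.1, 2.2, and 4.1) can be sketched as a small Python script. This is an illustration under stated assumptions, not the documented procedure: the repository URL, the use of make to build the binary, and the blueprint file name my-blueprint.yaml are assumptions that do not come from the text above. Only the ghpc create command is taken from this page. The script defaults to a dry run that prints the commands without executing them.

```python
import subprocess
from pathlib import Path

# Assumptions, not stated in the text above: the repository URL, the `make`
# build step, and the blueprint file name used in the example call.
REPO_URL = "https://github.com/GoogleCloudPlatform/cluster-toolkit.git"
REPO_DIR = Path("cluster-toolkit")


def workflow_commands(blueprint_path):
    """Return (argv, working_directory) pairs for steps 2.1, 2.2, and 4.1.

    blueprint_path is interpreted relative to REPO_DIR, because the
    create command is run from inside the cloned repository.
    """
    return [
        (["git", "clone", REPO_URL, str(REPO_DIR)], None),      # step 2.1
        (["make"], REPO_DIR),                                    # step 2.2
        (["./ghpc", "create", str(blueprint_path)], REPO_DIR),  # step 4.1
    ]


def run(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv, cwd in steps:
        print(f"[{cwd or '.'}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    run(workflow_commands("my-blueprint.yaml"))  # dry run: prints only
```

Step 4.2 is deliberately left out: the commands to run next are the ones that the ghpc engine prints, so they should be copied from its output rather than hard-coded.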

Limitations

Cluster Toolkit supports only creating and deleting a cluster. It does not modify a cluster in place. If you want to change the hardware or software configuration of an active cluster, Google recommends the following steps:

  1. Delete the cluster
  2. Update the cluster blueprint
  3. Create the cluster deployment folder
  4. Deploy the cluster
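
The four steps above can be laid out as an ordered plan, sketched here in Python. The Terraform commands are stand-ins: they assume that the cluster was deployed with Terraform from a single directory inside the deployment folder, and the path my-deployment/primary and the file name my-blueprint.yaml are hypothetical. In practice, use the commands that the ghpc engine prints for your deployment. Whether ghpc create can write into a deployment folder that already exists is not covered here, so that step might need an extra option or a new folder.

```python
def recreate_plan(blueprint_path, group_dir):
    """Return the recommended steps as (description, commands) pairs.

    `commands` is a list of argv lists; it is empty for the manual step.
    The Terraform invocations are assumed stand-ins for the commands that
    the ghpc engine prints.
    """
    terraform = ["terraform", f"-chdir={group_dir}"]
    return [
        ("Delete the cluster", [terraform + ["destroy"]]),
        ("Update the cluster blueprint", []),  # manual edit, no command
        ("Create the cluster deployment folder",
         [["./ghpc", "create", blueprint_path]]),
        ("Deploy the cluster", [terraform + ["init"], terraform + ["apply"]]),
    ]


if __name__ == "__main__":
    # Hypothetical paths, used only to print the plan.
    plan = recreate_plan("my-blueprint.yaml", "my-deployment/primary")
    for number, (description, commands) in enumerate(plan, start=1):
        print(f"{number}. {description}")
        for argv in commands:
            print("     " + " ".join(argv))
```

The plan is only printed, not executed, because deleting a cluster is destructive and should be confirmed by hand.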

What's next