Build large-scale AI/ML and HPC clusters with Cluster Toolkit (formerly HPC Toolkit)

August 2, 2024
Annie Ma-Weaver

Group Product Manager, Google Cloud

Shivani Matta

Software Engineering Manager, Google Cloud

Update: Starting the week of September 16, 2024, Google Cloud customers with eligible support plans can get assistance for Cluster Toolkit through the Google Cloud console. Cluster Toolkit, formerly known as Cloud HPC Toolkit, is open-source software from Google Cloud that makes it simpler to deploy HPC, AI, and ML workloads on Google Cloud. The Cloud Support team handles filed cases so that you receive timely and effective support for your Cluster Toolkit implementations. To get in touch about a Cluster Toolkit issue, select 'Cluster Toolkit' as the sub-category under 'Compute Engine' when creating a support ticket in the console.


Cluster Toolkit, formerly the Cloud HPC Toolkit, simplifies the creation and management of high-performance computing environments on Google Cloud. Initially focused on scientific and technical computing workloads, it has expanded to cover AI/ML applications, reflecting its adoption well beyond traditional HPC.

The Cluster Toolkit empowers users to focus on their workloads by streamlining cluster setup and deployment, leveraging Google Cloud's best practices, and offering flexibility for diverse computing tasks. Key benefits include:

  • Easy deployment and management of clusters: The Toolkit simplifies setting up and maintaining clusters, so users spend less time on infrastructure management. It supports multiple schedulers, including Slurm, GKE, and Batch.

  • Quickstart options for HPC and AI/ML workloads: The Toolkit includes a library of pre-built blueprints and modules that let users start running their workloads quickly.

  • Integration of Google Cloud best practices: The Toolkit's blueprints and modules incorporate Google Cloud's recommended configurations, so clusters are set up for performance and efficiency.

  • Regular updates and new features: The Toolkit is actively maintained and regularly updated with new features and improvements.

  • Open-source accessibility: The Toolkit is open-source, allowing users to customize and extend its capabilities to meet their specific needs.

What's new in Cluster Toolkit

In addition to a new name, Cluster Toolkit has several new features for HPC and AI/ML workloads:

  • A3 Mega Blueprint: This blueprint makes it easy to deploy a cluster of A3 Mega VMs ready for training large language models (LLMs) and other AI/ML workloads. Earlier in the year, we also launched the A3 Blueprint.

  • HPC VM Image: This VM image comes with popular HPC tools and libraries pre-installed, so you can begin running your HPC workloads quickly and with assured performance.

  • Slurm-gcp v6: The latest version of the Slurm-gcp solution, which provides a seamless experience for running Slurm workloads on Google Cloud, is now generally available (GA).

Guidelines for existing Toolkit customers

We've renamed our GitHub repo to “Cluster Toolkit” and some commands (e.g., ghpc is now gcluster). Existing Git operations and commands will still work, but we strongly recommend updating local clones and command names to avoid confusion.
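If you have wrapper scripts or runbooks that still call ghpc, updating the command name can be scripted. The following is a minimal Python sketch under stated assumptions: the helper name `rename_ghpc_commands` and the matching pattern are hypothetical and not part of Cluster Toolkit, and the sketch only rewrites standalone `ghpc` tokens in text that you pass to it.

```python
import re

# Match "ghpc" only as a standalone token, so that names such as
# "ghpc-data" or "myghpc" are left untouched. This pattern is an
# illustrative assumption, not something the Toolkit provides.
_GHPC_TOKEN = re.compile(r"(?<![\w-])ghpc(?![\w-])")


def rename_ghpc_commands(script_text: str) -> str:
    """Return script_text with standalone `ghpc` tokens replaced by `gcluster`."""
    return _GHPC_TOKEN.sub("gcluster", script_text)
```

For example, passing the line `./ghpc create my-blueprint.yaml` returns `./gcluster create my-blueprint.yaml`. Review the result before committing it, since the pattern also rewrites standalone mentions of ghpc in comments.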

How to get started

To get started with the Cluster Toolkit, select one of our easy-to-use HPC and AI/ML blueprints, available through our GitHub repo, and use it to set up a cluster. We also offer a variety of resources to help you get started, including documentation, quickstarts, and videos.
