
Enhancing the HPC experience with Slurm-GCP v6 and TPU support

June 10, 2024
Annie Ma-Weaver

Group Product Manager, Google Cloud HPC

Nick Stroud

Tech Lead, Google Cloud HPC

On Google Cloud, our HPC-optimized infrastructure, including the AI Hypercomputer, can be deployed in multiple ways according to user preferences. For customers that want a Slurm-based environment, we recommend using the Cloud HPC Toolkit, a Google product that helps simplify the creation and management of HPC systems for AI/ML and traditional HPC workloads. The Toolkit features our Slurm-GCP offering, a set of Slurm scripts that helps automate the installation, deployment, and certain operational aspects of Slurm on Google Cloud.

Today we’re excited to announce the general availability of Slurm-GCP v6, the latest and recommended version, which runs on Slurm 23.11. This release is the result of our ongoing multi-year collaboration with the engineering experts at SchedMD.

Slurm-GCP v6 provides the following benefits, compared with v5:

Faster deployments 

A simple cluster (Slurm infrastructure in a pre-existing VPC, with no file systems deployed in parallel and no autoscaling clusters) now deploys 3x faster than with the previous version.

Robust reconfiguration 

Reconfiguration is the Slurm-GCP mechanism for making changes to a running cluster. It is now managed by a service that runs on each instance, which provides a more consistent experience. Reconfiguration is also enabled by default, making it easier to change a running cluster.

More deployments in a single project 

We have lifted the restriction on the number of clusters that can be deployed in a single project.

Fewer dependencies in the deployment environment 

Reconfiguration and compute node cleanup features are now enabled by default and no longer require users to set them up, making it easier to manage Slurm clusters. 

Full support for TPU v3 and v4 

TPU v3 and v4 are now fully supported, allowing TPU and GPU partitions to be configured alongside each other for maximum flexibility in choosing your preferred accelerators.

Start using v6 today by navigating to the Toolkit blueprint library, which includes blueprints such as Running the MaxText ML Benchmark on TPUs with Slurm and Running Apptainer Containers with Slurm. Blueprints that use a prior version of Slurm-GCP contain “v5” in the name and will be supported through November 2024.
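If you keep local copies of several blueprints, the “v5” naming convention makes it easy to see which ones still need to move to v6. The following is a minimal sketch in Python, not part of the Toolkit: the function name and the file names are hypothetical, and it assumes only that “v5” appears somewhere in the blueprint's name.

```python
from pathlib import Path


def split_by_slurm_gcp_version(blueprint_names):
    """Split blueprint names into (v5, other) using the "v5" naming marker.

    Illustrative helper only; it is not provided by the Cloud HPC Toolkit.
    """
    v5, other = [], []
    for name in blueprint_names:
        # Compare against the file name without its extension, ignoring case.
        stem = Path(name).stem.lower()
        if "v5" in stem:
            v5.append(name)
        else:
            other.append(name)
    return v5, other


# Hypothetical file names, used only to show the behavior.
names = ["hpc-slurm.yaml", "hpc-slurm-v5-legacy.yaml", "ml-slurm-v5-legacy.yaml"]
legacy, current = split_by_slurm_gcp_version(names)
print("Still on v5:", legacy)
print("Everything else:", current)
```

Anything in the first list is a candidate for migration before v5 support ends in November 2024.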
