HPC made easy: Announcing new features for Slurm on GCP

March 7, 2019

Annie Ma-Weaver

Group Product Manager, Google Cloud

Aaron Blasius

Product Manager

Today, we’re sharing a new set of features for Slurm running on Google Cloud Platform (GCP), including support for preemptible VMs, custom machine types, image-based instance scaling, attachable GPUs, and customizable NFS mounts. In addition, this release improves deployment scalability and resilience.

Slurm is one of the leading open-source HPC workload managers used in TOP500 supercomputers around the world. Last year, we worked with SchedMD, the core company behind Slurm, to make it easier to launch Slurm on Compute Engine.

Here’s more information about these new features:

Support for preemptible VMs and custom machine types

You can now scale up a Compute Engine cluster with Slurm and preemptible VMs, while support for custom machine types lets you run your workloads on instances with an optimal amount of CPU and memory. Both features help you achieve much lower costs for your HPC workloads: Preemptible VMs can be up to 80% cheaper than regular instances and custom machine types can generate savings of 50% or more compared to predefined types.
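To make the savings claim concrete, here is a minimal Python sketch of the arithmetic. The node count, duration, and hourly price are hypothetical placeholders, not published GCP prices. Only the "up to 80%" figure comes from the text above, and it is an upper bound rather than a guaranteed discount.

```python
def cluster_cost(nodes: int, hours: float, price_per_node_hour: float,
                 discount: float = 0.0) -> float:
    """Return the estimated cost of running `nodes` instances for `hours`.

    `discount` is a fraction in [0, 1]; 0.8 models the "up to 80% cheaper"
    upper bound quoted for preemptible VMs.
    """
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return nodes * hours * price_per_node_hour * (1.0 - discount)


# Hypothetical inputs: 100 nodes for 10 hours at $1.00 per node-hour.
regular = cluster_cost(100, 10, 1.00)
preemptible = cluster_cost(100, 10, 1.00, discount=0.8)
print(f"regular: ${regular:.2f}, preemptible (best case): ${preemptible:.2f}")
```

Because Compute Engine can reclaim preemptible VMs, the real saving also depends on how much work has to be rerun after a preemption, which this sketch ignores.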

Image-based instance scaling

Rather than installing packages from the internet and applying script configurations, you can now create Slurm compute instances from a Google-provided disk image. This feature significantly shortens the time required to provision each node and increases deployment resilience. The image is created automatically by provisioning an image-creation node and is then used as the basis for all other auto-scaled compute instances. Image-based scaling can bring up a net-new cluster of 5000 nodes in under 7 minutes.

Optional, attachable GPUs

Compute Engine supports a wide variety of GPUs (e.g. NVIDIA V100, K80, T4, P4 and P100, with others on the horizon), which you can attach to your instances based on region and zone availability. Now, Slurm will automatically install the appropriate NVIDIA/CUDA drivers and software according to GPU model and compatibility, making it easy to scale up your GPU workloads on Compute Engine using Slurm.
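Once GPU nodes are available, a job typically asks Slurm for them through the standard `--gres` option. The Python sketch below only renders the text of such a batch script. The job name, the partition name `debug`, and the `nvidia-smi` command are illustrative assumptions, and the script presumes the cluster's GPU resources are already configured in Slurm.

```python
def render_gpu_job(job_name: str, partition: str, gpus_per_node: int,
                   command: str) -> str:
    """Build the text of a Slurm batch script that requests GPUs via --gres."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",  # hypothetical partition name
        "#SBATCH --nodes=1",
        f"#SBATCH --gres=gpu:{gpus_per_node}",  # generic-resource request
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: list the GPUs visible on one node.
script = render_gpu_job("gpu-check", "debug", 1, "nvidia-smi")
print(script)
```

Saving the output to a file and passing it to `sbatch` would submit the job.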

Customizable NFS mounts and VPC flexibility

Finally, you can now set the NFS hosts of your choice for storage. Cloud Filestore is a great option if you want a fully managed NFS experience. You can also specify a pre-existing VPC or Shared VPC to host your cluster.
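As a rough illustration of what pointing a node at an NFS host of your choice amounts to, the Python sketch below builds a standard `/etc/fstab` entry for an NFS export. The host address and paths are hypothetical, and this does not show the release's actual configuration options, which are documented in its README.

```python
def fstab_entry(nfs_host: str, export_path: str, mount_point: str,
                options: str = "defaults") -> str:
    """Return one /etc/fstab line that mounts an NFS export."""
    for path in (export_path, mount_point):
        if not path.startswith("/"):
            raise ValueError(f"{path!r} must be an absolute path")
    # fstab fields: device, mount point, type, options, dump, fsck pass
    return f"{nfs_host}:{export_path} {mount_point} nfs {options} 0 0"


# Hypothetical NFS server address and export.
entry = fstab_entry("10.0.0.2", "/home", "/home")
print(entry)
```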

Getting started

This new release was built by the Slurm experts at SchedMD. You can download it from SchedMD’s GitHub repository; for more information, check out the included README. If you need help getting started with Slurm, check out the quick start guide. For help with the Slurm features for GCP, check out the Slurm Auto-Scaling Cluster and Slurm Cluster Federation codelabs. If you have further questions, you can post on the Slurm on GCP Google discussion group or contact SchedMD directly.
