Cluster Director

AI orchestration for Kubernetes and Slurm clusters

Configure, deploy, and manage AI or HPC clusters with ease. Get the automation benefits of managed infrastructure without giving up control.

Features

A managed infrastructure service for Slurm and Kubernetes

You can access Cluster Director capabilities in two ways:

  • Directly through the control plane, API, or CLI for jobs using Slurm or custom orchestrators. This unified environment also integrates with Google Kubernetes Engine (GKE), giving you a single, powerful interface to create and oversee both Slurm and Kubernetes clusters.
  • Within GKE or Compute Engine. In these contexts, you can access the Cluster Director capabilities while working from a familiar environment.

Job scheduling, simplified

Cluster Director provides fault-tolerant and highly scalable job scheduling out of the box. The controller node is managed for you. You can easily configure the login nodes for your cluster, including machine type, source image, and boot-disk size.

Intuitive cluster management

Use the control plane to easily create, update, and delete your cluster. It also simplifies networking by allowing you to deploy clusters on a new, purpose-built VPC network or an existing one. For storage, you can create and attach a new Filestore or Google Cloud Managed Lustre instance, or connect to an existing Cloud Storage bucket.

Topology-aware placement

To maximize performance, Cluster Director is deeply integrated with Google's network topology. This ensures that VMs within a cluster are placed in close physical proximity, which reduces network latency. Low latency is critical for highly synchronized distributed training workloads.
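The idea behind topology-aware placement can be illustrated with a small sketch. The inventory, the block and rack labels, and the function names below are hypothetical and chosen only for illustration; this is not Cluster Director's placement algorithm or data model. The sketch prefers a group of VMs that share a rack, then a group that shares a block, and only then falls back to any available VMs.

```python
from collections import defaultdict

# Hypothetical inventory: each VM is labelled with (block, rack) coordinates.
# These labels are illustrative and are not Cluster Director's data model.
VMS = {
    "vm-a": ("block-1", "rack-1"),
    "vm-b": ("block-1", "rack-1"),
    "vm-c": ("block-1", "rack-2"),
    "vm-d": ("block-2", "rack-7"),
    "vm-e": ("block-2", "rack-7"),
    "vm-f": ("block-2", "rack-7"),
}


def hop_distance(a, b):
    """Return a coarse distance: 0 same rack, 1 same block, 2 otherwise."""
    if a == b:
        return 0
    if a[0] == b[0]:
        return 1
    return 2


def pick_compact_group(vms, size):
    """Pick `size` VMs, preferring a single rack, then a single block."""
    by_rack = defaultdict(list)
    by_block = defaultdict(list)
    for name, (block, rack) in sorted(vms.items()):
        by_rack[(block, rack)].append(name)
        by_block[block].append(name)
    # Try the tightest grouping first (rack), then the looser one (block).
    for groups in (by_rack, by_block):
        for key in sorted(groups):
            if len(groups[key]) >= size:
                return groups[key][:size]
    # Fall back to any VMs when no single rack or block is large enough.
    return sorted(vms)[:size]


def worst_distance(vms, chosen):
    """Return the largest pairwise hop distance within the chosen group."""
    return max(hop_distance(vms[a], vms[b]) for a in chosen for b in chosen)


group = pick_compact_group(VMS, 3)
print(group, worst_distance(VMS, group))
```

In this toy inventory a three-VM request fits inside one rack, so the worst pairwise distance is zero. A four-VM request fits in no single rack or block, so the sketch falls back to a spread-out group with a larger worst-case distance.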

Comprehensive visibility and insights

Cluster Director's integrated observability dashboard provides a clear view of your cluster's health, utilization, and performance, so you can quickly understand your system's behavior and diagnose issues in a single place. The dashboard is designed to easily scale to tens of thousands of VMs.

Uninterrupted training runs

Get foundational reliability by requesting a Bill of Health, plus additional features such as 3-tier checkpointing and advanced maintenance controls to help maximize training efficiency.

How it works

AI infrastructure users can spend weeks wrestling with configurations before hitting 'deploy,' but it doesn't have to be that way. Learn what to expect as a first-time Cluster Director user, from preparing an environment, to deploying a cluster, to turning interruptions into managed events.

What is Cluster Director?

Common uses

Pre-deployment qualification and preparation

Design a high-performance, reliable foundation

Before you spin up a cluster, you need assurance your accelerators will be performant and reliable from the get-go. Cluster Director provides intelligent, topology-aware placement for your TPUs and GPUs.

Every compute, networking, and storage component is validated through a rigorous, multi-stage qualification process, captured in a detailed Bill of Health that provides the ultimate proof of quality and readiness.

Cluster Director Day 0

Deploy your cluster

Deploy your cluster in minutes, not days

Remove the complexity of setting up a GKE or Slurm cluster. Start with validated reference architectures, choose your accelerator and storage resources, and let Cluster Director do the rest.

Deploy a fully optimized environment at any scale with Google’s best practices for performance and topology baked in, drastically reducing deployment time.

Deployment on Cluster Director

Manage your cluster

Cluster management and observability

Bridge the gap between raw infrastructure and running a job with a single console for your Slurm cluster. Get a topology view of cluster health and utilization.

When issues arise, use job-centric observability to instantly correlate metrics across the full stack with a single job ID, turning hours of guesswork into a few clicks and quickly identifying the root cause of any slowdowns.

Detect hardware issues

Self-healing capabilities

You can use Cluster Director to proactively detect, remediate, and recover from infrastructure issues.

For example, you get always-on health checks, straggler detection, and an AI health predictor to proactively identify issues.
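As a rough illustration of what straggler detection means, the sketch below flags workers whose average training-step time is well above the median across all workers. The function name, the input shape, and the 1.5x threshold are assumptions made for this example; the page does not describe how Cluster Director detects stragglers, and this is not that implementation.

```python
from statistics import median


def find_stragglers(step_seconds, tolerance=1.5):
    """Return workers whose mean step time exceeds tolerance x the median.

    step_seconds maps a worker name to its recent per-step durations.
    Both the data shape and the 1.5x threshold are illustrative assumptions.
    """
    # Average each worker's recent step times, skipping workers with no data.
    means = {w: sum(t) / len(t) for w, t in step_seconds.items() if t}
    if not means:
        return []
    # Use the median as the "typical" step time so one slow worker
    # does not skew the baseline.
    typical = median(means.values())
    return sorted(w for w, m in means.items() if m > tolerance * typical)


# Hypothetical per-step timings, in seconds, for four workers.
samples = {
    "worker-0": [1.0, 1.1, 0.9],
    "worker-1": [1.0, 1.0, 1.0],
    "worker-2": [2.4, 2.6, 2.5],
    "worker-3": [1.1, 1.0, 1.2],
}
print(find_stragglers(samples))
```

In a synchronized training job every step waits for the slowest worker, which is why a single slow node like worker-2 in this sample is worth surfacing early.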

Pricing

How Cluster Director pricing works

There is no extra charge for using Cluster Director. You only pay for the underlying Google Cloud resources that your clusters use, such as compute, storage, and networking.

Services, descriptions, and prices (USD):

  • Get started free: New users get $300 in free trial credits to use within 90 days. Price: Free.
  • Get started free: The Compute Engine free tier gives you one e2-micro VM instance, up to 30 GB standard persistent disk storage, and up to 1 GB of outbound data transfers per month. Price: Free.
  • VM instances, storage, and networking: Review our Compute Engine pricing for more information. Only pay for the services you use. No up-front fees. No termination charges. Pricing varies by product and usage. Price: Starting at $0.01 (e2-micro, pay-as-you-go).

Pricing Calculator

Estimate your monthly charges, including cluster management fees.

Need help?

Chat to us online, call us directly, or request a call back.

Ready to give it a try?

Sign up for a free trial and receive $300 in credits to use within 90 days.

Have a large project?

Create and connect to a Slurm cluster in Cluster Director by using an AI/ML training template

Choose a learning path, build your skills, and validate your knowledge with Google Cloud Skills Boost

Get technical support or provide product feedback