Cluster Director overview

This document describes the key components of Cluster Director that you can use to deploy and manage accelerators. For more details on Cluster Director system architecture, see System architecture.

Cluster Director (formerly known as Hypercompute Cluster) lets you deploy and manage a group of accelerators as a single unit, with the following characteristics:

  • AI infrastructure resources colocation

  • Cluster topology-aware placement

  • Cluster operational mode

  • Cluster maintenance scheduling and controls

  • Cluster monitoring and diagnostic tooling

You can access Cluster Director by using the Compute Engine API or Google Kubernetes Engine (GKE) API. These APIs natively integrate with Cluster Director capabilities.

Cluster Director components

This section describes the core features and services that make up the Cluster Director suite when your virtual machine (VM) instances use a supported accelerator-optimized machine type.

AI infrastructure resources colocation

You can request host machines that Compute Engine provisions as close together as possible. These machines offer the following features:

This resource arrangement minimizes network hops and optimizes for lowest network latency. To learn more about how to obtain capacity to deploy densely allocated blocks of accelerator-optimized machines, see Capacity overview.

Cluster topology-aware placement

After you create VMs or clusters, you can get topology information at the node and cluster levels. This information helps you do the following:

  • Adjust your application or workload design to further minimize network latency.

  • Understand and troubleshoot network latency and performance issues for VMs that communicate frequently with each other. These issues can occur if the VMs are unexpectedly located far apart.

For more information, see View VMs topology.
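As a rough illustration of how topology information can guide troubleshooting, the following Python sketch ranks VM pairs by how much of their physical placement they share. The `Topology` fields (`cluster`, `block`, `subblock`, `host`) and the sample values are hypothetical placeholders for whatever topology labels you retrieve for your VMs. They are assumptions made for this example, not a documented response format.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Topology:
    """Hypothetical per-VM placement labels, ordered from coarsest to finest."""
    cluster: str
    block: str
    subblock: str
    host: str


def shared_levels(a: Topology, b: Topology) -> int:
    """Count the leading levels two VMs share (0 = none, 4 = same host)."""
    count = 0
    for x, y in zip(
        (a.cluster, a.block, a.subblock, a.host),
        (b.cluster, b.block, b.subblock, b.host),
    ):
        if x != y:
            break
        count += 1
    return count


def farthest_pairs(vms: dict[str, Topology]) -> list[tuple[str, str, int]]:
    """Return VM pairs sorted so that the least-colocated pairs come first."""
    pairs = [
        (n1, n2, shared_levels(vms[n1], vms[n2]))
        for n1, n2 in combinations(sorted(vms), 2)
    ]
    return sorted(pairs, key=lambda p: p[2])


# Sample data (assumed values, for illustration only).
vms = {
    "vm-a": Topology("c1", "b1", "s1", "h1"),
    "vm-b": Topology("c1", "b1", "s1", "h2"),
    "vm-c": Topology("c1", "b2", "s9", "h7"),
}

for name1, name2, shared in farthest_pairs(vms):
    print(f"{name1} <-> {name2}: {shared} shared level(s)")
```

Pairs that appear first in the output share the fewest placement levels, so they are the likeliest candidates to investigate when VMs that communicate frequently show unexpected latency.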

Cluster operational mode

When you reserve capacity to create VMs or clusters, the machine type that you reserve determines the cluster operational mode for the VMs. This mode specifies how your VMs behave after host errors or faulty host reports. The following operational modes are available:

  • Managed mode: Compute Engine automatically replaces any faulty machines, but holds back part of your reserved capacity to help ensure that your VMs have the necessary resources to restart.

  • All capacity mode: you have access to your full reserved capacity, but you are responsible for managing failures and planned maintenance.

For more information, see Reservation operational mode.

Cluster maintenance scheduling and controls

You can control maintenance by using topology-aware scheduling within a block of resources. This capability helps synchronize upgrades, which makes your workloads more resilient to host events, minimizes disruptions, and helps improve the goodput of your workloads.

To facilitate full control of maintenance events, you can use the following features:

Maintenance scheduling type

When you reserve capacity to create VMs or clusters, you can define how Compute Engine maintains the infrastructure that your VMs run on. Depending on the machine type that you want to use for your VMs, you can choose either synchronized maintenance across VMs (grouped) or separate maintenance schedules (independent).

For more information, see Maintenance scheduling types.

Manage host events

After you create VMs and start your workload, you can set up alerts and receive notifications when maintenance for your VMs or reserved blocks is scheduled, starts, or is completed. You can also view maintenance and, if needed, manually start it on a VM or reserved block before its scheduled time. These options help you proactively control and minimize downtime for your workloads.
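To make the workflow concrete, the following Python sketch shows one way to decide which scheduled maintenance events are close enough to act on, for example by checkpointing a workload or starting maintenance early. The `MaintenanceEvent` record, its field names, and the sample values are hypothetical. Real notifications may carry different fields, so treat this as a sketch under those assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MaintenanceEvent:
    """Hypothetical record built from a maintenance notification."""
    target: str             # name of a VM or reserved block
    window_start: datetime  # when the maintenance window opens
    status: str             # "scheduled", "started", or "completed"


def due_soon(events, now, lead_time):
    """Return scheduled events whose window opens within lead_time of now."""
    upcoming = (
        e for e in events
        if e.status == "scheduled" and now <= e.window_start <= now + lead_time
    )
    return sorted(upcoming, key=lambda e: e.window_start)


# Sample data (assumed values, for illustration only).
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    MaintenanceEvent("vm-a", now + timedelta(hours=2), "scheduled"),
    MaintenanceEvent("block-1", now + timedelta(hours=30), "scheduled"),
    MaintenanceEvent("vm-b", now - timedelta(hours=1), "started"),
    MaintenanceEvent("vm-c", now + timedelta(hours=1), "scheduled"),
]

soon = due_soon(events, now, timedelta(hours=24))
for event in soon:
    print(f"{event.target}: maintenance window opens at {event.window_start.isoformat()}")
```

Events that have already started or that fall outside the lead time are ignored, so the result contains only the maintenance that you can still prepare for.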

For more information, see the following:

Cluster monitoring and diagnostic tooling

For monitoring and troubleshooting, Cluster Director includes the following services:

Supported accelerator machine types

Cluster Director supports the following accelerator-optimized machine types:
