Cluster Director overview

This document describes the key components of Cluster Director that you can use to deploy and manage accelerators. For more details on Cluster Director system architecture, see System architecture.

Cluster Director (formerly known as Hypercompute Cluster) lets you deploy and manage a group of accelerators as a single unit, with the following characteristics:

AI infrastructure resources colocation
Cluster topology-aware placement
Cluster operational mode
Cluster maintenance scheduling and controls
Cluster monitoring and diagnostic tooling

You can access Cluster Director by using the Compute Engine API or Google Kubernetes Engine (GKE) API. These APIs natively integrate with Cluster Director capabilities.

Cluster Director components

This section describes the core features and services that make up the Cluster Director suite when your virtual machine (VM) instances use a supported accelerator-optimized machine type.

AI infrastructure resources colocation

You can request that Compute Engine provision host machines as close together as possible. These machines offer the following features:

Compute Engine provisions the machines as blocks of resources.
A dynamic machine learning (ML) network fabric interconnects the machines.

This resource arrangement minimizes network hops and optimizes for the lowest network latency. To learn more about how to obtain capacity to deploy densely allocated blocks of accelerator-optimized machines, see Capacity overview.

Cluster topology-aware placement

After you create VMs or clusters, you can get topology information at the node and cluster levels. This information helps you do the following:

Adjust your application or workload design to further minimize network latency.
Understand and troubleshoot network latency and performance issues for VMs that communicate frequently with each other. These issues can occur if the VMs are unexpectedly located far apart.

For more information, see View VMs topology.
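
As an illustration of how topology information can support the two tasks above, the following Python sketch groups VMs by block and flags pairs that share few topology levels. It is a minimal sketch under stated assumptions: the record layout (cluster, block, subblock, and host fields), the VM names, and the helper functions are all hypothetical and are not taken from this document or from real API output.

```python
from collections import defaultdict

# Hypothetical topology records. The field names and values are assumptions
# made for illustration; they are not real API output.
vm_topology = {
    "vm-a": {"cluster": "c1", "block": "b1", "subblock": "s1", "host": "h1"},
    "vm-b": {"cluster": "c1", "block": "b1", "subblock": "s1", "host": "h2"},
    "vm-c": {"cluster": "c1", "block": "b1", "subblock": "s2", "host": "h3"},
    "vm-d": {"cluster": "c1", "block": "b2", "subblock": "s5", "host": "h9"},
}

# Topology levels ordered from the widest scope to the narrowest.
LEVELS = ("cluster", "block", "subblock", "host")


def shared_depth(a, b):
    """Count how many topology levels two VMs share, from the top down."""
    depth = 0
    for level in LEVELS:
        if a[level] != b[level]:
            break
        depth += 1
    return depth


def group_by_block(topology):
    """Map each (cluster, block) pair to the sorted names of its VMs."""
    groups = defaultdict(list)
    for name, record in sorted(topology.items()):
        groups[(record["cluster"], record["block"])].append(name)
    return dict(groups)


def far_apart_pairs(topology, min_shared=2):
    """Return VM pairs that share fewer than min_shared topology levels."""
    names = sorted(topology)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if shared_depth(topology[x], topology[y]) < min_shared
    ]
```

With the sample data, `far_apart_pairs(vm_topology)` returns the pairs that include `vm-d`, because it is the only VM in a different block. In a real workload, such pairs are candidates for rescheduling if they communicate frequently.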

Cluster operational mode

When you reserve capacity to create VMs or clusters, the machine type that you reserve determines the cluster operational mode for the VMs. This mode specifies how your VMs behave after host errors or faulty host reports. The following operational modes are available for a VM:

Managed mode: Compute Engine automatically replaces any faulty machines, but holds back part of your reserved capacity to help ensure that your VMs have the necessary resources to restart.
All capacity mode: You have access to your full reserved capacity, but you are responsible for managing failures and planned maintenance.

For more information, see Reservation operational mode.
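
The trade-off between the two operational modes can be sketched as a small Python model. This is only an illustration: the class names, the `held_back_hosts` parameter, and the numbers are hypothetical. This document does not state how much capacity managed mode holds back, so the example takes that amount as an input rather than assuming a value.

```python
from dataclasses import dataclass
from enum import Enum


class OperationalMode(Enum):
    MANAGED = "managed"
    ALL_CAPACITY = "all_capacity"


@dataclass(frozen=True)
class Reservation:
    """Toy model of a reservation; not a real API object."""

    reserved_hosts: int
    mode: OperationalMode
    # Hypothetical input: the amount that managed mode holds back is not
    # specified in this document.
    held_back_hosts: int = 0

    def usable_hosts(self):
        """Hosts available to your VMs under the selected mode."""
        if self.mode is OperationalMode.MANAGED:
            return self.reserved_hosts - self.held_back_hosts
        # All capacity mode exposes the full reservation.
        return self.reserved_hosts

    def failure_owner(self):
        """Who replaces faulty machines under the selected mode."""
        if self.mode is OperationalMode.MANAGED:
            return "Compute Engine"
        return "you"
```

For example, `Reservation(16, OperationalMode.MANAGED, held_back_hosts=2).usable_hosts()` returns 14, while the same reservation in all capacity mode returns 16 and leaves failure handling to you.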

Cluster maintenance scheduling and controls

You control maintenance by using topology-aware scheduling in a block of resources. This capability helps synchronize upgrades so that your workloads are more resilient to host events and experience fewer disruptions, which helps improve the goodput of your workload.

To facilitate full control of maintenance events, you can use the following features:

Maintenance scheduling type
Manage host events

Maintenance scheduling type

When you reserve capacity to create VMs or clusters, you can define how Compute Engine maintains the infrastructure that your VMs run on. Depending on the machine type that you want to use for your VMs, you can choose either synchronized maintenance across VMs (grouped) or different maintenance schedules (independent).

For more information, see Maintenance scheduling types.
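
To see why synchronized maintenance can help goodput, consider the following toy Python model. It rests on assumptions that are not from this document: the job is tightly coupled and stalls whenever any one of its VMs is under maintenance, each maintenance window is given as a (start hour, end hour) pair, and the window times and function names are invented for illustration.

```python
def job_downtime_hours(windows):
    """Total hours during which at least one VM is under maintenance.

    windows: list of (start_hour, end_hour) tuples, one per VM.
    Overlapping windows are merged so shared downtime is counted once.
    """
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding time.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return sum(end - start for start, end in merged)


def goodput(windows, total_hours):
    """Fraction of total_hours in which the job is not stalled."""
    return (total_hours - job_downtime_hours(windows)) / total_hours


# Hypothetical schedules for four VMs, each needing two hours of maintenance.
grouped = [(10, 12)] * 4                                 # all VMs at once
independent = [(10, 12), (30, 32), (50, 52), (70, 72)]   # one VM at a time
```

Under these assumptions, the grouped schedule stalls the job for 2 hours and the independent schedule stalls it for 8 hours, so `goodput(grouped, 100)` is higher than `goodput(independent, 100)`. Workloads that tolerate the loss of a single VM behave differently, which is one reason the independent option exists.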

Manage host events

After you create VMs and start your workload, you can set up alerts and receive notifications when maintenance for your VMs or reserved blocks is scheduled, starts, or completes. You can also view and, if needed, manually start maintenance on a VM or reserved block before its scheduled time. These options help you proactively control and minimize downtime for your workloads.

For more information, see the following:

Cluster monitoring and diagnostic tooling

For monitoring and troubleshooting, Cluster Director includes the following services:

Faulty host reporting, which you can use to flag issues with individual host machines.
Support for Cloud Monitoring metrics, which help you monitor network and GPU performance.

Supported accelerator machine types

Cluster Director supports the following accelerator-optimized machine types:
