Cluster Director (formerly known as Hypercompute Cluster) lets you deploy and manage a group of accelerators as a single unit with physically colocated VMs, targeted workload placement, advanced cluster maintenance controls, and topology-aware scheduling. Cluster Director can be accessed directly through Compute Engine APIs, or through Google Kubernetes Engine, which natively integrates with Cluster Director capabilities.
Components
This section describes the core features and services that make up the Cluster Director suite.
Dense colocation of accelerator resources
You can request host machines that are allocated physically close to each other, provisioned as blocks of resources, and interconnected with a dynamic ML network fabric. This arrangement of resources helps to minimize network hops and optimize for the lowest latency.
To learn how to deploy these densely allocated blocks of A3 Ultra or A4 accelerator machines, see Request capacity.
Topology-aware scheduling
You can get topology information at the node and cluster levels and use it to decide where to place jobs. For more information, see View VMs topology.
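As a rough illustration of how topology information might feed job placement, the following Python sketch picks a set of VMs that sit as close together as possible. The `/block/subblock/host` path format, the VM names, and the functions `parse_topology` and `pick_colocated` are assumptions made for this example; they are not part of the Cluster Director API, and a real scheduler would read topology from the sources described in View VMs topology.

```python
from collections import defaultdict


def parse_topology(path):
    """Split a hypothetical '/block/subblock/host' path into its three parts."""
    block, subblock, host = path.strip("/").split("/")
    return block, subblock, host


def pick_colocated(vms, count):
    """Pick `count` VM names, preferring one sub-block, then one block.

    `vms` maps a VM name to its (assumed) topology path.
    """
    by_subblock = defaultdict(list)
    by_block = defaultdict(list)
    for name, path in sorted(vms.items()):
        block, subblock, _host = parse_topology(path)
        by_subblock[(block, subblock)].append(name)
        by_block[block].append(name)

    # Try the tightest grouping first, then widen the search.
    for groups in (by_subblock, by_block):
        fitting = [members for members in groups.values() if len(members) >= count]
        if fitting:
            # Use the smallest group that fits, leaving larger groups free.
            return min(fitting, key=len)[:count]

    # No single block is large enough, so the job has to span blocks.
    return sorted(vms)[:count]


example_vms = {
    "vm-a": "/b1/s1/h1",
    "vm-b": "/b1/s1/h2",
    "vm-c": "/b1/s2/h3",
    "vm-d": "/b2/s3/h4",
}
print(pick_colocated(example_vms, 2))
print(pick_colocated(example_vms, 3))
```

With the example data, a two-VM job fits inside one sub-block, while a three-VM job has to widen to the whole block. Preferring the smallest group that fits is one possible design choice among several; it keeps larger contiguous groups available for bigger jobs.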
Advanced maintenance scheduling and controls
You have full control over the maintenance of VM instances within a block of resources, and can synchronize upgrades so that your workloads are more resilient to host errors and experience minimal disruption. This approach improves the goodput of your workloads, that is, the share of time they spend making useful progress.
To facilitate full control of maintenance events, you can set up alerts and receive notifications when maintenance is scheduled, starts, or completes. To learn more about maintenance of these blocks of resources, see the following:
You can also define how you want maintenance to behave for your blocks of resources. You can choose between the following maintenance scheduling types: grouped or independent. To learn more about maintenance scheduling types, see Maintenance scheduling types.
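To see why synchronizing maintenance can matter, consider a tightly coupled job that pauses whenever any one of its VMs is under maintenance. The Python sketch below counts how many separate interruptions such a job would see for a given set of maintenance windows. The window times, the four-VM job, and the function `count_interruptions` are hypothetical, and the model is a simplification for illustration rather than a description of how Cluster Director schedules grouped or independent maintenance.

```python
def count_interruptions(windows):
    """Count the distinct interruptions caused by per-VM maintenance windows.

    `windows` is a list of (start_hour, end_hour) tuples, one per VM.
    Overlapping or touching windows are merged, because the job is
    already paused while any one of them is in progress.
    """
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # This window overlaps the previous one: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return len(merged)


# Hypothetical four-VM job: every VM needs two hours of maintenance.
independent_windows = [(0, 2), (5, 7), (10, 12), (15, 17)]  # each VM on its own
grouped_windows = [(0, 2)] * 4                               # all VMs together

print(count_interruptions(independent_windows))
print(count_interruptions(grouped_windows))
```

In this toy model, the total maintenance time per VM is the same in both cases, but the synchronized schedule stops the job once instead of four times, which is the kind of effect the goodput discussion above refers to.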
Monitoring and diagnostic tooling
For monitoring and troubleshooting, Cluster Director includes services such as faulty host reporting, which you can use to flag issues with individual host machines. To help reduce the overhead of managing your cluster, services are also available for monitoring network and GPU performance.
What's next?
- Review terminology.
- Choose a deployment strategy.