Hypercompute Cluster

Hypercompute Cluster is an infrastructure building block designed to deploy and maintain a large number of accelerator-optimized virtual machine (VM) instances as a single, homogeneous unit. Each unit is also configured to deliver non-blocking network traffic.

Hypercompute Cluster lets you deploy and operate clusters for large-scale AI workloads with high performance, ML goodput, and uptime. With Hypercompute Cluster, you can scale to tens of thousands of accelerator-optimized VMs in a single cluster, with observability and health diagnostics for the cluster as a single unit.

Features

The following are the key features of Hypercompute Cluster:

  • Dense co-location of accelerator resources: your host machines are allocated physically close to each other, provisioned as blocks of resources, and are interconnected with a dynamic ML network fabric to minimize network hops and optimize for the lowest latency.

    To learn how to deploy these densely allocated blocks of A3 Ultra machines, see Request capacity.

  • Topology-aware scheduling: provides topology information at the node and cluster level that you can use for job placement. For more information, see View VMs topology.

  • Advanced maintenance scheduling and controls: you have full control over the maintenance of VM instances within a block of resources, and can synchronize upgrades so that your workloads are more resilient to host errors and experience minimal disruption. This improves the goodput of your workloads.

    To facilitate full control of maintenance events, you can set up alerts and receive notifications when maintenance is scheduled, starting, or being completed. To learn more about maintenance of these blocks of resources, see Manage host events.

    You can also define how maintenance behaves for your blocks of resources by choosing one of the following maintenance scheduling types: grouped or independent. To learn more, see Maintenance scheduling types.

  • Orchestrator integration: these clusters consist of a large number of compute resources that require complex management and orchestration. To simplify management, Hypercompute Clusters are designed to integrate with orchestrators such as GKE and Slurm.

  • Deployment, monitoring, and diagnostic tooling:

    • Cluster deployment tools: Cluster Toolkit is open-source software offered by Google Cloud that provides the deployment tool for Hypercompute Cluster. Cluster Toolkit can deploy both GKE and Slurm clusters.
    • Monitoring and troubleshooting tools: the faulty host reporting tool can be used to flag issues with individual host machines. Tools are also available for monitoring network and GPU performance. These tools help reduce the management overhead of your cluster.
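
To illustrate how topology information might inform job placement, the following is a minimal Python sketch, not the behavior of any Hypercompute Cluster API or orchestrator. It assumes you have already collected a list of (node name, block ID) pairs by some means. The function name pick_nodes, the field names, and the sample values are all hypothetical. The sketch greedily fills a job from the largest blocks first so that the job spans as few blocks as possible.

```python
from collections import defaultdict


def pick_nodes(nodes, count):
    """Choose `count` nodes while spanning as few blocks as possible.

    `nodes` is a list of (node_name, block_id) pairs. Both fields are
    hypothetical stand-ins for whatever topology data your cluster exposes.
    """
    by_block = defaultdict(list)
    for name, block in nodes:
        by_block[block].append(name)

    # Greedy choice: visit the largest blocks first (ties broken by block ID)
    # so that the selected nodes touch the fewest blocks.
    chosen = []
    for block in sorted(by_block, key=lambda b: (-len(by_block[b]), b)):
        for name in sorted(by_block[block]):
            if len(chosen) == count:
                return chosen
            chosen.append(name)

    if len(chosen) < count:
        raise ValueError("not enough nodes available")
    return chosen


# Hypothetical sample data: five VMs spread across two blocks.
sample_nodes = [
    ("vm-a", "block-1"),
    ("vm-b", "block-1"),
    ("vm-c", "block-2"),
    ("vm-d", "block-1"),
    ("vm-e", "block-2"),
]

# A three-node job fits entirely inside block-1.
placement = pick_nodes(sample_nodes, 3)
```

In practice, an orchestrator such as GKE or Slurm would typically make this decision for you. The sketch only shows why keeping a job inside one block reduces the number of network hops between its nodes.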

What's next?