Cluster management capabilities

The A4X, A4, and A3 Ultra machine series are designed for running large-scale artificial intelligence (AI) and machine learning (ML) clusters. They provide the following cluster management capabilities:

AI infrastructure resources colocation
Cluster topology-aware placement
Cluster operational mode
Cluster maintenance scheduling and controls
Cluster monitoring and diagnostic tooling

AI infrastructure resources colocation

When you use A4X, A4, and A3 Ultra, you can request host machines that Compute Engine provisions as close together as possible. These machines offer the following features:

Compute Engine provisions the machines as blocks of resources.
A dynamic machine learning (ML) network fabric interconnects the machines.

This resource arrangement minimizes network hops and optimizes for the lowest network latency. To learn how to obtain capacity for deploying densely allocated blocks of accelerator-optimized machines, see Capacity overview.

Cluster topology-aware placement

After you create A4X, A4, or A3 Ultra VMs or clusters, you can get topology information at the node and cluster levels. This information helps you do the following:

Adjust your application or workload design to further minimize network latency.
Understand and troubleshoot network latency and performance issues for VMs that communicate frequently with each other. These issues can occur if the VMs are unexpectedly located far apart.

For more information, see View VMs topology.
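
As an illustration of how topology information can guide placement decisions, the following minimal sketch compares VMs by how much of their physical location they share. It assumes that each VM's topology is available as four labels (cluster, block, subblock, and host). These key names, the sample VM names, and the helper functions are hypothetical and are used here only for illustration; check View VMs topology for the fields that your VMs actually report.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical topology record for one VM. The key names below are
# assumptions for this sketch, ordered from the coarsest level to the finest.
Topology = Dict[str, str]
LEVELS = ("cluster", "block", "subblock", "host")


def topology_distance(a: Topology, b: Topology) -> int:
    """Return a rough distance between two VMs.

    0 means the same host. The value grows by one for each coarser level
    at which the two VMs first differ, up to len(LEVELS) for different clusters.
    """
    for depth, level in enumerate(LEVELS):
        if a[level] != b[level]:
            return len(LEVELS) - depth
    return 0


def group_by_subblock(vms: Dict[str, Topology]) -> Dict[Tuple[str, str, str], List[str]]:
    """Group VM names by their (cluster, block, subblock) location."""
    groups: Dict[Tuple[str, str, str], List[str]] = defaultdict(list)
    for name, topo in sorted(vms.items()):
        groups[(topo["cluster"], topo["block"], topo["subblock"])].append(name)
    return dict(groups)


# Made-up sample data, not real output from any API.
vms = {
    "vm-a": {"cluster": "c1", "block": "b1", "subblock": "s1", "host": "h1"},
    "vm-b": {"cluster": "c1", "block": "b1", "subblock": "s1", "host": "h2"},
    "vm-c": {"cluster": "c1", "block": "b2", "subblock": "s7", "host": "h9"},
}

if __name__ == "__main__":
    for key, names in group_by_subblock(vms).items():
        print(key, names)
    print("a-b distance:", topology_distance(vms["vm-a"], vms["vm-b"]))
    print("a-c distance:", topology_distance(vms["vm-a"], vms["vm-c"]))
```

A scheduler could use such a distance to place workers that communicate frequently in the same subblock, or to flag pairs of VMs that are unexpectedly far apart.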

Cluster operational mode

When you reserve capacity to create VMs or clusters with A4X, A4, or A3 Ultra machines, the machine type that you reserve determines the cluster operational mode for the VMs. This mode specifies how your VMs behave after host errors or faulty host reports. The following operational modes are available:

Managed mode: Compute Engine automatically replaces any faulty machines, but holds back part of your reserved capacity to help ensure that your VMs have the necessary resources to restart.
All capacity mode: you have access to your full reserved capacity, but you are responsible for managing failures and planned maintenance.

For more information, see Reservation operational mode.

Cluster maintenance scheduling and controls

You control maintenance of A4X, A4, and A3 Ultra machines by using topology-aware scheduling in a block of resources. This capability helps synchronize upgrades so that your workloads are more resilient to host events and experience fewer disruptions, which helps improve the goodput of your workload.

To facilitate full control of maintenance events, you can use the following features:

Maintenance scheduling type
Manage host events

Maintenance scheduling type

When you reserve capacity to create VMs or clusters of A4X, A4, or A3 Ultra machines, you can define how Compute Engine maintains the infrastructure that your VMs run on. Based on the machine type that you want to use for your VMs, you can choose between synchronized maintenance across VMs (grouped) or different maintenance schedules (independent).

For more information, see Maintenance scheduling types.
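
To show why the choice between grouped and independent maintenance can matter, the following sketch estimates goodput for a tightly coupled job. It rests on simplifying assumptions that are not from this documentation: the job stalls whenever any one of its VMs is under maintenance, goodput is the fraction of a period during which the job is not stalled, and the maintenance windows are made-up values in hours. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Window = Tuple[float, float]  # (start_hour, end_hour) of one maintenance event


def merged_downtime(windows: List[Window]) -> float:
    """Return the total hours covered by the union of all windows.

    Overlapping windows are counted once, because the job is stalled
    only once while several VMs are maintained at the same time.
    """
    total = 0.0
    cur_start: Optional[float] = None
    cur_end: Optional[float] = None
    for start, end in sorted(windows):
        if cur_end is None or start > cur_end:
            # A new, non-overlapping window begins: close the previous one.
            if cur_start is not None and cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # The window overlaps the current one: extend it if needed.
            cur_end = max(cur_end, end)
    if cur_start is not None and cur_end is not None:
        total += cur_end - cur_start
    return total


def goodput(windows: List[Window], period_hours: float) -> float:
    """Fraction of the period in which no VM is under maintenance."""
    return 1.0 - merged_downtime(windows) / period_hours


# Four VMs, each needing one 2-hour maintenance event in a 100-hour period.
grouped = [(10, 12)] * 4                                # all VMs share one window
independent = [(10, 12), (30, 32), (50, 52), (70, 72)]  # one window per VM

if __name__ == "__main__":
    print("grouped goodput:", goodput(grouped, 100))
    print("independent goodput:", goodput(independent, 100))
```

Under these assumptions, the grouped schedule stalls the job for 2 hours in total, while the independent schedule stalls it for 8 hours. Workloads that tolerate the loss of individual VMs might behave differently, so treat this model as a starting point only.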

Manage host events

After you create A4X, A4, or A3 Ultra VMs and start your workload, you can set up alerts and receive notifications when maintenance for your VMs or reserved blocks is scheduled, starts, or completes. You can also view and, if needed, manually start maintenance on a VM or reserved block before its scheduled time. These options help you proactively control and minimize downtime for your workloads.

For more information, see the following:

Cluster monitoring and diagnostic tooling

For monitoring and troubleshooting, A4X, A4, and A3 Ultra machines include the following services:

Faulty host reporting, which you can use to flag issues with individual host machines.
Support for Cloud Monitoring metrics, which help you monitor network and GPU performance.