This document describes the network services that you configure for AI Hypercomputer cluster and VM deployments. The specific network services depend on the deployment option that you choose for your VMs or clusters.
This document is intended for architects, network engineers, and developers who want to understand the network services for their AI Hypercomputer deployments. This document assumes you have a basic familiarity with cloud networking and distributed computing concepts. For more information about deployment options, see VM and cluster creation overview.
This document details the network services you configure for the following deployment options:
- Networking for GKE deployments with a default configuration
- Networking for GKE deployments with custom configuration
- Networking for Slurm cluster deployments
- Networking for Compute Engine instances
Networking for GKE deployments with a default configuration
When you create an AI-optimized GKE cluster with default settings, you define your network settings in the Cluster Toolkit blueprint. The blueprint varies based on the machine type that you select. As an example, this section describes the network configuration in the Cluster Toolkit blueprint that deploys a GKE cluster with A4 machines.
The blueprint sets up the network in the following ways:
- Uses the default VPC: The blueprint uses the default Virtual Private Cloud network for the main GKE cluster.
- Creates two additional VPCs: The blueprint sets up two distinct Virtual Private Cloud networks. One is for a second host Network Interface Card (NIC), and the other is for Graphics Processing Unit (GPU)-to-GPU Remote Direct Memory Access (RDMA) traffic. By using this multiple VPC setup, you can improve network isolation. For more information, see Multi-VPC environment.
- Defines IP address ranges: The blueprint sets the private IP address space for your GKE nodes. It configures secondary IP ranges for Pods and Services. GKE uses IP address aliasing to prevent IP address conflicts.
- Applies an RDMA-optimized network profile: The blueprint applies a pre-set, Google-managed network profile to the VPC used for GPU traffic. This profile automatically configures the network for the high-speed and low-delay performance that RDMA needs. For more information, see Network profiles for specific use cases.
- Automates subnet creation for RDMA: To ensure the best performance, the blueprint automatically creates eight dedicated subnets within the RDMA VPC. It creates one subnet for each of the eight RDMA NICs on an accelerator VM.
- Configures firewall rules: The blueprint sets up firewall rules that allow all Transmission Control Protocol (TCP), User Datagram Protocol (UDP), and Internet Control Message Protocol (ICMP) traffic between nodes within the cluster. This allows nodes to communicate freely. It also configures an authorized Classless Inter-Domain Routing (CIDR) range to limit access to the GKE cluster's control plane for security reasons.
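To make these objects more concrete, the following Python sketch shows roughly how the RDMA VPC, its eight subnets, and an allow-internal firewall rule could be created with the google-cloud-compute client library. This is an illustration only: all project IDs, names, and IP ranges are hypothetical placeholders, and the sketch omits the Google-managed RDMA network profile and other settings that the blueprint applies for you.

```python
# Hypothetical sketch only: the default blueprint creates these objects for
# you. Project ID, region, names, and IP ranges below are placeholders, and
# the Google-managed RDMA network profile is intentionally omitted.
from google.cloud import compute_v1

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder
RDMA_NET = "gpu-rdma-net"   # placeholder name for the GPU-to-GPU RDMA VPC


def create_rdma_network() -> None:
    """Create a custom-mode VPC for GPU-to-GPU RDMA traffic."""
    network = compute_v1.Network(
        name=RDMA_NET,
        auto_create_subnetworks=False,  # custom mode; subnets created below
        mtu=8896,                       # large MTU for high-throughput traffic
    )
    compute_v1.NetworksClient().insert(
        project=PROJECT_ID, network_resource=network
    ).result()


def create_rdma_subnets() -> None:
    """Create eight subnets, one for each RDMA NIC on an accelerator VM."""
    client = compute_v1.SubnetworksClient()
    for i in range(8):
        subnet = compute_v1.Subnetwork(
            name=f"{RDMA_NET}-sub-{i}",
            ip_cidr_range=f"192.168.{i}.0/24",  # placeholder ranges
            network=f"projects/{PROJECT_ID}/global/networks/{RDMA_NET}",
        )
        client.insert(
            project=PROJECT_ID, region=REGION, subnetwork_resource=subnet
        ).result()


def allow_internal_traffic(network_name: str, internal_cidr: str) -> None:
    """Allow all TCP, UDP, and ICMP traffic between nodes in the cluster."""
    firewall = compute_v1.Firewall(
        name=f"{network_name}-allow-internal",
        network=f"projects/{PROJECT_ID}/global/networks/{network_name}",
        allowed=[
            compute_v1.Allowed(I_p_protocol="tcp"),
            compute_v1.Allowed(I_p_protocol="udp"),
            compute_v1.Allowed(I_p_protocol="icmp"),
        ],
        source_ranges=[internal_cidr],
    )
    compute_v1.FirewallsClient().insert(
        project=PROJECT_ID, firewall_resource=firewall
    ).result()


if __name__ == "__main__":
    create_rdma_network()
    create_rdma_subnets()
    allow_internal_traffic(RDMA_NET, "192.168.0.0/16")  # placeholder range
```

In a default deployment you don't run code like this yourself; the Cluster Toolkit blueprint creates and wires up these objects, including the RDMA network profile, during deployment.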
Networking for GKE deployments with custom configuration
When you require more granular control than the default Cluster Toolkit blueprints provide, manually configure the network objects for an AI-optimized GKE cluster. This approach lets you tailor the network setup to your workload-specific needs.
The configuration that you use depends on whether you plan to run distributed AI workloads:
- For non-distributed workloads: Create a GKE cluster without GPUDirect RDMA. This method uses a single VPC network for all communication.
- For distributed workloads: Create a GKE cluster with GPUDirect RDMA enabled. Enabling GPUDirect RDMA is essential for achieving optimal performance at scale. This configuration involves a multi-VPC environment that separates general-purpose traffic from high-bandwidth, low-latency GPU-to-GPU communication.
For detailed, step-by-step instructions on creating a custom AI-optimized GKE cluster for both scenarios, see Create a custom AI-optimized GKE cluster.
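As a rough sketch of the non-distributed case, the following Python example uses the google-cloud-container client library to request a VPC-native cluster on a single VPC, with secondary ranges for Pods and Services and an authorized control-plane CIDR range. The cluster name, network, ranges, and parent path are hypothetical placeholders. The distributed case with GPUDirect RDMA requires additional VPCs, subnets, and node pool settings that are covered in the linked guide and are not shown here.

```python
# Hypothetical sketch: a GKE cluster for non-distributed workloads on a
# single VPC. Names, ranges, and the parent path are placeholders; see the
# linked guide for the full, supported procedure.
from google.cloud import container_v1

PARENT = "projects/my-project/locations/us-central1"  # placeholder

cluster = container_v1.Cluster(
    name="ai-cluster",        # placeholder
    network="default",        # single VPC for all communication
    subnetwork="default",
    initial_node_count=1,
    # VPC-native (IP alias) networking with secondary ranges for Pods and
    # Services, which prevents IP address conflicts.
    ip_allocation_policy=container_v1.IPAllocationPolicy(
        use_ip_aliases=True,
        cluster_secondary_range_name="pods-range",       # placeholder
        services_secondary_range_name="services-range",  # placeholder
    ),
    # Restrict access to the cluster control plane to an authorized CIDR range.
    master_authorized_networks_config=container_v1.MasterAuthorizedNetworksConfig(
        enabled=True,
        cidr_blocks=[
            container_v1.MasterAuthorizedNetworksConfig.CidrBlock(
                display_name="admin-range",
                cidr_block="203.0.113.0/24",  # placeholder
            )
        ],
    ),
)

client = container_v1.ClusterManagerClient()
operation = client.create_cluster(parent=PARENT, cluster=cluster)
print(f"Cluster create operation: {operation.name}")
```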
Networking for Slurm cluster deployments
You can use Cluster Toolkit to deploy high performance computing (HPC), AI, and ML workloads on Google Cloud through highly customizable and extensible blueprints, for example, to create an AI-optimized Slurm cluster with an A4 machine type. This section explains the network services configured in the A4 blueprint, which helps you understand the network settings that you can change when you create Slurm clusters.
During deployment, the Cluster Toolkit blueprint uses Packer to automatically build a custom operating system (OS) image. Packer creates the image by launching a temporary VM and running scripts to customize the boot disk. You can customize the image using startup scripts, shell scripts, or Ansible playbooks. The blueprint then uses this custom image to install the required system software for cluster and workload management on the Slurm nodes.
The network components the blueprint configures are as follows:
- Creates three distinct VPCs: The blueprint creates a primary VPC for the Slurm control plane, a secondary VPC for general host-level traffic, and a dedicated high-performance VPC for GPU-to-GPU communication. This separation prevents management traffic from interfering with the workload data plane. For more information, see Multi-VPC environment.
- Applies an RDMA-optimized network profile: For the GPU data plane, the blueprint applies a pre-configured, Google-managed network profile optimized for RoCE. It automatically creates eight subnets, one for each RDMA NIC on the accelerator VMs. For more information, see Network profiles for specific use cases.
- Reserves an IP address range for shared storage: The blueprint reserves a dedicated IP address range required by the Filestore service. Filestore provides the shared /home directory for the cluster. A sketch of one way to reserve such a range follows this list.
- Provides an isolated image-build network: The blueprint creates a temporary VPC used only during the process of building the custom VM image for the cluster nodes. This provides an isolated network environment for Packer operations.
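As one hedged illustration of the shared-storage item above, the following Python sketch reserves an internal IP range on a host VPC by using the google-cloud-compute client library. This shows a common private-services-access style allocation; the blueprint itself may instead simply pass a reserved CIDR range to its Filestore module, and all names and sizes here are placeholders.

```python
# Hypothetical sketch: reserve an internal IP range on a VPC that a managed
# service such as Filestore can use. Names and sizes are placeholders, and
# the service networking peering step is out of scope here.
from google.cloud import compute_v1

PROJECT_ID = "my-project"     # placeholder
HOST_NET = "slurm-host-net"   # placeholder VPC for general host traffic

allocation = compute_v1.Address(
    name="filestore-range",   # placeholder
    purpose="VPC_PEERING",    # reserve the range for service producers
    address_type="INTERNAL",
    prefix_length=24,         # size of the reserved block
    network=f"projects/{PROJECT_ID}/global/networks/{HOST_NET}",
)

compute_v1.GlobalAddressesClient().insert(
    project=PROJECT_ID, address_resource=allocation
).result()
```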
For more deployment options, see the Cluster Toolkit documentation.
Networking for Compute Engine instances
With Compute Engine, you can create standalone VMs, VM instances in bulk, and managed instance groups (MIGs) for various accelerator-optimized machine types.
These machine types require a multi-VPC network configuration to handle different kinds of traffic. This configuration separates general host-to-host traffic from high-bandwidth GPU-to-GPU communication. The specific network requirements vary depending on the machine type.
For detailed information about the NICs and network configuration for your machine type, see Review network bandwidth and NIC arrangement.
For step-by-step instructions on how to create these VPC networks, see Create VPC networks.
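To illustrate the multi-VPC NIC layout described in this section, the following Python sketch builds a Compute Engine instance request with one interface on a host VPC and eight interfaces on a dedicated RDMA VPC, using the google-cloud-compute client library. The machine type, boot image, "MRDMA" NIC type value, and all names are assumptions for illustration only; follow the linked guides for the exact NIC arrangement, images, and other settings that your machine type requires.

```python
# Hypothetical sketch: a multi-NIC accelerator VM whose interfaces are split
# across a host VPC and a dedicated GPU/RDMA VPC. Machine type, image, NIC
# type, and names are assumptions; check the linked guides for real values.
from google.cloud import compute_v1

PROJECT_ID = "my-project"   # placeholder
ZONE = "us-central1-a"      # placeholder
HOST_NET = "host-net"       # general host-to-host traffic
RDMA_NET = "gpu-rdma-net"   # GPU-to-GPU RDMA traffic


def nic(network: str, subnet: str, nic_type: str) -> compute_v1.NetworkInterface:
    """Build one network interface attached to the given VPC and subnet."""
    return compute_v1.NetworkInterface(
        network=f"projects/{PROJECT_ID}/global/networks/{network}",
        subnetwork=f"projects/{PROJECT_ID}/regions/us-central1/subnetworks/{subnet}",
        nic_type=nic_type,
    )


# One GVNIC interface on the host network, plus one interface per RDMA NIC.
interfaces = [nic(HOST_NET, "host-sub", "GVNIC")]
interfaces += [
    nic(RDMA_NET, f"{RDMA_NET}-sub-{i}", "MRDMA")  # assumed RDMA NIC type
    for i in range(8)
]

instance = compute_v1.Instance(
    name="a4-node-0",  # placeholder
    machine_type=f"zones/{ZONE}/machineTypes/a4-highgpu-8g",  # assumed type
    network_interfaces=interfaces,
    disks=[
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                # Placeholder image and size; accelerator machine types have
                # specific image and disk requirements.
                source_image="projects/debian-cloud/global/images/family/debian-12",
                disk_size_gb=200,
            ),
        )
    ],
)

compute_v1.InstancesClient().insert(
    project=PROJECT_ID, zone=ZONE, instance_resource=instance
).result()
```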
What's next
- To identify the best deployment for your workload, see Recommended configurations.
- To understand the use case for each deployment option, see VM and cluster creation overview.
- To create an AI-optimized GKE cluster with default configuration, see Create an AI-optimized GKE cluster with default configuration.
- To create a custom AI-optimized GKE cluster, see Create a custom AI-optimized GKE cluster.
- To create an AI-optimized Slurm cluster with an A4 machine type, see Create an AI-optimized Slurm cluster with an A4 machine type.
- To create an AI-optimized instance with A4 or A3 Ultra, see Create an AI-optimized instance with A4 or A3 Ultra.