This document explains how to run high performance computing (HPC) workloads on Google Kubernetes Engine (GKE) clusters that use the H4D machine series and remote direct memory access (RDMA).
H4D is a machine series in the HPC-optimized machine family for Compute Engine. The machine series is optimized for high performance, low cost, and scalability. H4D works well for applications that scale across multiple nodes. H4D instances configured to use RDMA support up to 200 Gbps network bandwidth between nodes.
The instructions on this page use the Google Cloud CLI and allow you maximum flexibility in configuring your cluster environment. Alternatively, you can use Cluster Toolkit to quickly create a production-ready GKE cluster that uses H4D. For more information, see the GKE H4D Blueprint.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- Use flex-start provisioning mode to obtain H4D VMs. Or, if you need resources for more than 90 days, or more than 256 H4D VMs in a request, contact your account team. You can also provision H4D VMs on demand, subject to available capacity in your region.
- Use GKE version 1.32.6-gke.1060000 or later to create a node pool with reserved H4D VMs in GKE Standard mode.
- Use GKE version 1.33.2-gke.4731000 or later to create the following:
  - H4D nodes with flex-start
  - H4D nodes with Autopilot
  - H4D nodes with cluster autoscaling in Standard clusters
  - H4D nodes with node auto-provisioning in Standard clusters
- Use only locations where the H4D machine type is available. For more information, see the table in Available regions and zones, filtering for H4D.
- Use only Container-Optimized OS node images.
- Review the H4D limitations.
- Review how to handle host maintenance, because H4D machine types don't support live migration. For more information, see Maintenance experience for H4D instances.
Replace the following values for the commands in the next sections. If you prefer, you can define these values once as shell variables, as shown in the sketch after this list.
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your cluster.
- CONTROL_PLANE_LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters. Regional clusters are recommended for production workloads. For regional clusters, the region must include a zone in which H4D is available. For zonal clusters, the zone must have H4D availability.
- COMPUTE_ZONE: the zone of your node pool. This must be a zone in which H4D is available. You can't create a multi-zone node pool if you want the H4D nodes to work with RDMA.
- RDMA_NETWORK_PREFIX: the RDMA network prefix (for example, h4d-rdma).
- RDMA_SUBNET_CIDR: the RDMA subnet CIDR range. Ensure that this range doesn't overlap with the cluster's default networks.
- NODE_POOL_NAME: the name of your H4D node pool.
- NODE_COUNT: the number of H4D nodes to create in the node pool.
- H4D_MACHINE_TYPE: the H4D machine type to use (for example, h4d-highmem-192-lssd).
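The following is a minimal sketch of that shell-variable approach. Every value shown is a hypothetical example; if you use this approach, reference the variables as $PROJECT_ID, $CLUSTER_NAME, and so on in place of the literal placeholders in the commands that follow.
# Hypothetical example values; adjust each one to your environment.
export PROJECT_ID=my-hpc-project
export CLUSTER_NAME=h4d-cluster
export CONTROL_PLANE_LOCATION=us-central1
export COMPUTE_ZONE=us-central1-a
export RDMA_NETWORK_PREFIX=h4d-rdma
export RDMA_SUBNET_CIDR=192.168.32.0/24
export NODE_POOL_NAME=h4d-pool
export NODE_COUNT=2
export H4D_MACHINE_TYPE=h4d-highmem-192-lssd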
Create VPCs and subnets
Configure the default Virtual Private Cloud (VPC) and subnet for the cluster. For the RDMA network interface card (NIC), create a dedicated VPC and subnet. The VPC that you create with the following instructions uses the required RDMA network profile.
Create an HPC VPC for the RDMA NICs:
gcloud compute --project=PROJECT_ID \
    networks create RDMA_NETWORK_PREFIX-net \
    --network-profile=COMPUTE_ZONE-vpc-falcon \
    --subnet-mode=custom
Create a subnet for the RDMA network:
gcloud compute --project=PROJECT_ID \
    networks subnets create \
    RDMA_NETWORK_PREFIX-sub-0 \
    --network=RDMA_NETWORK_PREFIX-net \
    --region=CONTROL_PLANE_LOCATION \
    --range=RDMA_SUBNET_CIDR
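Optionally, confirm that the VPC and subnet were created as expected. This check is only a suggestion; the commands read the resources that you just created:
# Confirm that the RDMA VPC exists and uses the expected network profile.
gcloud compute networks describe RDMA_NETWORK_PREFIX-net --project=PROJECT_ID

# Confirm that the RDMA subnet exists in the expected region with the expected range.
gcloud compute networks subnets describe RDMA_NETWORK_PREFIX-sub-0 \
    --region=CONTROL_PLANE_LOCATION --project=PROJECT_ID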
Create a GKE cluster with multi-networking
Create the GKE cluster with multi-networking enabled. Optionally, with this command, you can explicitly provide the secondary CIDR ranges for services and Pods.
Run the following command:
gcloud container clusters create CLUSTER_NAME --project PROJECT_ID \
--enable-dataplane-v2 --enable-ip-alias --location=CONTROL_PLANE_LOCATION \
--enable-multi-networking \
[--services-ipv4-cidr=SERVICE_CIDR \
--cluster-ipv4-cidr=POD_CIDR]
If you use these optional flags, replace the following additional values:
- SERVICE_CIDR: the secondary CIDR range for services.
- POD_CIDR: the secondary CIDR range for Pods.
When you use these flags, verify that the CIDR ranges don't overlap with the subnet ranges for additional node networks. For example, SERVICE_CIDR=10.65.0.0/19 and POD_CIDR=10.64.0.0/19.
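As an optional check after the cluster is created, you can inspect the cluster's IP allocation policy to confirm the Pod and Service ranges that were applied:
# Print the cluster's IP allocation policy, including the Pod and Service CIDR ranges.
gcloud container clusters describe CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION --project=PROJECT_ID \
    --format="yaml(ipAllocationPolicy)"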
Create GKE network objects
Configure the VPC network by using GKE network parameter sets.
Apply the GKENetworkParamSet and Network objects:
kubectl apply -f - <<EOF
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-0
spec:
  vpc: RDMA_NETWORK_PREFIX-net
  vpcSubnet: RDMA_NETWORK_PREFIX-sub-0
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-0
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-0
EOF
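Optionally, verify that the objects were created and accepted. The exact status fields in the output can vary by GKE version:
# List the custom network objects by their full resource names to avoid
# ambiguity with other resources named "network".
kubectl get gkenetworkparamsets.networking.gke.io rdma-0 -o yaml
kubectl get networks.networking.gke.io rdma-0 -o yaml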
Create an H4D node pool
Create a node pool that uses H4D and connects to the RDMA network. You can use reservation-bound H4D nodes and compact placement. Or, you can use H4D nodes provisioned with flex-start. Select the tab that corresponds to your consumption option:
Reservation-bound
Create a resource policy for compact placement. Compact placement optimizes performance for tightly coupled HPC workloads that run across multiple nodes by ensuring that nodes are physically located close to each other within a zone.
Run the following command:
gcloud compute resource-policies create group-placement POLICY_NAME \
    --region REGION --collocation collocated
Replace the following values:
- POLICY_NAME: the name of the resource policy (for example, h4d-compact).
- REGION: the region of your cluster.
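Optionally, confirm that the placement policy was created before you reference it from the node pool:
# Show the compact placement policy that you just created.
gcloud compute resource-policies describe POLICY_NAME \
    --region=REGION --project=PROJECT_ID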
Create a node pool that uses H4D and connects to the RDMA network:
gcloud container node-pools create NODE_POOL_NAME --project PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION --cluster CLUSTER_NAME --num-nodes=NODE_COUNT \
    --node-locations=COMPUTE_ZONE \
    --machine-type H4D_MACHINE_TYPE \
    --additional-node-network network=RDMA_NETWORK_PREFIX-net,subnetwork=RDMA_NETWORK_PREFIX-sub-0 \
    --placement-policy POLICY_NAME \
    --max-surge-upgrade 0 \
    --max-unavailable-upgrade MAX_UNAVAILABLE
Replace MAX_UNAVAILABLE with the maximum number of nodes that can be unavailable at the same time during a node pool upgrade. For compact placement, we recommend fast, no-surge upgrades to optimize the likelihood of finding colocated nodes during upgrades.
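After the node pool is created, you can optionally confirm that the H4D nodes registered with the cluster. GKE sets the cloud.google.com/gke-nodepool label on every node:
# List only the nodes that belong to the H4D node pool.
kubectl get nodes -l cloud.google.com/gke-nodepool=NODE_POOL_NAME -o wide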
Flex-start
Create a node pool that uses H4D nodes provisioned with flex-start, and connects to the RDMA network:
gcloud container node-pools create NODE_POOL_NAME --project PROJECT_ID \
--location=CONTROL_PLANE_LOCATION --cluster CLUSTER_NAME \
--node-locations=COMPUTE_ZONE \
--machine-type H4D_MACHINE_TYPE \
--additional-node-network network=RDMA_NETWORK_PREFIX-net,subnetwork=RDMA_NETWORK_PREFIX-sub-0 \
--flex-start --enable-autoscaling --reservation-affinity=none \
--min-nodes=0 --max-nodes=MAX_NODES --num-nodes=0
Replace MAX_NODES with the maximum number of nodes to which the node pool can automatically scale, per zone.
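Because this node pool starts with zero nodes and has autoscaling enabled, nodes are provisioned only when Pods request them. After your workloads are scheduled, you can optionally check the provisioned nodes in the same way as for a reservation-bound node pool:
# Nodes appear here only after the autoscaler has provisioned capacity for pending Pods.
kubectl get nodes -l cloud.google.com/gke-nodepool=NODE_POOL_NAME -o wide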
Prepare your Docker image
Prepare your image by using the following example Dockerfile:
FROM rockylinux:8.9
RUN dnf install https://depot.ciq.com/public/files/gce-accelerator/irdma-kernel-modules-el8-x86_64/irdma-repos.rpm -y
RUN dnf install rdma-core libibverbs-utils librdmacm-utils infiniband-diags perftest -y
Rocky Linux 8 is the recommended container base image with RDMA support; the iRDMA driver might not yet be broadly available in other Linux distributions.
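To use the image in your Pods, build it and push it to a registry that your cluster can pull from. The following sketch assumes Artifact Registry; the repository name rdma-images and the image name rdma-test are hypothetical placeholders:
# Build the image from the example Dockerfile in the current directory.
docker build -t REGION-docker.pkg.dev/PROJECT_ID/rdma-images/rdma-test:latest .

# Push the image so that GKE nodes can pull it.
docker push REGION-docker.pkg.dev/PROJECT_ID/rdma-images/rdma-test:latest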
Configure your manifests for RDMA
Enable RDMA by adding the following annotations to your Pod metadata:
metadata:
  annotations:
    networking.gke.io/default-interface: 'eth0'
    networking.gke.io/interfaces: |
      [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"rdma-0"}
      ]
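For context, the following is a minimal sketch of a complete Pod manifest that uses these annotations. The Pod name, image URI, node selector, and the IPC_LOCK capability are assumptions for illustration; adjust them to your workload and your security policies:
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-server          # hypothetical name
  annotations:
    networking.gke.io/default-interface: 'eth0'
    networking.gke.io/interfaces: |
      [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"rdma-0"}
      ]
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: NODE_POOL_NAME
  containers:
  - name: rdma-test
    image: IMAGE_URI              # the image built from the example Dockerfile
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]         # commonly needed for RDMA memory registration; adjust to your policy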
Test RDMA with rping
Verify RDMA functionality by running rping between a server Pod and a client Pod:
1. On the server Pod, run the rping command:
   rping -s
2. On the client Pod, run the rping command:
   rping -c -C 2 -d -a SERVER_IP
   Replace SERVER_IP with the server Pod's IP address.
If successful, the output resembles the following:
created cm_id 0x5b597bf94800
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x5b597bf94800 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x5b597bf94800 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x5b597bf94fa0
created channel 0x5b597bf96830
created cq 0x5b597bf94ff0
created qp 0x5b597bf96c00
rping_setup_buffers called on cb 0x5b597bf8c820
allocated & registered buffers...
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x5b597bf94800 (parent)
ESTABLISHED
rdma_connect successful
RDMA addr 5b597bf8cd80 rkey dadac8c4 len 64
send completion
recv completion
RDMA addr 5b597bf8cff0 rkey 86ef015f len 64
send completion
recv completion
RDMA addr 5b597bf8cd80 rkey dadac8c4 len 64
send completion
recv completion
RDMA addr 5b597bf8cff0 rkey 86ef015f len 64
send completion
recv completion
rping_free_buffers called on cb 0x5b597bf8c820
destroy cm_id 0x5b597bf94800
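If rping fails, a quick check is to confirm that an RDMA device is visible inside the Pod. The ibv_devices tool is installed by the libibverbs-utils package from the example Dockerfile; the Pod name here is a hypothetical example:
# List the RDMA-capable devices that are visible inside the Pod.
kubectl exec rdma-test-server -- ibv_devices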
What's next
- Learn more about high performance computing.
- Some HPC workloads require a Message Passing Interface (MPI) to run tightly coupled, multi-node workloads with RDMA. For more information about setting up MPI in your cluster for your H4D nodes, see Run MPI Workloads on GKE H4D.