Version 1.5. This version is no longer supported. For more information see the version support policy.

Scalability

This page describes best practices for creating, configuring, and operating GKE on-prem clusters to accommodate workloads that approach Kubernetes scability limits.

Scalability limits

Take the following limits into account when designing your applications on GKE on-prem:

Each admin cluster supports up to 20 user clusters, including both high availability (HA) and non-HA user clusters.
Each user cluster supports up to:
- 250 nodes
- 7500 Pods
- 500 LoadBalancer Services in bundled load balancing mode (Seesaw), or 250 LoadBalancer Services in integrated load balancing mode (F5).
For each node, you can create a maximum of 110 Pods (each Pod can consist of 1-2 containers). This includes Pods running add-on system services.

Understanding limits

Since GKE on-prem is a complex system with a large integration surface, cluster scalability involves many interrelated dimensions. For example, GKE on-prem can scale through the number of nodes, Pods, or Services. Stretching more than one dimension at the time can cause problems even in smaller clusters. For example, scheduling 110 Pods per node in a 250 node cluster can overstretch the number of Pods, Pods per node, and nodes.

See Kubernetes Scalability thresholds for more details.

Scalability limits are also sensitive to the vSphere configuration and hardware your cluster is running on. These limits are verified in an environment that is likely different than yours. Therefore, you may not reproduce the exact numbers when the underlying environment is the limiting factor.

Preparing to scale

As you prepare to scale, consider the requirements and limitations for vSphere infrastructure, Kubernetes networking, GKE Hub, and Cloud Logging and Cloud Monitoring.

vSphere infrastructure

This section discusses scalability considerations for CPU, memory, storage, disk and network I/O requirements, as well as Node IP addresses.

CPU, memory, and storage requirements

The following requirements apply to control plane VMs:

The admin cluster control plane and add-on nodes can support up to 20 user clusters, including both HA and non-HA user clusters. Therefore, no tuning is required on the admin cluster.
The default user cluster control-plane VM configuration (4 CPUs, 8GB memory, and 40GB storage) is the minimum setup required to run up to 250 nodes in a user cluster.

See the CPU, RAM, and storage requirements for each individual VM.

Disk and network I/O requirements

Data-intensive workloads and certain control plane components are sensitive to disk and network I/O latency. For example, 500 sequential IOPS (for example, a typical local SSD or a high performance virtualized block device) is typically needed for etcd's performance and stability in a cluster with dozens of nodes and thousands of Pods.

Node IP address

Each GKE on-prem node requires one DHCP or statically-assigned IP address.

For example, 308 IP addresses are needed in a setup with one non-HA user cluster with 50 nodes and one HA user cluster with 250 nodes. The following table shows the breakdown of the IP addresses:

Node type	Number of IP addresses
Admin cluster control-plane VM	1
Admin cluster addon node VMs	3
User cluster 1 (non-HA) control-plane VM	1
User cluster 1 node VMs	50
User cluster 2 (HA) control-plane VMs	3
User cluster 2 node VMs	250
Total	308

Kubernetes networking

This section discusses scalability considerations for the Pod CIDR block and Kubernetes Services.

Pod CIDR block

The Pod CIDR block is the CIDR block for all Pods in a user cluster. From this range, smaller /24 blocks are assigned to each node. If you need an N node cluster, you must ensure this block is large enough to support N /24 blocks.

The following table describes the maximum number of nodes supported by different Pod CIDR block sizes:

Pod CIDR block size	Max number of nodes supported
/19	32
/18	64
/17	128
/16	256

The default Pod CIDR block is 192.168.0.0/16, which supports 256 nodes. The default Pod CIDR block allows you to create a cluster with 250 nodes, which is the maximum number of nodes GKE on-prem supports in a user cluster.

Kubernetes Services

This section discusses scalability considerations for the Service CIDR block and Load Balancer.

Service CIDR block

The Service CIDR block is the CIDR block for all Services in a user cluster. The Services discussed in this section refer to the Kubernetes Services with type LoadBalancer.

The following table describes the maximum number of Services supported by different Service CIDR block sizes:

Service CIDR block size	Max number of Services supported
/20	4,096
/19	8,192
/18	16,384

The default value is 10.96.0.0/12, which supports 1,048,576 services. The default Service CIDR block allows you to create a cluster with 500 Services, which is the maximum number of Services GKE on-prem supports in a user cluster.

Load Balancer

There is a limit on the number of nodes in your cluster, and on the number of Services that you can configure on your load balancer.

For bundled load balancing (Seesaw), there is also a limit on the number of health checks. The number of health checks depends on the number of nodes and the number of traffic local Services. A traffic local Service is a Service that has its externalTrafficPolicy set to Local.

The following table describes the maximum number of Services, nodes, and health checks for Bundled load balancing (Seesaw) and Integrated load balancing (F5):

	Bundled load balancing (Seesaw)	Integrated load balancing (F5)
Max Services	500	250 ²
Max nodes	250	250 ²
Max health checks	N + (L * N) <= 10K, where N is the number of nodes, and L is the number of traffic local services ¹	N/A ²

¹ For example, suppose you have 100 nodes and 99 traffic local Services. The number of health checks is 100 + (99 * 100) = 10,000, which is within the 10K limit.

² Consult F5 for more information. This number is affected by factors such as your F5 hardware model number, virtual instance CPU/memory, and licenses.

GKE Hub

By default, you can register a maximum of 15 user clusters. To register more clusters in GKE Hub, you can submit a request to increase your quota in the Google Cloud console.

Cloud Logging and Cloud Monitoring

Cloud Logging and Cloud Monitoring help you track your resources.

Running many nodes in a user cluster

The CPU and memory usage of the in-cluster agents deployed in a user cluster scale in terms of the number of nodes and Pods in a user cluster.

Cloud logging and monitoring components such as prometheus-server, stackdriver-prometheus-sidecar, and stackdriver-log-aggregator have different CPU and memory resource usage based on the number of nodes and the number of Pods. Before you scale up your cluster, set the resources request and limit according to the estimated average usage of these components. The following table shows estimates for the average amount of usage for each component:

Number of nodes	Container name	Estimated CPU usage		Estimated memory usage
Number of nodes	Container name	0 pods/node	30 pods/node	0 pods/node	30 pods/node
3 to 50	stackdriver-log-aggregator	150m	170m	1.6G	1.7G
	prometheus-server	100m	390m	650M	1.3G
	stackdriver-prometheus-sidecar	100m	340m	1.5G	1.6G
51 to 100	stackdriver-log-aggregator	220m	1100m	1.6G	1.8G
	prometheus-server	160m	500m	1.8G	5.5G
	stackdriver-prometheus-sidecar	200m	500m	1.9G	5.7G
101 to 250	stackdriver-log-aggregator	450m	1800m	1.7G	1.9G
	prometheus-server	400m	2500m	6.5G	16G
	stackdriver-prometheus-sidecar	400m	1300m	7.5G	12G

Make sure you have nodes big enough to schedule the Cloud Logging and Cloud Monitoring components. One way to do this is to create a small cluster first, edit the Cloud Logging and Cloud Monitoring component resources according to the table above, create a node pool to accommodate the components, and then gradually scale up the cluster to a larger size.

You can choose to maintain a node pool just big enough for the monitoring and logging components to prevent other pods from being scheduled to the node pool. To do this, you must add the following taints to the node pool:

taints:
  - effect: NoSchedule
    key: node-role.gke.io/observability

This prevents other components from being scheduled on the node pool and prevents user workloads from being evicted due to the monitoring components' resource consumption.

Running many user clusters in an admin cluster

The CPU and memory usage of the logging and monitoring components deployed in an admin cluster scale according to the number of user clusters.

The following table describes the amount of admin cluster node CPU and memory required to run a large number of user clusters:

Number of user clusters	Admin cluster node CPU	Admin cluster node memory
0 to 10	4 CPUs	16GB
11 to 20	4 CPUs	32GB

For example, if there are 2 admin cluster nodes and each has 4 CPUs and 16GB memory, you can run 0 to 10 user clusters. To create more than 10 user clusters, you must first resize the admin cluster node memory from 16GB to 32GB.

To change the admin cluster node's memory, edit the MachineDeployment configuration:

Run the following command:
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG edit machinedeployment gke-admin-node
```
where ADMIN_CLUSTER_KUBECONFIG is the path of the kubeconfig file for your admin cluster.
Change the spec.template.spec.providerSpec.value.machineVariables.memory field to 32768.
Save the edit. The admin cluster nodes are recreated with 32GB memory.

Auto-scaling system components

GKE on-prem automatically scales the system components in the cluster according to the number of nodes without any need for you to change configurations. You can use the information in this section for resource planning.

GKE on-prem automatically performs vertical scaling by scaling the CPU and memory requests/limits of the following system components using addon-resizer:

kube-state-metrics is a Deployment running on cluster worker nodes that listens to the Kubernetes API server and generates metrics about the state of the objects. The CPU and memory request and limits scale based on the number of nodes.

The following table describes the resource requests/limits set by the system, given the number of nodes in a cluster:

Number of nodes Approximate¹ CPU request/limit (milli) Approximate¹ memory request/limit (Mi)

3 to 5 105 110

6 to 250 100 + num_nodes 100 + (2 * num_nodes)

¹ There is a margin of +-5% to reduce the number of component restarts during scaling.

For example, in a cluster with 50 nodes, the CPU request/limit are set to 150m/150m and the memory request/limit are set to 200Mi/200Mi. In a cluster with 250 nodes, the CPU request/limit are set to 350m/350m and the memory request/limit are set to 600Mi.

Number of nodes	Approximate¹ CPU request/limit (milli)	Approximate¹ memory request/limit (Mi)
3 to 5	105	110
6 to 250	100 + num_nodes	100 + (2 * num_nodes)

metrics-server is a deployment running on cluster worker nodes that is used by Kubernetes built-in autoscaling pipelines.

Expand to see resource requests/limits set by the system, given the number of nodes in a cluster.

Number of nodes	Approximate¹ CPU request/limit (milli)	Approximate¹ memory request/limit (Mi)
3 to 5	43	5
6 to 7	44	63
8 to 11	46	79
12 to 16r	48	99
17 to 25	53	135
26 to 37	59	183
38 to 56	68	259
57 to 85	83	375
86 to 128	104	547
129 to 192	136	803
193 to 250	184	1187

¹There is a margin of +-5% to reduce the number of component restarts during scaling.

GKE on-prem automatically performs horizontal scaling by scaling the number of replicas of the following system components:
- kube-dns is the DNS solution used for service discovery in GKE on-prem. It runs as a Deployment on user cluster worker nodes. GKE on-prem automatically scales the number of replicas according to the number of nodes and CPU cores in the cluster. With each addition/deletion of 16 nodes or 256 cores, 1 replica is increased/decreased. If you have a cluster of N nodes and C cores, you can expect max(N/16, C/256) replicas. Note that kube-dns has been updated since GKE on-prem 1.4 to support 1500 concurrent requests per second.
- calico-typha is a component for supporting Pod networking in GKE on-prem. It runs as a Deployment on user cluster worker nodes. GKE on-prem automatically scales the number of replicas based on the number of nodes in the cluster. There is 1 replica of calico-typha in a cluster with less than 200 nodes, and 2 replicas in a cluster with 200 or more nodes.
- ingress-gateway/istio-pilot are the components for supporting cluster ingress, and they run as a Deployment on user cluster worker nodes. Depending on the amount of traffic ingress-gateway is handling, GKE on-prem uses Horizontal Pod Autoscaler to scale the number of replicas based on their CPU usage, with a minimum of 2 replicas and maximum of 5 replicas.

Best practices

This section describes best practices for scaling your resources.

Scale your cluster in stages

The creation of a Kubernetes node involves cloning the node OS image template into a new disk file, which is an I/O-intensive vSphere operation. There is no I/O isolation between the clone operation and workload I/O operations. If there are too many nodes being created at the same time, the clone operations take a long time to finish and may affect the performance and stability of the cluster and existing workloads.

Ensure that the cluster is scaled in stages depending on your vSphere resources. For example, to resize a cluster from 3 to 250 nodes, consider scaling it in stages from 80 to 160 to 250, which helps reduce the load on vSphere infrastructure.

Optimize etcd disk I/O performance

etcd is a key value store used as Kubernetes' backing store for all cluster data. Its performance and stability are critical to a cluster's health and are sensitive to disk and network I/O latency.

Optimize the I/O performance of the vSphere datastore used for the control-plane VMs by following these recommendations:
- Follow the etcd hardware requirements.
- Use SSD or all flash storage.
Latency of a few hundreds of milliseconds indicates a bottleneck on the disk or network I/O and may result in an unhealthy cluster. Monitor and set alert thresholds for the following etcd I/O latency metrics:
- etcd_disk_backend_commit_duration_seconds
- etcd_disk_wal_fsync_duration_seconds

Optimize node boot disk I/O performance

Pods use ephemeral storage for their internal operations, like saving temporary files. Ephemeral storage is consumed by the container's writable layer, logs directory, and emptyDir volumes. Ephemeral storage comes from the GKE on-prem node's file system, which is backed by the boot disk of the node.

As there is no storage I/O isolation on Kubernetes nodes, applications that consume extremely high I/O on their ephemeral storage can potentially cause node instability by starving system components like Kubelet and the docker daemon of resources.

Ensure that the I/O performance characteristics of the datastore on which the boot disks are provisioned can provide the right performance for the application's use of ephemeral storage and logging traffic.

Monitor physical resource contention

Be aware of vCPU to pCPU ratios and memory overcommitment. A suboptimal ratio or memory contention at the physical hosts may cause VM performance degradation. You should monitor the physical resource utilization at host level and allocate enough resources to run your large clusters.